WO2008023069A2

WO2008023069A2 - Method of processing data

Info

Publication number: WO2008023069A2
Application number: PCT/EP2007/058850
Authority: WO
Inventors: James Moeskops; Gary Lawson
Original assignee: Millnet Financial Limited
Priority date: 2006-08-25
Filing date: 2007-08-24
Publication date: 2008-02-28
Also published as: GB0616880D0

Description

Method of Processing Data

Field of the Invention

The present invention relates to processing data from a plurality of different data sources, in particular, but not exclusively, e-mail data, such as data relating to a set of e-mails having one or more attached files.

Background of the Invention

In the field of law, legal discovery of documents is often an important part of litigation. Discovery involves document review during which all relevant materials are read and analyzed. Reviewing data for discovery purposes typically involves reviewing data contained in a large number of different documents; this data can have a number of different sources, i.e. it can be in a number of different formats; further, it can be stored on different media, or in different electronic folders. Examples of different media for the storage of data include: a computer hard drive, a disc (such as a CD, DVD, HD DVD, or Blu-ray), or a solid state storage device. Further, data may be stored on non-electronic media, such as paper documents. The fact that the data comes from disparate sources means that the process of discovery is often time-consuming and difficult. Often, all electronic files are printed out, and the data is reviewed on paper. However, documents stored in electronic format often contain more information than a printout of the document. For example, word processed document files contain a history of when they were created and modified, and contain data relating to the author of the document.

Documents which are often reviewed during discovery are e-mails. E-mails typically have one or more attached files, known as attachments. The attachments of an e-mail may be files such as word processing files, database files, files containing presentations, media files etc.

E-mails are viewed using an e-mail client software program, such as Microsoft Outlook™ or Lotus Notes™. In order to view the contents of an attachment it is typically necessary to open or to view the e-mail, and then to open or view the attachment, in separate operations. Further, when the e-mail is viewed in an e-mail client, it is not possible to perform operations in relation to both the e-mail and the attachment. For example, an e-mail client may provide functionality to search a group of e-mails for a search term appearing in the e-mail, or to search the group of e-mails for data relating to when the e-mail was sent or received, and for recipient data. However, the attachments are treated differently. Thus, given the hierarchy of e-mails and their respective attachments, it can be difficult to obtain information from a group of e-mails and their respective attachments, particularly if the group of e-mails is large.

Litigation support systems such as CT Summation™, Concordance™ and Ringtail™ are known. Systems such as these allow the processing of data files to extract content data and metadata from the files. This data is then presented in a database, and the content data of the data files can be viewed in a format such as plain text, or can be viewed in their native file format, using the program in which they were created. However, disadvantages associated with these systems are that they need specialist training in order to use them, and they are expensive to install and run.

WO02091701 relates to a system and method for processing messages stored in multiple message stores in order to identify and categorize duplicate and unique messages, and discusses electronic message stores being produced during the discovery phase of litigation to obtain evidence and materials useful to the litigants and the court. WO02091701 discusses the document review process being time consuming and expensive, as each document must ultimately be manually read. WO02091701 further states that pre-analyzing documents to remove duplicative information can save significant time and expense by paring down the review field, particularly when dealing with the large number of individual messages stored in each of the archived electronic message stores for a community of users.

US6725228 relates to a computer-based system which catalogues and retrieves electronic messages saved in a message store. The system automatically organizes each saved message into multiple folders based on the contents and attributes of the message, and implements improved methods for manually organizing messages. The system uses lightweight message shortcuts (e.g. message id.) to display the message in multiple folders simultaneously. The system preferably permits messages to be organized by: 1) basic message and attachment properties, e.g. date, status, attachment type; 2) extended message properties that the user can specify, e.g. keywords; and 3) correspondent or bulk mail sender/recipient, with automatic separation of bulk mail from correspondence.

However, none of the prior art provides a way of presenting e-mails, their associated attachments and other electronic file formats for review in a convenient and cost-effective manner.

Summary of the Invention

In accordance with a first aspect of the present invention there is provided a method of processing a plurality of data items stored on one or more data storage media, each of said data items comprising data, wherein said plurality of data items comprises a first data item having first data, and a second data item having second data, said first data and said second data not including e-mail body content data, in which the method comprises the steps of: processing said first data to create first e-mail body content data derived from at least part of said first data; populating an e-mail item with said created first e-mail body content data to output a first output e-mail item; processing said second data to create second e-mail body content data derived from at least part of said second data; populating an e-mail item with said created second e-mail body content data to output a second output e-mail item; and populating a load file for an e-mail client with said first and second output e-mail items.

The invention in this aspect allows data items to be processed by converting data therein into e-mail body content data, which is populated into an e-mail item and added to a set of similarly created e-mail items which is then converted into a load file for an e-mail client. This means that disparate data originating in disparate formats and sources can be reviewed and manipulated together, all within a single, commonly used and familiar graphical user interface, in the form of the e-mail client graphical user interface.

The data items may comprise metadata and content data, and the created e-mail body content data may comprise at least part of said metadata, and at least part of said content data. Thus, the generated output e-mail item can combine the content data with the metadata from a processed data item.

The processing may comprise creating content for one or more e-mail data fields in addition to said created e-mail body content data, said one or more e-mail data fields being fields whereby an e-mail client is capable of performing a sort operation for said output e-mail item. This allows data derived from different data items to be used to sort the output e-mail items with a single sort function.

The one or more e-mail data fields may include a file path data field, the content being derived from file paths associated with the data items being processed.

This means that the data can be sorted according to the file path data.

In a preferred embodiment, the e-mail client is Microsoft Outlook™. Alternatively it may be a Lotus Notes™ e-mail client.

The method may comprise processing a group of input data items, wherein said method further comprises comparing data items from said group to determine whether a part of a data item in said group is a duplicate of a part of any other data item in said group.

This allows duplicate data items to be processed accordingly. Duplicate items can be deleted to reduce the amount of data to be processed or stored. In a preferred embodiment of the present invention said comparing comprises analysing the original content data of said input data items in said group.

This allows duplicate data items having different file names, but the same content data, to be easily identified and processed accordingly. Duplicate items can be deleted to reduce the amount of data being processed or stored. In a further embodiment the method of comparing comprises analysing metadata of said input data items in said group.

This allows duplicate data items sharing the same metadata to be identified and processed accordingly.
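By way of illustration only, the content-based comparison described above can be sketched as follows. The sketch assumes that comparing a cryptographic digest of the content data is an acceptable way of "analysing the original content data"; the description does not specify hashing, and the names `content_digest` and `find_duplicates` are hypothetical.

```python
import hashlib


def content_digest(content: bytes) -> str:
    """Return a SHA-256 digest of a data item's content data (assumed technique)."""
    return hashlib.sha256(content).hexdigest()


def find_duplicates(items):
    """Map each duplicate item name to the first item seen with the same content.

    `items` is an iterable of (name, content_bytes) pairs. File names are
    ignored in the comparison, so items with different names but identical
    content data are reported as duplicates.
    """
    first_seen = {}   # digest -> name of the first item with that content
    duplicates = {}   # duplicate name -> name of the original item
    for name, content in items:
        digest = content_digest(content)
        if digest in first_seen:
            duplicates[name] = first_seen[digest]
        else:
            first_seen[digest] = name
    return duplicates


group = [
    ("report.doc", b"quarterly figures"),
    ("copy of report.doc", b"quarterly figures"),
    ("minutes.doc", b"board minutes"),
]
duplicate_map = find_duplicates(group)
```

In this sketch the second item is identified as a duplicate of the first despite its different file name; a metadata-based comparison could be built the same way by hashing selected metadata fields instead of the content.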

The first data item may be in a first file format, and said second data item may be in a second file format, said first and second file formats being different from each other.

Thus, data from differing file formats can be processed.

The output e-mail item may be in an e-mail file format, said e-mail file format being different from said first and second file formats. The first data may be in a first data format, and said second data may be in a second data format, said first and second data formats being different from each other. Thus, data in different data formats can be converted to a common data format associated with an e-mail file.

The e-mail body content data may be in an e-mail body content data format, said e-mail body content data format being different from said first and second data formats.

The processing of said first and second data items may be performed in accordance with at least one predetermined rule. The method may comprise the step of identifying a file format for said first and second data item, and wherein a different predetermined rule is selected in accordance with the identified file format.

The method may further comprise generating association data for associating at least one of said data items with at least one other data item. This means that relationships between the output e-mail items can be identified using the association data.

The association data may comprise a hyperlink. This means that associated data can be accessed from within the body content data of the output e-mail item.

The plurality of data items may comprise at least one input e-mail item, which is adapted to be accessed using an e-mail client, said e-mail item comprising original e-mail body content data and wherein said first data item may comprise an attachment file associated with said input e-mail item.

This means that disparate data items including both e-mail items and associated, non-e-mail, attachments can be processed and received together.

The method may comprise the steps of: processing said input e-mail item to create third e-mail body content data derived from at least part of said original e-mail body content data; and populating an e-mail item with said third created e-mail body content data to output a third output e-mail item, wherein said method comprises populating said load file with both said third output e-mail item and said input e-mail item independently. This means the load file contains both the input e-mail item and the third output e-mail item independently, so that both can be reviewed.

The step of populating an e-mail item to output an output e-mail item may comprise populating said input e-mail item with said first created e-mail body content data. Thus, data derived from the attachment of an input e-mail item can be populated into an output e-mail item associated with the input e-mail item.

The processing may comprise processing said first and second data items from a plurality of different data sources and outputting said first and second output e-mail items in a single output e-mail file.

The plurality of different data sources may include one or more data storage media and/or one or more paper copies.

According to a second aspect of the present invention there is provided a method of processing data which is adapted to be accessed using an e-mail client, said data being indicative of an input e-mail item comprising original e-mail body content data and having at least one attachment file associated therewith, wherein said attachment file comprises attachment data, said attachment data not including e-mail body content data, in which the method comprises the steps of: selecting said attachment file; processing said attachment file to create first e-mail body content data derived from at least part of said attachment data; and populating an e-mail item with said created first e-mail body content data to output an output e-mail item, wherein said attachment data comprises metadata and content data, and wherein said created e-mail body content data comprises at least part of said metadata, and at least part of said content data.

Thus, an e-mail item having one or more attachments can be processed so that data and metadata in an attachment is converted to e-mail body content data.

According to a third aspect of the present invention there is provided a method of processing data which is adapted to be accessed using an e-mail client, said data being indicative of an input e-mail item comprising original e-mail body content data and having at least one attachment file associated therewith, wherein said attachment file comprises attachment data, said attachment data not including e-mail body content data, in which the method comprises the steps of: selecting said attachment file; processing said attachment file to create first e-mail body content data derived from at least part of said attachment data; and populating an e-mail item with said created first e-mail body content data to output an output e-mail item, wherein said processing comprises creating content for one or more e-mail data fields in addition to said created e-mail body content data, said one or more e-mail data fields being fields whereby an e-mail client is capable of performing a sort operation for said output e-mail item.

The invention in this aspect allows new content to be created for data fields when processing the attachment file, such that a sort operation can be performed by the e-mail client on the output e-mail item.

In a preferred embodiment of the present invention said one or more e-mail data fields include a file path data field, the content being derived from file paths associated with the data items being processed.

Output e-mail items can thus be sorted and arranged according to the original file paths of the data items. The content for said one or more e-mail data fields preferably includes data extracted from metadata in said data items. This allows the extracted metadata to be used to create the content for the sort fields.

The e-mail data fields may include a date field. This has the advantage of allowing a user to arrange output e-mail items according to date derived from the original data items.

The created e-mail data fields may further include a file size field and/or a document title field. This has the advantage of allowing a user to sort output e-mail items according to file sizes and/or document titles derived from the original data items.

According to a fourth aspect of the present invention there is provided a method of processing data which is adapted to be accessed using an e-mail client, said data being indicative of an input e-mail item comprising original e-mail body content data and having at least one attachment file associated therewith, wherein said attachment file comprises attachment data, said attachment data not including e-mail body content data, in which the method comprises the steps of: selecting said attachment file; processing said attachment file to create first e-mail body content data derived from at least part of said attachment data; populating an e-mail item with said created first e-mail body content data to output a first output e-mail item; processing said input e-mail item to create a second output e-mail item; and creating association data to associate said first output e-mail item with said second output e-mail item.

This allows a user to associate the first output e-mail item with the second output e-mail item, when subsequently reviewing the output e-mail items.

In a preferred embodiment of the present invention said first and/or said second output e-mail item comprises at least part of said association data. This allows a user to identify how a particular output e-mail item is associated with another, from within an e-mail item in which the association data is stored.

The created e-mail body content data may comprise at least part of said association data. Including the association data in the e-mail body allows a user to easily identify the association between different e-mail items, and in a preferred embodiment of the present invention said association data comprises a hyperlink.

Further features and advantages of the invention will become apparent from the following description of preferred embodiments of the invention, given by way of example only, which is made with reference to the accompanying drawings.

Brief Description of the Drawings

Figure 1 is a schematic diagram showing a system for implementing embodiments of the present invention;

Figure 2 is a flow diagram showing the general operation of a method according to embodiments of the present invention;

Figure 3 is a flow diagram showing how a data file is processed according to embodiments of the present invention;

Figure 4 is a flow diagram showing how a database is created and operations relating to the database according to embodiments of the present invention;

Figure 5 is a screen shot showing how fields of a database are selected;

Figure 6 is a screen shot showing a representation of a database in embodiments of the present invention;

Figure 7 is a flow diagram showing processing to determine duplicate e-mails; and

Figure 8 is a flow diagram showing processing to determine duplicate files.

Detailed Description of the Invention

The present invention relates to processing e-mails and other electronic documents in a wide range of formats including, for example, text documents, spreadsheets, databases and image files. These documents may have originated in electronic form or may have originated in paper form, and been converted to electronic form, by scanning and optical character recognition (OCR), for example.

E-mails are typically viewed, created and edited on an e-mail client. An e-mail client typically provides a user interface. E-mails typically comprise a number of fields: a body, a header, and an attachment field. The body of an e-mail typically comprises body content data, which is entered into the body field by a user. The body content data may comprise text, images or other data. The header field comprises data relating to one or more recipients of the e-mail, either in a "To" field, a "Cc" field or a "Bcc" field. Optionally, the header of the e-mail may also comprise data relating to the sender of the e-mail. The attachment field comprises one or more files which may be attached to the e-mail. The header of the e-mail usually also includes a subject line.

How the e-mail body content data is stored in an e-mail before it is transmitted depends on the particular e-mail client which is being used. For example in Microsoft Outlook™ the e-mails may be stored collectively in a PST file or individually in an MSG file.

An e-mail or other electronic file created within an e-mail client is in simplest terms a record containing a certain collection of fields which may or may not be necessarily populated as a result of the initial creation of such record in the original e-mail client.

A PST file is essentially a database, which has a number of fields relating to different e-mails, each having a number of fields which comprise the header and a body field. An MSG file comprises a header field and a body field. Where the term "e-mail body content data" is used herein it is intended to mean data present in, or to be entered into, the body field of an e-mail item, whether or not the e-mail item is intended to be transmitted. Similarly, other electronic files, such as Microsoft Word™, PDF™ files, etc. have fields for content data, and further comprise fields for metadata. The metadata may relate to when the file was created, the size of the file, the author of the file, the format of the file, for example.

Figure 1 is a schematic diagram showing a system 1 for implementing embodiments of the present invention. The system 1 comprises a first apparatus 2a, and a second apparatus 2b. The first apparatus comprises a display device 3a, a data entry device 4a, an input/output unit 5a, a processing system 6a and a storage system 7a. The first apparatus 2a may be configured as a client terminal, or server, for example. Similarly, the second apparatus 2b comprises a display device 3b, a data entry device 4b, an input/output unit 5b, a processing system 6b and a storage system 7b. The second apparatus 2b may also be configured as a client terminal, or server, for example. The first and second apparatus 2a, 2b can communicate over a network 8, such as the Internet, as shown schematically in Figure 1. Further, the first and second apparatus 2a, 2b may be connected via a third-party server (not shown in Figure 1).

When e-mails are actually sent they are typically transmitted over the Internet according to the simple mail transfer protocol (SMTP). In SMTP a client (sender), such as the first apparatus 2a shown in Figure 1, communicates with a recipient, such as the second apparatus 2b shown in Figure 1 using commands to determine the location of the recipient specified in the header of the e-mail. Once a connection has been established the client transmits the header information from the e-mail followed by a blank line, followed by the body of the e-mail; i.e. the body content data which has been entered into the body field by the user. The header information transmitted comprises the e-mail address of the sending party. The e-mail body content data is transmitted under the command DATA according to the RFC 822 message format protocol in the SMTP protocol. The body content data is sent as lines of NVT ASCII.
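The header, blank line and body layout described above can be illustrated with a minimal sketch using the Python standard library `email` package. The addresses, subject and body text are placeholder assumptions, and the library also adds MIME headers that the description above does not discuss.

```python
from email.message import EmailMessage


def build_message(sender, recipient, subject, body):
    """Build a simple text message with a header section and a body."""
    msg = EmailMessage()
    msg["From"] = sender        # sender data in the header
    msg["To"] = recipient       # recipient data in the header
    msg["Subject"] = subject    # subject line in the header
    msg.set_content(body)       # body content data
    return msg


msg = build_message(
    "sender@example.com",            # placeholder address (assumption)
    "recipient@example.com",         # placeholder address (assumption)
    "Disclosure bundle",
    "Please find the documents listed below.",
)

# In the serialised form the header lines come first, then one blank line,
# then the body, which is the layout transmitted after the DATA command.
raw = msg.as_string()
header_block, _, body_block = raw.partition("\n\n")
```

Here `header_block` holds the header fields and `body_block` holds the body content data, separated by the single blank line.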

An overview of embodiments of the present invention will now be given, with reference to a specific example relating to e-mails.

Figure 2 is a flow diagram showing generally how e-mails are processed according to embodiments of the present invention. The processing of e-mails may be performed on a local client terminal, such as the first apparatus 2a, shown in Figure 1, configured with software according to embodiments of the present invention. Alternatively the processing of e-mails may be performed on a remote server. In step S1 the process polls for the receipt of a trigger, indicating that the processing of e-mails should be initiated. The trigger may comprise data relating to a command to begin processing, for example. In step S2 the process selects an e-mail item to be input.

In the context of the present invention an "e-mail item" comprises data which can be read using an e-mail client. An e-mail item may be an e-mail message which is intended to be or has been transmitted or received. Further, an e-mail item may comprise a data record, resembling an e-mail message when read by an e-mail client, but which is not intended to be transmitted.

In step S3 attachments of the selected input e-mail item are selected, and in step S4 the file format of the attachment is identified. In step S5 the attachment is processed according to a rule for the identified file format. In step S6 the process polls for any further attachments. If further attachments are found steps S3 to S5 are repeated for the further attachments. When no further attachments are found the process outputs an output e-mail item in step S7. The process then polls for further input e-mail items in step S8, and steps S2 to S7 are repeated for the further input e-mail items.

In the case where the input item is a data item (i.e. any electronic document, which may include an e-mail item) the process shown in Figure 2 would differ in that the data item would be selected, the file format would be identified, the data item would be processed according to the rule for the identified file format, and an e-mail item would be output.
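The loop of steps S2 to S8 can be sketched as follows. This is a simplified illustration under stated assumptions: the file format is identified from the file name extension, each attachment yields its own output e-mail item alongside one for the input e-mail, and all class and function names (`InputEmail`, `RULES`, `process_all` and so on) are hypothetical.

```python
from dataclasses import dataclass, field
from pathlib import PurePath


@dataclass
class Attachment:
    filename: str
    content: str


@dataclass
class InputEmail:
    subject: str
    body: str
    attachments: list = field(default_factory=list)


@dataclass
class OutputEmail:
    subject: str
    body: str


def rule_plain_text(attachment):
    """Rule for text files: the content is used as body content directly."""
    return attachment.content


def rule_unsupported(attachment):
    """Fallback rule: no body text is extracted for unknown formats."""
    return ""


# One predetermined rule per identified file format (assumed mapping).
RULES = {".txt": rule_plain_text}


def identify_format(attachment):
    """Step S4: identify the file format (here, by extension only)."""
    return PurePath(attachment.filename).suffix.lower()


def process_input_email(item):
    """Steps S3 to S7 for one input e-mail item."""
    outputs = [OutputEmail(item.subject, item.body)]
    for attachment in item.attachments:                  # S3, repeated via S6
        file_format = identify_format(attachment)        # S4
        rule = RULES.get(file_format, rule_unsupported)  # select the rule
        outputs.append(OutputEmail(attachment.filename, rule(attachment)))  # S5
    return outputs                                       # S7


def process_all(items):
    """Steps S2 and S8: repeat for every input e-mail item."""
    outputs = []
    for item in items:
        outputs.extend(process_input_email(item))
    return outputs


mail = InputEmail(
    "Q3 figures",
    "See attached.",
    [Attachment("notes.txt", "Draft notes"), Attachment("scan.bin", "\x00\x01")],
)
output_items = process_all([mail])
```

Keeping the rules in a mapping keyed by file format means that support for a further format can be added by registering one more rule, without changing the loop.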

Thus, in the process of the present invention, data items having disparate sources are processed and are all converted into e-mail items. This means that data from the data items can be stored in a single database, and the data from respective data items can be saved in the same format. Furthermore, in the case where the input data item is an e-mail item any attachments of the e-mail item can also be processed, so that the attachment can be stored in the same format as the input e-mail item. Further, the attachment is stored at the same level as the e-mail item, so that the attachment does not need to be accessed through the e-mail item.

Figure 3 is a flow diagram showing the processing of a file, such as an attachment shown in step S5 in Figure 2, in more detail. If the file is a file type supported by the software installed on the user terminal, in step S10 the process opens the file in a program according to a rule for the identified file format (for example, determined in step S4 for an e-mail attachment). An attachment which is anything other than another e-mail will usually be in a format which cannot be read or opened by an e-mail client. Then, in step S11 at least part of the content of the file is extracted from the file. A discussion of how text in the file is extracted is given below, in the section "Text Extraction".

In step S12 metadata is extracted from the file. A more detailed discussion of the extraction of metadata from various file types is given below in the section "Metadata Extraction". In step S13 association data is generated. The metadata and/or the association data may be used in the output e-mail item with the e-mail body content data derived from the file content data. The metadata may comprise, for example, information about the file format of the file, the date the file was created, the size of the file, a filename and path indicative of where the file is stored, and the type of content of the file (text, image, media etc). In the case where the file type is not supported, metadata may be extracted from the file without opening the file.

The association data may comprise data indicative of the relationship between different input data items. For example, between an attachment file and an input e-mail item, or between two files which were in the same container file (such as a Zip™ file). For example, the association data may comprise an identification number relating to an e-mail item where the e-mail body content data derived from the attachment content data can be found. Further the association data may comprise an identification number relating to an e-mail item where the original e-mail body content data from the input e-mail can be found. Furthermore, the association data may comprise a hyperlink, so that an input data item can be accessed from the e-mail body content data of the output e-mail. The e-mail items comprising the respective e-mail body content data may be populated with at least part of the association data, so that the relationship between the input e-mail and the original attachment can be seen.

In step S14 at least part of the extracted file content data is converted to e-mail body content data. This conversion may be done by extracting the file content data in a format associated with the file

At least part of the content of an e-mail, an attachment file, a loose file or paper file which has been scanned and subject to optical character recognition (OCR) is extracted, and converted into e-mail body content data, which is used to populate an e-mail body, to create an output e-mail item. This means that the file content from different files is in a uniform format, associated with an e-mail item so that it is viewable in an e-mail client in a consistent way. The output e-mail item comprising the e-mail body content data derived from the attachment content data may be a new e-mail item created by the process. Alternatively, in the case where the input data item is an e-mail item the input e-mail may be populated with the e-mail body content data derived from the attachment data. This is discussed in more detail below.

When a data item is processed in the manner described above the body content field of a corresponding output e-mail is populated with at least one of the following:

(i) A link to the native file of the data item;

(ii) Text derived from text in the file content data;

(iii) Metadata extracted from the data item;

(iv) Generated data;

(v) A link to a parent item (i.e. an item through which the data item being processed can be accessed);

(vi) A link to a child item (i.e. an item which can be accessed through the data item being processed).

In the case of a data item which does not contain text the output e-mail item will not contain item (ii). The generated data, mentioned in point (iv) may comprise a document ID, parent ID, child ID, for example. This is generated by the system. The association data, discussed above may comprise items (i), (iv), (v), (vi).
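A minimal sketch of how items (i) to (vi) might be assembled into body content is given below. The layout of the body, the ID format and the use of plain path strings as links are assumptions made for illustration; the function name `build_body` is hypothetical.

```python
def build_body(doc_id, native_path, metadata, text=None,
               parent_id=None, child_ids=()):
    """Assemble e-mail body content from the items listed above.

    doc_id      generated document ID, item (iv)
    native_path link to the native file, item (i), shown here as a path
    metadata    dict of extracted metadata, item (iii)
    text        extracted text, item (ii); omitted if the item has no text
    parent_id   link to a parent item, item (v)
    child_ids   links to child items, item (vi)
    """
    lines = [f"Document ID: {doc_id}", f"Native file: {native_path}"]
    if parent_id is not None:
        lines.append(f"Parent ID: {parent_id}")
    for child_id in child_ids:
        lines.append(f"Child ID: {child_id}")
    for key in sorted(metadata):
        lines.append(f"{key}: {metadata[key]}")
    if text:
        lines.append("")          # blank line before the extracted text
        lines.append(text)
    return "\n".join(lines)


# An attachment with extracted text, associated with its parent e-mail.
attachment_body = build_body(
    "DOC-0002",
    "attachments/budget.xls",
    {"Author": "A. Smith", "Size": "2048"},
    text="Budget for 2006",
    parent_id="DOC-0001",
)

# An image file: no text, so item (ii) is simply absent.
image_body = build_body("DOC-0003", "attachments/logo.png", {"Size": "512"})
```

Because the parent and child IDs appear in the body itself, the relationship between an attachment and its e-mail remains visible and searchable from within the e-mail client.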

The following table shows some examples of different file groups, which may be attached to e-mail items, together with examples of the program with which they can be opened in step S10.

Some files may be password protected, so that a password is needed in order to open them. In order to prevent password protected files from being excluded from the processing it is possible to load a text file containing a list of passwords which have been used for the files. In this case, if the process determines that a password is needed to open a file, the passwords from the list can be tried in turn to open the file.
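The password list handling can be sketched as follows. The sketch assumes one password per line in the text file and an opener that raises `PermissionError` when a password is rejected; `open_with_passwords` and `fake_open` are hypothetical names, and `fake_open` merely stands in for whichever program opens the file.

```python
def load_passwords(text):
    """Parse the contents of a password list, one password per line (assumed)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def open_with_passwords(open_file, passwords):
    """Try each password in turn; return the opened content, or None."""
    for password in passwords:
        try:
            return open_file(password)
        except PermissionError:
            continue  # wrong password, try the next one in the list
    return None       # no listed password opened the file


def fake_open(password):
    """Stand-in opener for illustration: accepts a single fixed password."""
    if password != "s3cret":
        raise PermissionError("wrong password")
    return "decrypted content"


passwords = load_passwords("letmein\n\ns3cret\n")
result = open_with_passwords(fake_open, passwords)
```

A file for which no listed password succeeds returns `None`, so it can still be recorded (for example, with its metadata only) rather than silently dropped.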

Text Extraction

The rules for extracting text from an attachment file in step S11 will now be described for various file formats. Each of the methods described below is carried out by a processing system of a user terminal, such as the processing system 6a of the first apparatus 2a, shown in Figure 1.

Database Type File

Text is extracted from a database file (for example a Microsoft Access™ database file) by a software process using the following method:

1. Extract information about forms from all forms in the database file

2. Extract information about reports from all reports in the database

3. Extract information about data access pages from all data access pages in the database

4. Extract information about queries from all queries in the database

5. Extract information and the content of the tables

6. Merge all the text together

7. The text is extracted as UNICODE
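Steps 4 to 7 above can be illustrated with a sketch that uses SQLite as a stand-in, since reading a Microsoft Access™ file requires software not assumed here. SQLite views are used as a rough analogue of queries; forms, reports and data access pages have no SQLite equivalent and are not covered. The function name `extract_database_text` and the sample data are hypothetical.

```python
import sqlite3


def extract_database_text(conn):
    """Merge table contents and view (query) definitions into one string."""
    parts = []
    objects = conn.execute(
        "SELECT type, name, sql FROM sqlite_master "
        "WHERE type IN ('table', 'view') ORDER BY type, name"
    ).fetchall()
    for obj_type, name, sql in objects:
        parts.append(f"{obj_type}: {name}")
        if obj_type == "view":
            parts.append(sql)                 # information about the query
            continue
        rows = conn.execute(f'SELECT * FROM "{name}"')
        parts.append("\t".join(col[0] for col in rows.description))
        for row in rows:                      # content of the table
            parts.append("\t".join("" if v is None else str(v) for v in row))
    return "\n".join(parts)                   # Python str is Unicode text


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (id INTEGER, client TEXT)")
conn.execute("INSERT INTO invoices VALUES (1, 'Acme')")
conn.execute("CREATE VIEW acme AS SELECT * FROM invoices WHERE client = 'Acme'")
database_text = extract_database_text(conn)
```

The merged string can then be used as the text portion of the e-mail body content data for the database file.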

Spreadsheet Type File

Text from a spreadsheet file (such as a Microsoft Excel™ file) is built using the following method:

1. Loop through all the worksheets in a workbook

2. Extract the header information and worksheet name

3. Extract the body of the document using the inbuilt SaveAs function. This function converts all currency and dates to US regional settings

4. Extract all the text from all comments and textboxes and word art, which contain text

5. Extract the footers

6. Loop through all charts in a workbook and repeat steps 2-5

7. Write all above information for the whole workbook into one text file

8. The text is extracted as UNICODE

Web Page Type File

An example of a web page file is an HTML file. The software process extracts the viewable text from these files.
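One possible way of obtaining the viewable text of an HTML file is sketched below using Python's standard `html.parser` module. The class and function names are illustrative, and treating the contents of `script` and `style` elements as non-viewable is an assumption of this sketch rather than something the description specifies.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    HIDDEN = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._hidden_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN and self._hidden_depth:
            self._hidden_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside hidden elements.
        if not self._hidden_depth and data.strip():
            self.chunks.append(data.strip())


def extract_viewable_text(html: str) -> str:
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```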

Project Manager Type File

An example of this type of file is a Microsoft Project™ file, and the following method relates to a Microsoft Project™ file. Text is built from such a file by a software process using the following method:

1. Retrieve any field that is a date, text or an ID data type from the following tables:
   a. Project Table (contains details of the Projects)
   b. Task Table (contains details of the Tasks)
   c. Resources Table
   d. Assignments Table
   e. Calendars Table
   f. Custom Fields (Definition) Table
   g. Custom Field Value Lists Table
   h. Custom Outline Code Fields Table

2. Extract the Headers and Footers (this may contain repetition of Task Notes or Resource Notes information as these are stored as RTF).

3. Combine the two texts.

4. The text is extracted as UNICODE.

E-mail Type File

The e-mail body content data from an e-mail type file, such as an Outlook™ file, is converted by a software process which uses an inbuilt SaveAs function to save the e-mail type file into a format where the text can be extracted before being inserted into a new e-mail message as new e-mail body content data.

Presentation Type File

An example of a presentation type file is a Microsoft PowerPoint™ file. Text is built from such a file type by a software process using the following method:

1. Extract Headers and Footers from the Master Slide

2. Extract Text from the Shapes in the Master Slide:
   a. Table: Extract Cell Text
   b. Word Art (Text Effect)
   c. Group: Extract Text from all the Shapes in the Group
   d. Diagram: Extract Text from all the Nodes in the Diagram
   e. PlaceHolder: Can be Group, Diagram or TextBox. Extract Text from either of these three options
   f. Text Frame

3. Loop through the Slides

4. Extract Headers and Footers

5. Extract Text from the Shapes:
   a. Table: Extract Cell Text
   b. Word Art (Text Effect)
   c. Group: Extract Text from all the Shapes in the Group
   d. Diagram: Extract Text from all the Nodes in the Diagram
   e. PlaceHolder: Can be Group, Diagram or TextBox. Extract Text from either of these three options
   f. Text Frame

6. Extract the Comments from the Slide

7. Extract the Notes from the Slide

8. Combine the Text

9. The text is extracted as UNICODE.

Word Processing File Types

Examples of word processing file types include Microsoft Word™. Text from such a file is built by a software process using the following method:

1. Loop through the Shapes in the document

2. Extract Text from the Shapes
   a. Text Frame
   b. Diagram: Extract Text from all the Nodes in the Diagram
   c. Word Art
   d. Pictures (In Line Shapes)
   e. Hyper Links

3. Use a built in SaveAs text method to save out text

4. Append the Shape Text to the saved out text

5. Text is extracted as UTF-8 UNICODE

Other Text-Based File Types

Examples of other text-based file types include editable PDF™ files.

Text is only extracted by a software process if the text in the file is searchable. In this case the text is outputted to a UNICODE file.

Drawing File Types

An example of a drawing file type is a Microsoft Visio™ file. Text is built from such a file using the following method:

1. Loop through the pages in the document
   a. Extract the Page Name
   b. Extract the Page Sheet Name (Master Page)
   c. Extract the Hyper Link Text from the Master Page
   d. Extract Text from the Shapes
   e. Extract Text from the Hyperlinks

2. Extract Headers and Footers

3. Combine the Text

4. The text is extracted as UNICODE

In addition to the above, the process can open Zip™ files to extract the files therein, which are then processed by the rules above.
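Opening a Zip™ container so that each contained file can be passed to the per-format rules might look like the sketch below, which uses Python's standard `zipfile` module. The function name is illustrative, and recursing into archives nested inside other archives is an assumption of this sketch; the description only states that the contained files are extracted and processed.

```python
import io
import zipfile
from pathlib import PurePosixPath


def iter_zip_members(data: bytes, container: str = ""):
    """Yield (path, content) for every file in a Zip archive.

    Nested .zip members are opened in turn, so the caller only ever
    receives non-archive files to hand to the extraction rules.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no content
            payload = archive.read(info)
            path = f"{container}/{info.filename}" if container else info.filename
            if PurePosixPath(info.filename).suffix.lower() == ".zip":
                yield from iter_zip_members(payload, path)
            else:
                yield path, payload
```

The path returned for a nested file keeps the name of its container, which would allow the relationship between the container and its contents to be recorded.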

Extraction of Metadata

The extraction of metadata to create various fields used by the process will be described for various types of common metadata found in common file types, with an emphasis on metadata and file types which are most useful when reviewing documents in a legal context.

Where metadata is either missing or not available (for example where a paper document has been scanned) the metadata fields listed below may be manually input into the database.

Date and Time Field(s)

It is important to extract metadata relating to dates and times in a consistent manner. The software process will perform validation on the date fields extracted. It will ignore any date which is the same as the current date. It does this because documents which have no date in a particular field (e.g. Last Saved Date) can assume the current date when opened. Some applications will enter bogus dates when they have none. The process aims to eliminate at least some of these.

As different file types may have different metadata relating to various dates, a logical process is applied to obtain a "master date" field which can then be used as the basis for chronological sorting of the file population. The process will look at each of the following dates and continue down the list until it finds a date which is not empty. When the process finds the first non-empty date it will use this as the "master date" field. Further, times and dates can be processed so that they relate to the same time zone, or to the same date format.

File Type: Date Fields

Outlook™ Mail, Post, Report and Meeting Items: Sent On; Creation; Received; Last Saved Application; Last Saved FS; Deferred Delivery; Expiry; Start; End

Outlook™ Note and Contact Items: Creation; Last Saved Application; Last Saved FS

Outlook™ Appointment, Task and Journal Items: Creation; Last Saved Application; Last Saved FS; Start; End
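The selection of the "master date" can be sketched as follows. The field order is the one given in the table above for Outlook™ mail-type items; the function name, the use of a dictionary of field values and the explicit `today` parameter are assumptions made for illustration.

```python
from datetime import date, datetime
from typing import Mapping, Optional, Sequence

# Order in which the date fields are examined for mail-type items.
MAIL_DATE_FIELDS = (
    "Sent On", "Creation", "Received", "Last Saved Application",
    "Last Saved FS", "Deferred Delivery", "Expiry", "Start", "End",
)


def master_date(
    fields: Mapping[str, Optional[datetime]],
    order: Sequence[str] = MAIL_DATE_FIELDS,
    today: Optional[date] = None,
) -> Optional[datetime]:
    """Return the first usable date when walking down the field list."""
    today = today or date.today()
    for name in order:
        value = fields.get(name)
        if value is None:
            continue  # empty field: keep going down the list
        if value.date() == today:
            continue  # equals the current date, so ignored as unreliable
        return value
    return None  # no usable date; the field could be entered manually
```

Passing `today` explicitly keeps the validation rule reproducible when a job is re-run on a later day.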

Extraction of Sender/Recipient Details

Where the input data item is an e-mail the extracted metadata can include the recipients of the e-mail, from the header of the e-mail.

Further, the software process can extract further data from the e-mail item, which is not present as metadata in the e-mail item, but is present in the e-mail body content data of the e-mail item.

In order to do this the software process may search the body of an e-mail for data such as e-mail addresses (which may show that the e-mail has been forwarded from a certain e-mail address, for example), or dates on which the e-mail was forwarded. This data is then treated as if it were present as metadata in the input e-mail item, and is inserted into the body content field of the output e-mail in the manner described above.
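Searching the body of an e-mail for e-mail addresses could, for instance, be done with a regular expression, as in the sketch below. The pattern is a deliberately simple approximation rather than a full address grammar, and the function name and the lower-casing of results are assumptions of this sketch.

```python
import re

# Simplified address pattern; it will not match every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def addresses_in_body(body: str) -> list:
    """Return distinct addresses found in the body, in order of first appearance."""
    seen = []
    for match in EMAIL_RE.findall(body):
        address = match.lower()
        if address not in seen:
            seen.append(address)
    return seen
```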

Creating a Database

Once a number of data items have been processed in the manner described above to extract metadata and content data, a database is created which includes the created e-mail body content data and created e-mail data fields. Once the database of the output of the processing has been created, various functions can be performed on the database. For example, a process to remove duplicate data can be performed (see later for a detailed discussion of this).

Utilising a Database

Once the database has been created, and optionally had the duplicate items removed, the database can be dispatched to a third party. For example, the database could have been commissioned by a law firm, as part of a discovery process in litigation. Once the database has been generated it can be dispatched to the law firm. This is done by populating one or more output e-mail items with the created e-mail body content data and the e-mail data fields created from the extracted metadata. The output e-mail items are then populated into a load file for an e-mail client. For example, the database may be converted into an Outlook™ PST load file. The client may then be used to view and manipulate the contents of the load file. The discussion below relates to the functionality of the created database, and the contents of any load file, when displayed in a suitable e-mail client. The e-mail items in the load file are not e-mails in a strict sense, since they are not intended to be sent, but they do comprise e-mail items which are intended to be viewed using an e-mail client.

Converting the attachment content data into e-mail body content data and e-mail data fields in the manner described herein for the different file types has the advantage that one or more functions can be performed in relation to the data contained in the input e-mail and the attachment, in a way that is not possible when the attachment is attached to the e-mail. Further, the output e-mail items can be displayed in an e-mail client, which is already present on a user terminal. This has the advantage that the input data items from disparate sources do not need to be viewed on specialised software. Further, since the use of e-mail clients is widespread, it is unlikely that an individual will need special training to be able to review and manipulate the output e-mail items. Examples of such functions will be described with reference to Figure 4.

Figure 4 is a flow diagram showing an example of how the output e-mails from Figure 3 are processed. In step S20 e-mail items for a group of e-mail messages are output. In step S21 a database for the output e-mail items is created. In step S22 the process detects user input indicative of selection of search criteria. If such search criteria are entered, the database contents are displayed according to the search in step S23. For example, the search may relate to keywords in the content of the input e-mail and/or the attachment. Further, the search may relate to the recipient(s) or sender of an e-mail or an attachment, or to when an e-mail and/or attachment was sent, for example.

In step S24 the process detects user input indicative of sort criteria, and in step S25 the contents of the relevant e-mail items in the database are displayed in the order given by the sort criteria, as applied to a selected one of the created e-mail data fields. Thus, providing a database of the output e-mail items allows a search or a sorting function to be performed in respect of the content of the input e-mail items and their respective attachment contents.

The sort criteria may be any that is supported by the e-mail client on which the database is being viewed. Further examples of the created data fields on which sort criteria can be applied are: file path, sent date/time; creation date/time; received date/time; from; to; and subject, and these can be sorted in an ascending or descending fashion.
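The search and sort functions of steps S22 to S25 are performed by the e-mail client itself. Purely to illustrate the effect, the sketch below applies a keyword search and a field-based sort to a small in-memory list of records; the `OutputItem` class and its field names are hypothetical and do not correspond to any particular e-mail client.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class OutputItem:
    document_id: str
    subject: str
    sender: str
    sent: datetime
    body: str


def search(items, keyword):
    """Keep items whose subject or body contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [i for i in items
            if needle in i.subject.lower() or needle in i.body.lower()]


def sort_by(items, field_name, descending=False):
    """Order items by one of the created data fields, ascending by default."""
    return sorted(items, key=lambda i: getattr(i, field_name), reverse=descending)
```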

Instead of an e-mail item the input data, on which the process is performed, may comprise data files. These data files may start out as electronic files, or may be created by scanning in paper documents. The files are processed in a similar way to the way in which attachments are processed, by determining the file type, and processing the file according to the file type. E-mail body content data is derived from at least part of the file content data and metadata, and an e-mail item is populated with this in the manner described above. The files may be stored on a drive on a user terminal. Where the input data is an input e-mail item the input e-mail item may be an Outlook™ MSG item, or contained in an Outlook™ PST file or a Lotus™ NSF file (the latter two of which are each essentially a database of e-mails).

The data which is loaded into the e-mail item from either a file or an attachment may comprise: extracted metadata, coded data, extracted text, data created using optical character recognition (OCR) techniques. Native files (or images from scanned documents) are loaded into a separate folder and are linked to the relevant e-mail item. The output e-mail items may be loaded into Outlook™ via an Outlook™ PST file.

Figure 5 shows a screen shot 10 on a user interface for creating a database from a plurality of output e-mail items (or for creating the fields to be included in a load file). A number of available e-mail data fields, each associated with a particular e-mail item and created during the process of the invention, are listed in a field window 12, which can be selected by the user. As the fields are selected the window 14 showing the selected fields is populated with the names of the fields. Thus, the fields displayed in the database can be tailored to the specifications of a user. Alternatively, the fields which are included in the database may be default fields. A further option in creating the database, which may be presented, is whether the database is to be created with all records, all unique records, or a database with the unique records, together with a file for duplicates. Examples of how the duplicates are detected are discussed below in relation to Figures 7 and 8. Further, the database can be created using only records relating to files having certain formats.

Figure 6 shows a screen shot 20 of a user interface provided by an e-mail client. The database may be stored as a folder, for example an Outlook™ personal folder 22 having user defined sub-folders, for example "Records" 24. The database comprises a plurality of e-mail items 26 each having a number of fields. The created e-mail data fields shown in the example are "Document ID", which gives a numerical identifier of the e-mail item; "Subject"; "File type" giving the original file type of the e-mail item; "Master date" giving the date on which the input e-mail was generated, which date is selected as described above; and "Original File Path" giving details of where the original file (for example the original attachment file) can be found in the data sources.

The document ID is unique to each record and is made up of a three character alphanumeric volume ID, specific to each data source which is processed in a job, and a six digit ID for the document (alternatively, this ID could also be alphanumeric). The input items can be saved, in their native file formats, by the document IDs. This gives a useful way of storing the original files. Thus, if a search reveals that the attachment content data is relevant, the original document can be referred to for further information.

Further, the "Original File Path" field may also be included in the database. This may be important because it will enable data to be reviewed or disregarded on the basis of the original file path. For example, if it is decided that a certain data storage medium is not relevant to a discovery process, all of the data from this medium can be found using a search option, and the data can be deleted, for example. In the case where an input item does not have a file name (for example, this may be the case with a top level Outlook™ or Lotus™ item), a file name will be created. The file name may be made up of the subject and any other metadata.

The e-mail client of this example includes a viewer window 28, where the body content data of the e-mail items can be viewed.
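A document ID of the kind described, made up of a three character alphanumeric volume ID and a six digit document number, could be generated as in the sketch below. Joining the two parts without a separator and starting the numbering at 1 are assumptions of this sketch, and the function name is illustrative.

```python
import re
from itertools import count

VOLUME_RE = re.compile(r"[A-Za-z0-9]{3}")


def document_ids(volume_id: str, start: int = 1):
    """Yield successive document IDs for one data source (volume)."""
    if not VOLUME_RE.fullmatch(volume_id):
        raise ValueError("volume ID must be three alphanumeric characters")
    for number in count(start):
        if number > 999_999:
            raise OverflowError("six digit document number exhausted")
        # Zero-pad so that every ID has the same length and sorts correctly.
        yield f"{volume_id}{number:06d}"
```

Because the IDs are fixed-length and zero-padded, ordering records by document ID as text gives the same order as ordering them by document number within a volume.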

Further, software according to embodiments of the present invention may allow an e-mail client to display a further window 30, giving user selectable options relating to the viewing of e-mails. For example, the options include "Date View", which allows the date on which the e-mail item was processed to be viewed; "Names View", which displays all of the names fields relating to the e-mail records, such as "author", "sender", "recipient", "copyee" etc.; "Relationships View", which shows the e-mail items arranged according to the relationships between input e-mail and attachment; "Standard View", which is the view shown in the screen shot of Figure 6, and shows the e-mail items having the different fields; "Messages" (a default view inherent within the e-mail client); "Messages with autopreview" (a default view inherent within the e-mail client); "Last Seven Days" (a default view), which shows the e-mail items created in the last 7 days; and "Unread Messages in this Folder", which displays only the e-mail items which have not been viewed.

The association data, created when an e-mail item having an attachment is processed, may contain the document ID of any associated records having grandchild, child, parent and grandparent relationships. For example, an e-mail item may have two attachments, one being a data file, and the other being another e-mail item with a data file attached to it. The record relating to the e-mail item will then have the document IDs of the child files (i.e. the attachment and the attached e-mail item) and the document ID of the grandchild file (i.e. the attachment of the attached e-mail item). This association data may also be used where the input item is a file within a container file. For example, a file may be in a Zip™ file.
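The parent, child and grandchild relationships in the example above can be modelled with a simple record structure, as sketched below. The class, function and field names are hypothetical; the description does not specify how the association data is stored.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AssociationRecord:
    document_id: str
    parent_id: Optional[str] = None
    children: List[str] = field(default_factory=list)


def link(records: Dict[str, AssociationRecord], parent_id: str, child_id: str) -> None:
    """Record that child_id was attached to (or contained in) parent_id."""
    records[child_id].parent_id = parent_id
    records[parent_id].children.append(child_id)


def grandchildren(records, document_id):
    """Document IDs two levels below the given record."""
    return [g for c in records[document_id].children for g in records[c].children]


def grandparent(records, document_id):
    """Document ID two levels above the given record, if any."""
    parent = records[document_id].parent_id
    return records[parent].parent_id if parent else None
```

With an e-mail `E1` carrying a data file `F1` and an attached e-mail `E2`, which in turn carries a data file `F2`, the record for `E1` lists `F1` and `E2` as children and `F2` as a grandchild.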

Figure 7 is a flow diagram showing a process which identifies duplicate e-mails, and deals with the e-mails accordingly; this is referred to as "deduplication". This process can be carried out either on the input e-mails before they are processed as shown in Figure 2, or after they have been processed, so that the data is present in a database. In the example shown in Figure 7 the e-mail items have already been processed as shown in Figure 2.

The criteria for deduplication of e-mails may include any combination of the metadata fields such as "sent on" date, subject, attachment names etc. A hashing algorithm, known as MD5, can be used for computing a condensed representation of a message or a data file. The condensed representation is of fixed length (32 characters) and is sufficiently unique to enable a duplicate to be identified using a match with the MD5 value.
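A minimal sketch of MD5-based duplicate detection, using Python's standard `hashlib` module, is given below. The 32 character value is the hexadecimal form of the 128-bit digest. The `find_duplicates` helper and its convention of treating the first item seen as the original are assumptions made for illustration.

```python
import hashlib


def md5_digest(data: bytes) -> str:
    """Return the 32 character hexadecimal MD5 value of the data."""
    return hashlib.md5(data).hexdigest()


def find_duplicates(items):
    """Group later items under the first item that has the same MD5 value.

    items: iterable of (document_id, content) pairs.
    Returns {original_document_id: [duplicate_document_ids]}.
    """
    first_seen = {}
    duplicates = {}
    for document_id, data in items:
        digest = md5_digest(data)
        if digest in first_seen:
            duplicates.setdefault(first_seen[digest], []).append(document_id)
        else:
            first_seen[digest] = document_id
    return duplicates
```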

Turning back to Figure 7, in step S30 all e-mail records are selected and ordered by document ID. In step S31 the next unchecked record is selected, and in step S32 it is determined whether a record is found. In the case where a record is not found (i.e. there are no more unchecked records) the process ends. If a record is found, in step S33 the process looks for duplicate records of the original e-mail (e.g. the e-mail item which was created first). If no duplicate is found in step S34, the original e-mail is marked as checked in step S35. On the other hand, if a duplicate is found in step S34, it is determined in step S36 whether the Bcc field of the original e-mail is empty. If it is empty, in step S37 it is determined whether any of the duplicate e-mails have data in the Bcc field. If the result of this is "yes" the process proceeds to step S31, so that the next unchecked e-mail record is selected. However, if the result of step S37 is "no", in step S38 the duplicate e-mails are marked as checked and as duplicates of the original e-mail.

Turning back to step S36, if it is determined that the Bcc field of the original e-mail is not empty the process moves to step S38. In step S39 the parent IDs of duplicate e-mails are obtained and added to the original e-mail's parent ID. Step S40 determines whether a "copy attachments" option is selected; if it is not, the process goes to step S35, in which the original e-mail is marked as checked. If the "copy attachments" option is selected, in step S41 the nth level child records of the duplicate e-mails are found. If any of the child files have duplicates (which is determined in step S42) these records are marked as child records of the original e-mail. The process then goes to step S35, where the original e-mail is marked as checked. On the other hand, if none of the child files has duplicates, all the child records are marked as duplicates of the original e-mail in step S44, and in step S45 the child records are marked as checked. The process then goes to step S35, where the original e-mail is marked as checked.
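The Bcc test of steps S36 to S38 and the merging of parent IDs in step S39 can be sketched as two small functions. This is a simplified illustration only: it assumes that an original e-mail whose Bcc field holds data goes straight to step S38, it omits the "copy attachments" branch of steps S40 to S45, and the function names are hypothetical.

```python
from typing import Iterable, List


def should_mark_duplicates(original_bcc: str, duplicate_bccs: Iterable[str]) -> bool:
    """Decide whether the duplicates may be marked against the original."""
    if original_bcc.strip():
        return True  # S36: original has Bcc data, so proceed to S38
    # S37: original Bcc is empty; hold back if any duplicate carries Bcc data,
    # since marking it as a duplicate would lose that information.
    return not any(bcc.strip() for bcc in duplicate_bccs)


def merge_parent_ids(original_parents: List[str],
                     duplicate_parents: Iterable[List[str]]) -> List[str]:
    """S39: add the duplicates' parent IDs to those of the original."""
    merged = list(original_parents)
    for parents in duplicate_parents:
        for parent_id in parents:
            if parent_id not in merged:
                merged.append(parent_id)
    return merged
```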

In this way, duplicates in the processed e-mail items can be determined, and the original e-mail is marked with the document ID of the duplicate records.

Figure 8 is a flow diagram for identifying and processing duplicates of files which have been processed so that they can be processed using an e-mail client. In step S50 all file records are selected. In step S51 the next unchecked record is selected, and in step S52 it is determined whether such a record is found. If a record is not found the process ends. If a record is found, in step S53 duplicate records are looked for, and step S54 determines whether these duplicates are found. If no duplicates are found the process proceeds to step S55, in which the main file is marked as checked, and the process goes to step S51, where the next unchecked record is selected. On the other hand, if duplicates are found in step S54 the process goes to step S56, in which it is determined whether the only criterion for deduplication is the MD5 algorithm, i.e. that deduplication is not based on any other metadata, for example. If it is not, the process goes to step S59, where the parent IDs of duplicate records are obtained, and these are added to the parent IDs of the main record. If the criterion is only MD5, whether the main file is excluded is determined in step S57. If it is not, the process passes to step S59. However, if the main file is excluded the process passes to step S58, where it is determined whether any of the duplicate files are not excluded. If the result is "yes" the process passes to step S51; if the result is "no" the process passes to step S59. From step S59 duplicate files are marked as checked, and as duplicates of the main file, in step S60. The main file is then marked as checked in step S61, and the process passes to step S51.

Thus, records created from files can be processed to determine whether they are duplicates of other files, and the records can be updated accordingly.

Other Arrangements

In a further arrangement of the present invention the processing of an input e-mail item can be conducted by an e-mail client in response to receiving a trigger. The trigger may comprise the receipt of an incoming e-mail, so that any attachments of the incoming e-mail are processed so that data contained therein can be reviewed using the e-mail client. Alternatively, the trigger may comprise receiving data indicative of a user selection of a change of mode. In this way a user can choose to display the contents of a folder in an e-mail client in a conventional format, or in a format in which data contained in attachments can be viewed as separate independent items from the e-mail item to which the attachment was attached.

Further, it is envisaged that the output e-mail item which is populated with data derived from an attachment can be the input e-mail item. Thus, the output e-mail item created is essentially the input e-mail item with added data derived from the attachment. This arrangement is advantageous since any data from the attachment can be viewed together with the e-mail body content data of the input e-mail.

Whilst the e-mail client described in the above embodiment is a Microsoft Outlook™ e-mail client, it should be understood that the invention is applicable to other e-mail clients such as a Lotus Notes™ e-mail client.

The above embodiments are to be understood as illustrative examples of the invention. It is to be understood that any feature described in relation to any one embodiment may be used alone, or in combination with other features described, and may also be used in combination with one or more features of any other of the embodiments, or any combination of any other of the embodiments. Furthermore, equivalents and modifications not described above may also be employed without departing from the scope of the invention, which is defined in the accompanying claims.

Claims

1. A method of processing a plurality of data items stored on one or more data storage media, each of said data items comprising data, wherein said plurality of data items comprises a first data item having first data, and a second data item having second data, said first data and said second data not including e-mail body content data, in which the method comprises the steps of: processing said first data to create first e-mail body content data derived from at least part of said first data; populating an e-mail item with said created first e-mail body content data to output a first output e-mail item; processing said second data to create second e-mail body content data derived from at least part of said second data; populating an e-mail item with said created second e-mail body content data to output a second output e-mail item; and populating a load file for an e-mail client with said first and second output e-mail items.

2. A method according to claim 1, wherein said data items comprise metadata and content data, and wherein said created e-mail body content data comprises at least part of said metadata, and at least part of said content data.

3. A method according to claim 1 or 2, wherein said processing comprises creating content for one or more e-mail data fields in addition to said created e-mail body content data, said one or more e-mail data fields being fields whereby an e-mail client is capable of performing a sort operation for said output e-mail item.

4. A method according to claim 3, wherein said one or more e-mail data fields include a file path data field, the content being derived from file paths associated with the data items being processed.

5. A method according to claim 3 or 4 wherein said one or more e-mail data fields include a date field.

6. A method according to any preceding claim, wherein said first data item is in a first file format, and said second data item is in a second file format, said first and second file formats being different from each other.

7. A method according to claim 6, wherein said output e-mail item is in an e-mail file format, said e-mail file format being different from said first and second file formats.

8. A method according to any preceding claim, wherein said first data is in a first data format, and said second data is in a second data format, said first and second data formats being different from each other.

9. A method according to claim 8, wherein said e-mail body content data is in an e-mail body content data format, said e-mail body content data format being different from said first and second data formats.

10. A method according to any preceding claim, wherein said load file is an Outlook™ PST load file.

11. A method according to claim 10, wherein the method comprises the step of identifying a file format for said first and second data item, and wherein a different predetermined rule is selected in accordance with the identified file format.

12. A method according to any preceding claim, wherein the method further comprises generating association data for associating at least one of said data items with at least one other data item.

13. A method according to claim 12, wherein said association data comprises a hyperlink.

14. A method according to any preceding claim, wherein said plurality of data items comprises at least one input e-mail item, which is adapted to be accessed using an e-mail client, said e-mail item comprising original e-mail body content data and wherein said first data item comprises an attachment file associated with said input e-mail item.

15. A method according to claim 14, wherein said method comprises the step of: processing said input e-mail item to create third e-mail body content data derived from at least part of said original e-mail body content data; and populating an e-mail item with said third created e-mail body content data to output a third output e-mail item, and wherein said method comprises populating said load file with both said third output e-mail item and said input e-mail item independently.

16. A method according to any preceding claim, wherein said method includes processing a group of input data items, and wherein said method further comprises comparing data items from said group to determine whether a part of a data item in said group is a duplicate of a part of any other data item in said group.

17. A method according to claim 16, wherein said comparing comprises analysing the original content data of said input data items in said group.

18. A method according to claim 16 or 17, wherein said comparing comprises analysing metadata of said input data items in said group.

19. A method according to any preceding claim, wherein said processing comprises retrieving said first and second data items from a plurality of different data sources.

20. A method according to claim 19, wherein said plurality of different data sources include one or more data storage media and/or one or more paper copies.

21. A method of processing data which is adapted to be accessed using an e-mail client, said data being indicative of an input e-mail item comprising original e-mail body content data and having at least one attachment file associated therewith, wherein said attachment file comprises attachment data, said attachment data not including e-mail body content data, in which the method comprises the steps of: selecting said attachment file; processing said attachment file to create first e-mail body content data derived from at least part of said attachment data; and populating an e-mail item with said created first e-mail body content data to output an output e-mail item, wherein said attachment data comprises metadata and content data, and wherein said created e-mail body content data comprises at least part of said metadata, and at least part of said content data.

22. A method of processing data which is adapted to be accessed using an e-mail client, said data being indicative of an input e-mail item comprising original e-mail body content data and having at least one attachment file associated therewith, wherein said attachment file comprises attachment data, said attachment data not including e-mail body content data, in which the method comprises the steps of: selecting said attachment file; processing said attachment file to create first e-mail body content data derived from at least part of said attachment data; and populating an e-mail item with said created first e-mail body content data to output an output e-mail item, wherein said processing comprises creating content for one or more e-mail data fields in addition to said created e-mail body content data, said one or more e-mail data fields being fields whereby an e-mail client is capable of performing a sort operation for said output e-mail item.

23. A method according to claim 22, wherein said one or more e-mail data fields include a file path data field, the content being derived from file paths associated with the data items being processed.

24. A method according to any of claims 22 and 23, wherein said content for one or more e-mail data fields include data extracted from metadata in said data items.

25. A method according to any of claims 22 to 24, wherein said one or more e-mail data fields include a date field.

26. A method according to any of claims 22 to 25, wherein said one or more e-mail data fields include a file size field and/or a document title field.

27. A method of processing data which is adapted to be accessed using an e-mail client, said data being indicative of an input e-mail item comprising original e-mail body content data and having at least one attachment file associated therewith, wherein said attachment file comprises attachment data, said attachment data not including e-mail body content data, in which the method comprises the steps of: selecting said attachment file; processing said attachment file to create first e-mail body content data derived from at least part of said attachment data; populating an e-mail item with said created first e-mail body content data to output a first output e-mail item; processing said input e-mail item to create a second output e-mail item; and creating association data to associate said first output e-mail item with said second output e-mail item.

28. A method according to claim 27, wherein said second output e-mail item is created by creating second e-mail body content data derived from at least part of said original e-mail body content data; and populating an e-mail item with said second created e-mail body content data.

29. A method according to claim 27 or 28, wherein said first and/or said second output e-mail item comprises at least part of said association data.

30. A method according to claim 29, wherein said created e-mail body content data comprises at least part of said association data.

31. A method according to any of claims 27 to 30, wherein said association data comprises a hyperlink.

32. Computer software for performing the method of any preceding claim.

33. Apparatus arranged to perform the method of any of claims 1 to 31.