GB2495698A

GB2495698A - Securely storing data file modifications where the file comprises one or more file elements that collectively represent the whole content of the file

Info

Publication number: GB2495698A
Application number: GB1117275.6A
Authority: GB
Inventors: Hitesh Tewari; Desmond Ennis; Karl Reid
Original assignee: College of the Holy and Undivided Trinity of Queen Elizabeth near Dublin
Current assignee: College of the Holy and Undivided Trinity of Queen Elizabeth near Dublin
Priority date: 2011-10-06
Filing date: 2011-10-06
Publication date: 2013-04-24
Also published as: GB201117275D0

Abstract

A data file, stored on a data file store 232, comprises one or more data file elements. The elements collectively represent the whole content of the data file and each element is associated with a unique set of metadata. A target data file element is identified in which a modification is to be recorded; either an existing element, identified based on its associated unique set of metadata, or a new element. The modification is recorded in the target element and it is determined whether the target element is to be encrypted. The target element is encrypted, if required, using its associated unique set of metadata and a secret key 260. The target element is transmitted to the data file store for storage thereon. The data file may be manipulated by remote clients 210, 220 and transmitted to server 230, with which store 232 may be associated, over network 201.

Description

Intellectual Property Office

Application No. GB1117275.6. Date: 9 February 2012.

The following terms are registered trademarks and should be read as such wherever they occur in this document: "DropBox", "Windows Live", "Microsoft Office", and "Wi-Fi".

Intellectual Property Office is an operating name of the Patent Office. www.ipo.gov.uk

Title

SYSTEM AND APPARATUS FOR SECURELY STORING DATA

Technical Field

This invention relates to the field of networked computing, and in particular to the field of data file security.

Background

Cloud computing is a term used generally to refer to the use, by a client device, of remote computational infrastructure over a network with a view to meeting the data storage and/or computer processing requirements of the client device. This computational infrastructure may be consolidated in a single location as part of a set of substantial computing resources that are made available as required to disparate clients. These computing resources are considered to reside "in the cloud", i.e. somewhere over the Internet or across a proprietary network. Advances in network communication technologies have resulted in faster communication speed across networks, and this is one of the factors behind a recent increase in the uptake and adoption of cloud computing technologies.

Cloud computing can be advantageous because it substantially reduces the requirement for data processing and data storage resources locally at a client device, and facilitates scalability of projects that require computing resources by enabling easy allocation of additional resources through the cloud as and when required. Cloud computing is also advantageous because it allows multiple users from disparate locales to work collectively via their client devices, using the cloud infrastructure as a hub. Many cloud-based service providers, such as DropBox™, Windows Live SkyDrive and Box.net™, now offer online file hosting, providing data storage facilities to users over the Internet. Many of these file-hosting services may be accessed through web-based interfaces via a client device web browser, thus ensuring easy accessibility.


However, a concern with cloud-based systems is that handling and storing of data over the Internet is inherently less secure than handling and storing data on a secure local system. The degree of security with which data are stored by a remote third party (such as a file hosting service provider) is a factor beyond the control of the proprietor of the data. Furthermore, unintended recipients may intercept data passed between a client and a cloud system over a network. As a result, it is desirable that cloud computing achieves a level of security that more closely approaches that offered by local data handling and storage means.

One way of achieving an increased level of data security in cloud-based file hosting systems is to ensure that all data transmitted by the client to the cloud-based file host, and all data stored on the host, are in encrypted form, with the client retaining the encryption/decryption key(s). In such an arrangement, when data are retrieved from the host, they will remain encrypted until decrypted at a client device. Accordingly, when it comes to using the cloud to securely manipulate data, the following simple paradigm may be followed: 1) select the data to be stored on the cloud; 2) encrypt the selected data using a locally-stored encryption key; and 3) upload the encrypted data to the cloud. It will be readily understood that the same principle may be applied in reverse for retrieving data securely stored on the cloud, namely: 1) download the encrypted data from the cloud; 2) select the locally-stored decryption key; and 3) decrypt the data. There are a number of software packages that currently offer such encryption functionality, such as EncFS and TrueCrypt, and it will be understood that there are a variety of ciphers that may be used in the context of the above paradigm.

Cloud computing systems can, however, provide more than mere file hosting functionality. It is also possible, for example, for cloud computing systems to provide client devices with the functionality of productivity suites (also known as office software suites). The capabilities of such suites include, but are not limited to, functionality for producing documents in word-processor, spreadsheet, and slideshow formats. Such functionality may be provided in the form of browser-based client applications such as Google® Docs or Microsoft® Office Web Apps. These browser-based client applications may be in the form of client-side scripting hosted on a website designed to offer this functionality. It will be understood that in this context, "client-side scripting" is computer program code that may be hosted on a server for retrieval by a client device for execution locally on the client device. Accordingly, the web browser may access such client applications dynamically by navigating to said website and retrieving the client-side scripting, for subsequent local execution on the client device. Alternatively, cloud-based storage functionality may be integrated into otherwise locally stored productivity suites, such as is the case with the Microsoft® Office 2010 suite. A particular advantage of providing productivity suite functionality via the cloud is that multiple users can work on a document concurrently, thereby drawing remote workstations into a collaborative environment.

The provision of productivity suite capabilities constitutes a more dynamic functionality of the cloud, when compared to its use as a mere data storage facility.

This is because the data transfer between the cloud and the client device(s) in such a scenario can be more fluid, potentially non-linear, and may involve transfer of data from a plurality of client devices. When the cloud is used as a data storage facility, typically the upload of a data file is a single operation that takes place after the data file is complete. When the cloud is used to provide productivity suite capabilities, uploads typically continuously occur as the data file is modified. Moreover, these uploads may emanate from multiple discrete sources in concurrent or near-concurrent fashion. It will thus be appreciated that the encryption paradigm for use when simply storing files on the cloud as defined above is not appropriate when the data content of a file may be dynamically changing. This is because changes may be continually made to a data file, and these changes may be emanating from multiple different sources.

Accordingly, there is a need for a method of providing on-the-fly encryption of data that is manipulated via multiuser online document editing applications in an efficient and provably secure manner. It is desirable that the method allow for simultaneous, collision-free, multi-user collaboration and preferably comprises a self-contained solution with no need for ancillary files to support the encryption/decryption process.

Summary of the Invention

One element of the invention provides for a method of recording a modification made to the content of a data file stored on a data file store, wherein the data file comprises one or more data file elements, the one or more data file elements collectively representing the whole content of the data file, and wherein each data file element is associated with a unique set of metadata, the method comprising: identifying a target data file element in which the modification is to be recorded, wherein the target data file element is either: an existing data file element, identified based on its associated unique set of metadata; or a new data file element; recording the modification in the identified target data file element; determining whether the target data file element is to be encrypted; if it is determined that the target data file element is to be encrypted, encrypting the target data file element using its associated unique set of metadata and a secret key; and transmitting the target data file element to the data file store for storage thereon.

This allows for on-the-fly encryption of data in an efficient and provably secure manner.
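The claimed sequence of steps (identify, record, decide, encrypt, transmit) can be sketched roughly as follows. This is an illustrative outline only: the `Element` class, the `record_modification` function and its `must_encrypt`, `encrypt` and `transmit` callables are hypothetical names introduced here, and the cipher and the transport are deliberately left to the caller.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional
import secrets


@dataclass
class Element:
    """One data file element plus its unique metadata (hypothetical layout)."""
    content: bytes = b""
    metadata: dict = field(default_factory=dict)
    encrypted: bool = False


def record_modification(
    elements: dict,                 # element id -> Element (local, plaintext view)
    target_id: Optional[str],       # None means "record in a new element"
    new_content: bytes,
    must_encrypt: Callable[[Element], bool],
    encrypt: Callable[[bytes, dict, bytes], bytes],
    transmit: Callable[[str, Element], None],
    secret_key: bytes,
) -> str:
    # 1. Identify the target: an existing element, or a newly created one.
    if target_id is None:
        target_id = secrets.token_hex(8)
        elements[target_id] = Element(metadata={"id": target_id})
    target = elements[target_id]

    # 2. Record the modification in the target element.
    target.content = new_content

    # 3. Decide whether to encrypt; 4. encrypt with metadata + secret key.
    outgoing = Element(target.content, dict(target.metadata))
    if must_encrypt(target):
        outgoing.content = encrypt(target.content, target.metadata, secret_key)
        outgoing.encrypted = True

    # 5. Transmit the (possibly encrypted) element to the data file store.
    transmit(target_id, outgoing)
    return target_id
```

Only the outgoing copy is encrypted in this sketch, so the local view stays editable while the store only ever receives what `transmit` is handed.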

The method may further comprise: if the target data file element is identified as a new data file element, creating the new data file element.

The method may comprise wherein the unique set of metadata comprises at least one timeline-specific value. The timeline-specific value may be a probabilistically unique random or pseudorandom sequence generated at a specific time.

The step of recording may further comprise generating a new unique set of metadata for association with the target element, and the step of transmitting may further comprise transmitting the associated new unique set of metadata for storage. Each data file element may comprise a disjoint, contiguous, variable-length constituent of the content of the data file having a predefined maximum length. The associated unique set of metadata may further comprise the length of the data file element.
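As a rough illustration of elements that are disjoint, contiguous and bounded in length, each carrying fresh metadata, consider the sketch below. The maximum length of 16 bytes, the metadata field names `nonce` and `length`, and the function name are assumptions made for the example; the text does not fix any of them. For simplicity the split is at fixed boundaries, whereas the text allows element lengths to vary up to the maximum.

```python
import secrets

MAX_ELEMENT_LENGTH = 16  # hypothetical predefined maximum, in bytes


def split_into_elements(content: bytes, max_len: int = MAX_ELEMENT_LENGTH) -> list:
    """Split content into disjoint, contiguous elements with fresh metadata."""
    elements = []
    for start in range(0, len(content), max_len):
        chunk = content[start:start + max_len]
        metadata = {
            # Probabilistically unique random value generated at this moment.
            "nonce": secrets.token_hex(16),
            # Element length, later usable to divide the stored file again.
            "length": len(chunk),
        }
        elements.append((metadata, chunk))
    return elements
```

Concatenating the chunks in order reproduces the original content, which is what "collectively representing the whole content" requires.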

The method may further comprise wherein in the event a first modification is to be recorded in an existing data file element, no other modifications may be recorded in the existing data file element until all the steps of the method have been performed with respect to the first modification. This allows for simultaneous collision-free multi-user collaboration.

The method may comprise wherein the data file elements are stored in a chronological history of stored data file elements on the data file store.

The at least one timeline-specific value may be the chronological position of the associated data file element in the chronological history of stored data file elements.

The unique set of metadata may further comprise a user identifier.

The unique set of metadata may further comprise a session identifier, and the unique set of metadata may then further comprise the chronological position of the most recent previous operation element recorded using the session identifier comprised in the unique set of metadata. This may allow for simultaneous, collision-free, multi-user collaboration.

The method may further comprise identifying a new data file element as the target file element; wherein each data file element may record either an insertion of a data string into the content of the data file or a deletion of a data string from the content of the data file; and the determining step may further comprise determining only to encrypt data file elements that record an insertion of a data string.
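The determining step for this operation-based variant can be sketched as below. The `Insertion` and `Deletion` classes and their fields are hypothetical; in particular, modelling a deletion as a position and a length (so that it carries no file content) is an assumption of the sketch, offered as a plausible reason why only insertions would need encrypting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insertion:
    position: int  # offset in the file content where the string is inserted
    text: str      # the inserted data string


@dataclass(frozen=True)
class Deletion:
    position: int  # offset of the first deleted character
    length: int    # number of characters removed (assumed: no content carried)


def should_encrypt(operation) -> bool:
    """Encrypt only elements that record an insertion of a data string."""
    return isinstance(operation, Insertion)
```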

Periodically in the chronological history, one or more snapshot data file elements may be created and transmitted to the data file store, the snapshot data file elements comprising the aggregate of all existing data file elements.

The snapshot data file elements may comprise a first data file element recording the deletion of the entire content of the data file, and a second data file element recording the insertion of the entire content of the data file.
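The snapshot pair described here can be illustrated with a small replay model. The tuple encoding `("insert", pos, text)` and `("delete", pos, length)` and the function names are assumptions for the example; encryption is omitted so that only the history mechanics are shown.

```python
def apply_operation(content: str, op: tuple) -> str:
    """Apply ("insert", pos, text) or ("delete", pos, length) to content."""
    kind, pos, arg = op
    if kind == "insert":
        return content[:pos] + arg + content[pos:]
    if kind == "delete":
        return content[:pos] + content[pos + arg:]
    raise ValueError(f"unknown operation kind: {kind!r}")


def replay(history: list) -> str:
    """Rebuild the content by applying the history in chronological order."""
    content = ""
    for op in history:
        content = apply_operation(content, op)
    return content


def snapshot_elements(content: str) -> list:
    """Delete the entire content, then re-insert the aggregate as one string."""
    return [("delete", 0, len(content)), ("insert", 0, content)]
```

Appending the pair leaves the content unchanged, and a reader who starts from the snapshot's insertion alone recovers the same content without replaying the earlier history.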

The encrypting step may comprise encrypting the data file element using a stream cipher.

The keystream for use in the stream cipher encryption may be generated using a seed string derived from the unique set of metadata associated with the data file element and a secret key.

The keystream may be generated from the seed string by running an iterative block cipher encryption algorithm directly on the seed string.

A message digest may be produced by running a hashing algorithm on the seed string, and the keystream may then be generated from the message digest by running an iterative block cipher encryption algorithm on the message digest.
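A minimal sketch of the seed, digest and keystream chain follows. Two things are assumptions rather than the method itself: the seed is formed by simple concatenation of a canonical JSON encoding of the metadata with the secret key, and SHA-256 over the digest plus a counter is swapped in for the iterative block cipher named in the text, purely so that the example runs on the Python standard library. It is not a vetted cipher construction.

```python
import hashlib
import json


def seed_string(metadata: dict, secret_key: bytes) -> bytes:
    """Derive a seed from the element's unique metadata and the secret key."""
    canonical = json.dumps(metadata, sort_keys=True).encode("utf-8")
    return canonical + secret_key  # assumed combination: plain concatenation


def keystream(seed: bytes, length: int) -> bytes:
    """Expand the seed's message digest into `length` keystream bytes.

    Stand-in: SHA-256(digest || counter) replaces the block cipher here.
    """
    digest = hashlib.sha256(seed).digest()  # the message digest of the seed
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(digest + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def stream_crypt(data: bytes, metadata: dict, secret_key: bytes) -> bytes:
    """XOR data with the keystream; the same call encrypts and decrypts."""
    ks = keystream(seed_string(metadata, secret_key), len(data))
    return bytes(a ^ b for a, b in zip(data, ks))
```

Because the seed depends on the element's unique metadata, two elements encrypted under the same secret key receive different keystreams, which is what makes reuse of a single shared key workable for a stream cipher.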

Periodically in the chronological history, a Message Authentication Code data file element comprising a Message Authentication Code keyed with the secret key may be created and transmitted to the data file store in order to confirm the authenticity of the other data file elements.
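A keyed authentication element of this kind might look like the following sketch, which uses HMAC-SHA-256 from the Python standard library. The choice of HMAC, the element layout (`type`, `covers`, `tag`) and the function names are assumptions; the text only requires a Message Authentication Code keyed with the secret key.

```python
import hashlib
import hmac


def mac_element(preceding_elements: list, secret_key: bytes) -> dict:
    """Build a MAC element covering the elements stored before it."""
    mac = hmac.new(secret_key, digestmod=hashlib.sha256)
    for element in preceding_elements:
        # Length prefix keeps element boundaries unambiguous in the MAC input.
        mac.update(len(element).to_bytes(8, "big"))
        mac.update(element)
    return {"type": "mac", "covers": len(preceding_elements), "tag": mac.hexdigest()}


def verify_history(elements: list, mac: dict, secret_key: bytes) -> bool:
    """Recompute the tag over the covered elements and compare in constant time."""
    expected = mac_element(elements[: mac["covers"]], secret_key)["tag"]
    return hmac.compare_digest(expected, mac["tag"])
```

Since the store never holds the secret key, it cannot forge a matching tag for elements it has altered, reordered or dropped.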

The data file store may be located remotely from where the method is performed, and is accessed over a network, and the network may comprise the Internet.

The modification may be made to the content of the data file via a client application retrieved over the network from a remote server, and executed from within a web browser, and the step of encrypting may be performed via a plug-in embedded within the web browser.

The modification may be made to the content of the data file through the use of software, and the step of encrypting may be performed via an extension to the software or via a separate application that communicates with both the software and the data file store.

The method may further comprise the initial step of determining whether the modification is to be recorded as a plurality of parts in a plurality of data file elements, each part being recorded in a separate corresponding data file element; the steps of identifying, recording and encrypting may be performed for each of the plurality of data file elements and the step of transmitting may comprise transmitting the plurality of data file elements together as a set of data file elements.

Another element of the invention provides for a method of decrypting a data file that has been encrypted in accordance with the method described above wherein a new unique set of associated metadata is generated for each target data element, the method of decryption comprising: retrieving the data file from the data file store, along with the unique sets of metadata associated with each data file element; dividing the data file into the data file elements based on the unique sets of metadata; and decrypting each data file element using the associated unique set of metadata and the secret key.
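The divide-then-decrypt procedure can be sketched as below, assuming that each metadata set carries a `length` field and that the sets are held in file order. The `decrypt_file` name and the pluggable `decrypt` callable are hypothetical; any per-element cipher keyed on metadata and the secret key could be supplied.

```python
from typing import Callable


def decrypt_file(
    blob: bytes,
    metadata_list: list,   # one metadata dict per element, in file order
    secret_key: bytes,
    decrypt: Callable[[bytes, dict, bytes], bytes],
) -> bytes:
    plaintext = bytearray()
    offset = 0
    for metadata in metadata_list:
        # Divide: the stored length marks where this element ends.
        element = blob[offset:offset + metadata["length"]]
        offset += metadata["length"]
        # Decrypt: each element uses its own metadata plus the shared key.
        plaintext += decrypt(element, metadata, secret_key)
    if offset != len(blob):
        raise ValueError("metadata lengths do not account for the whole file")
    return bytes(plaintext)
```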

Another element of the invention provides for a method of decrypting a data file that has been encrypted in accordance with the method described above wherein the data file elements are stored in a chronological history of stored data file elements on the data file store, the method of decrypting comprising: retrieving all the data file elements in the chronological history; constructing a data architecture from the data file elements by applying each data file element to the data architecture in turn, in accordance with their chronological order, wherein the constructed data architecture comprises one or more pieces, each piece referencing at least a portion of a data file element; and decrypting each piece of the data architecture using the referenced portion of the data file element, the unique set of metadata associated with the data file element, and the secret key.
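For the chronological-history variant, one possible data architecture is a list of pieces, each referencing a portion of an insertion element by index, offset and length. The sketch below assumes that structure, along with the dictionary layout of the elements (`op`, `pos`, `data`, `meta`, `length`) and the function names; none of these is taken from the text, which leaves the architecture open at this point.

```python
from typing import Callable


def _split(pieces: list, pos: int) -> int:
    """Ensure a piece boundary exists at content position pos; return its index."""
    start = 0
    for i, (elem, off, length) in enumerate(pieces):
        if pos == start:
            return i
        if pos < start + length:
            cut = pos - start
            pieces[i:i + 1] = [(elem, off, cut), (elem, off + cut, length - cut)]
            return i + 1
        start += length
    if pos == start:
        return len(pieces)
    raise IndexError("position beyond end of content")


def build_pieces(history: list) -> list:
    """Apply each element, in chronological order, to an initially empty list."""
    pieces = []  # each piece: (element_index, offset_in_element, length)
    for index, element in enumerate(history):
        if element["op"] == "insert":
            at = _split(pieces, element["pos"])
            pieces.insert(at, (index, 0, len(element["data"])))
        elif element["op"] == "delete":
            first = _split(pieces, element["pos"])
            last = _split(pieces, element["pos"] + element["length"])
            del pieces[first:last]
    return pieces


def render(history: list, pieces: list, secret_key: bytes,
           decrypt: Callable[[bytes, dict, bytes], bytes]) -> bytes:
    """Decrypt the referenced portion of each piece and concatenate."""
    out = bytearray()
    for index, offset, length in pieces:
        element = history[index]
        clear = decrypt(element["data"], element["meta"], secret_key)
        out += clear[offset:offset + length]
    return bytes(out)
```

The pieces are built from lengths and positions alone, so the structure can be assembled before any decryption takes place; only the portions that survive later deletions are ever decrypted for display. Here the whole element is decrypted and then sliced, which is the simplest correct choice for a sketch.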

Collaborating devices in disparate locales may access the content of the data file concurrently, each device having a separate connection to the data file store.

The method may further comprise the subsequent step of relaying the transmitted target data file element from the data file store to all collaborating devices.

Another element of the invention provides for a computer readable storage medium carrying a computer program stored thereon, said program comprising computer executable instructions adapted to perform any of the methods described above.

An element of the invention provides for a device for recording a modification made to the content of a data file stored on a data file store, wherein the data file comprises one or more data file elements, the one or more data file elements collectively representing the whole content of the data file, and wherein each data file element is associated with a unique set of metadata, the device comprising: means for identifying a target data file element in which the modification is to be recorded, wherein the target data file element is either: an existing data file element, identified based on its associated unique set of metadata; or a new data file element; means for recording the modification in the identified target data file element; means for determining whether the target data file element is to be encrypted; means for encrypting the target data file element using its associated unique set of metadata and a secret key if it is determined that the target data file element is to be encrypted; and means for transmitting the target data file element to the data file store for storage thereon.

The means for identifying may further comprise: means for creating the new data file element if the target data file element is identified as a new data file element.

The unique set of metadata may comprise at least one timeline-specific value.

The timeline-specific value may be a probabilistically unique random or pseudorandom sequence generated at a specific time.

The means for recording may further comprise means for generating a new unique set of metadata for association with the target element, and the means for transmitting may further comprise means for transmitting the associated new unique set of metadata for storage.

Each data file element may comprise a disjoint, contiguous, variable-length constituent of the content of the data file having a predefined maximum length.

The associated unique set of metadata may further comprise the length of the data file element.

In one aspect of the invention, in the event a first modification is to be recorded in an existing data file element, no other modifications may be recorded in the existing data file element until all the steps of the method have been performed with respect to the first modification.

The data file elements may be stored in a chronological history of stored data file elements on the data file store.

The at least one timeline-specific value may be the chronological position of the associated data file element in the chronological history of stored data file elements.

The unique set of metadata may further comprise a user identifier.

The unique set of metadata may further comprise a session identifier.

The unique set of metadata may further comprise the chronological position of the most recent previous operation element recorded using the session identifier comprised in the unique set of metadata.

The means for identifying may further comprise means for identifying a new data file element as the target file element; each data file element may record either an insertion of a data string into the content of the data file or a deletion of a data string from the content of the data file; and the means for determining may further comprise means for determining only to encrypt data file elements that record an insertion of a data string.

The device may further comprise means for creating one or more snapshot data file elements periodically in the chronological history, the snapshot data file elements comprising the aggregate of all existing data file elements.

The snapshot data file elements may comprise a first data file element recording the deletion of the entire content of the data file, and a second data file element recording the insertion of the entire content of the data file.

The means for encrypting may comprise means for encrypting the data file element using a stream cipher.

The keystream for use in the stream cipher encryption may be generated using a seed string derived from the unique set of metadata associated with the data file element and a secret key.

The device may further comprise wherein a message digest may be produced by running a hashing algorithm on the seed string, and wherein a keystream may then be generated from the message digest by running an iterative block cipher encryption algorithm on the message digest.

The device may further comprise means for creating, periodically in the chronological history, a Message Authentication Code data file element comprising a Message Authentication Code keyed with the secret key, and means for transmitting the Message Authentication Code data file element to the data file store in order to confirm the authenticity of the other data file elements.

The device may further comprise wherein the data file store is located remotely from the device and is accessed by the device over a network, and the network may comprise the Internet.

The device may comprise a web browser and a plug-in embedded in the web browser, wherein the modification is made to the content of the data file via a client application retrieved over the network from a remote server, and executed from within the web browser, and wherein the means for encrypting may comprise the plug-in embedded within the web browser.

The device may comprise software and either an extension to the software or a separate application that communicates with both the locally stored software and the data file store, wherein the modification may be made to the content of the data file through use of the software, and wherein the means for encrypting may comprise either the extension to the software or the separate application.

The device may further comprise means for initially determining whether the modification is to be recorded as a plurality of parts in a plurality of data file elements, each part being recorded in a separate corresponding data file element; and the means for transmitting may further comprise means for transmitting the plurality of data file elements together as a set of data file elements.

Another element of the invention provides for a device for decrypting a data file that has been encrypted with the method described above wherein a new unique set of associated metadata is generated for each target data element, wherein the device is connected to the data file store over a network, the device comprising: means for retrieving the data file from the data file store, along with the unique sets of metadata associated with each data file element; means for dividing the data file into the data file elements based on the unique sets of metadata; and means for decrypting each data file element using the associated unique set of metadata and the secret key.

Another element of the invention provides for a device for decrypting a data file that has been encrypted in accordance with the method described above wherein the data file elements are stored in a chronological history of stored data file elements on the data file store, the device comprising: means for retrieving all the data file elements in the chronological history; means for constructing a data architecture from the data file elements by applying each data file element to the data architecture in turn, in accordance with their chronological order, wherein the constructed data architecture comprises one or more pieces, each piece referencing at least a portion of a data file element; and means for decrypting each piece of the data architecture using the referenced portion of the data file element, the unique set of metadata associated with the data file element, and the secret key.

Any of the above devices may be one of a plurality of collaborating devices in disparate locales that may access the content of the data file concurrently, each device having a separate connection to the data file store.

The above device may further comprise means for receiving relayed data from the data file store, the relayed data comprising one or more target data file elements previously transmitted to the data file store by other collaborating devices.

There is also provided a computer program comprising program instructions for causing a computer program to carry out the above method, which may be embodied on a record medium, carrier signal or read-only memory.

Brief Description

The invention will be more clearly understood from the following description of an embodiment thereof, given by way of example only, with reference to the accompanying drawings, in which:

Figure 1 is a schematic of a cloud-based data manipulation system in accordance with the prior art;

Figure 2 is a schematic of a secure cloud-based data manipulation system in accordance with the claimed invention;

Figure 3 is a schematic of the interrelationship between web browser software, a client application and a bespoke plug-in in an embodiment of the invention where data manipulation functionality is comprised in the client application, which is retrieved by the web browser over a network and subsequently run within the web browser, and wherein the bespoke plug-in ensures the manipulated data are encrypted before they are relayed to the cloud-based data manipulation system;

Figure 4 is a schematic of the interrelationship between the productivity/office software and the bespoke extension in an embodiment of the invention where data manipulation functionality is provided by the locally stored productivity/office software, and wherein the bespoke extension ensures the manipulated data are encrypted before they are relayed to the cloud-based data manipulation system;

Figure 5 illustrates the process by which the embodiment of the invention depicted in Figure 3 may record, encrypt and transmit manipulated data;

Figure 6 illustrates the process by which the embodiment of the invention depicted in Figure 4 may record, encrypt and transmit manipulated data;

Figure 7 is a schematic of the universal set of all UTF-8 characters and a number of subsets that exist within this universal set;

Figure 8 illustrates the process by which a data architecture representative of the data file may be constructed in accordance with one embodiment of the invention;

Figure 9 illustrates the process by which the data architecture representative of the data file in accordance with one embodiment of the invention may be updated with an insertion of additional data content;

Figure 10 illustrates the process by which the data architecture representative of the data file in accordance with one embodiment of the invention may be updated with a deletion of existing data content;

Figure 11 illustrates a second process by which the data architecture representative of the data file in accordance with one embodiment of the invention may be updated with a deletion of existing data content;

Figure 12 illustrates a third process by which the data architecture representative of the data file in accordance with one embodiment of the invention may be updated with a deletion of existing data content; and

Figure 13 is a schematic illustrating how data file elements may be encrypted using a keystream cipher, and how the keystream cipher may be generated.

Detailed Description

Figure 1 is a diagram illustrating an architecture for providing a cloud-based data manipulation system 100 that accommodates concurrent data manipulation by multiple users in accordance with the prior art. A network 101 facilitates communication between a remote (i.e. "cloud-based") server 130 acting as a document creation, editing and storage facility and a plurality of client devices 110, 120. It will be appreciated that the network 101 may comprise the Internet, a proprietary network, or a combination of the two. Client devices 110, 120 may be disparately located, and may connect to the network 101 by way of one or more of a variety of technologies, such as Ethernet, DSL, ISDN, Wi-Fi, WiMax, 2G, 3G, LTE, 4G, etc. Client devices 110, 120 may be any of a variety of devices including desktop personal computers, laptops, tablet personal computers, personal digital assistants, mobile phones etc. Server 130 is also connected to network 101, and comprises means for providing data manipulation functionality 131 across the network 101. Server 130 is also associated with means for storing manipulated data 132, and means for storing metadata associated with the stored manipulated data 133. It will be appreciated that alternative arrangements exist to the arrangement of the server 130 illustrated in Fig 1. For example, the server 130, the means for storing manipulated data 132 and the means for storing associated metadata 133 may comprise a single server, with shared storage means.

Client devices 110 and 120 may avail of data manipulation functionality by accessing the means for providing data manipulation functionality 131 located on server 130 over the network 101. The client devices 110 and 120 may access the means for providing data manipulation functionality 131 over the network 101 using respective web browser software 115, 125. The data manipulation functionality may comprise browser-based client applications 112, 122 which may be in the form of client-side scripting. The browsers 115, 125 may dynamically retrieve the respective client applications 112, 122 from the means for providing data manipulation functionality 131, thereby allowing for subsequent local execution of the client applications 112, 122 on respective client devices 110, 120. Examples of such client applications include those provided by the Google Docs or Microsoft Office Web Apps systems.

Alternatively, the means for providing data manipulation functionality 131 may be accessed over the network via bespoke productivity/office software 113, 123 located respectively in clients 110, 120. An example of such productivity/office software is Microsoft Office 2010. Any one of a number of protocols may be used to allow access to this data manipulation functionality. In a preferred embodiment, Hypertext Transfer Protocol Secure (HTTPS) may be used, but it will be readily understood that any request/response transaction protocol may be appropriate for this purpose. It will be appreciated that although a plurality of client devices are depicted, a plurality of client devices are not essential to the functioning of this arrangement.

Figure 2 is a diagram illustrating an architecture for providing a cloud-based data manipulation system 200 that accommodates concurrent data manipulation by multiple users in accordance with an embodiment of the invention. The architecture of the system 200 is analogous to that of system 100 to the extent that labelled parts 201 to 233 of Fig 2 correspond to labelled parts 101 to 133 of Fig 1. It will be appreciated that the variety of embodiments contemplated for the system of Fig 1 are also analogously contemplated for the system of Fig 2 where applicable, and that, as with Figure 1, a plurality of client devices are not essential to the functioning of this arrangement.

The system of figure 2 differs from that of figure 1 in that it ensures only encrypted data are stored on manipulated data storage means 232 associated with server 230.

This is achieved by ensuring that all data manipulated at the clients 210, 220 are encrypted before they are transmitted over the network 201 to the server 230. The encryption may be done in real time, as the data are manipulated, using a shared secret key 260, known only to the users of the client devices 210, 220.

In one embodiment of the invention, data may be manipulated in web browsers 215, 225, using the functionality of respective client applications 212, 222, where the client applications have been previously downloaded from the cloud. The manipulated data are then passed to bespoke plug-ins 279, 289, these plug-ins being embedded respectively in the web browsers 215, 225. The bespoke plug-ins 279, 289 encrypt the manipulated data, and then the encrypted data are passed on to the manipulated data storage means 232 associated with server 230 for storage.

Conversely, when a client device in accordance with this embodiment of the invention retrieves encrypted data from the manipulated data storage means 232 associated with server 230, the data are passed to bespoke plug-ins 279, 289. The plug-ins 279, 289 decrypt the data and then the decrypted data are passed on to respective web browsers 215, 225, where they may be processed by client applications 212, 222 for presentation to the user for subsequent possible manipulation. Decrypted data are only housed locally and temporarily on the client devices, preferably in the cache of the web browsers 215, 225, before subsequent re-encryption and committal to the manipulated data storage means 232 associated with server 230.

In an alternative embodiment of the invention, data may be manipulated in productivity/office software 213, 223. Bespoke extensions to the software 213, 223 may then encrypt the data, and then the encrypted data are passed on to the manipulated data storage means 232 associated with server 230 for storage. The bespoke extension is discussed in greater detail below with reference to Figure 4.

Conversely, when a client device in accordance with this embodiment of the invention retrieves encrypted data from the manipulated data storage means 232 associated with server 230, the data are passed to the bespoke extensions of the software. The extensions decrypt the data and then the decrypted data may be processed by the productivity/office software 213, 223 for presentation to the user for subsequent possible manipulation. In this case, decrypted data are only housed locally and temporarily on the client devices, preferably in the cache of the productivity/office software 213, 223, before subsequent re-encryption and committal to the manipulated data storage means 232 associated with server 230. It will be appreciated that rather than provide the encryption functionality by way of extensions to the productivity/office software, it may alternatively be possible to provide this functionality to productivity/office software by way of a separate application stored on the client device that acts as an intermediary in the communication link between the productivity/office software and the cloud.

Figure 3 depicts the interrelationship between the client application, the web browser software and the bespoke plug-in in the embodiment of the invention in which data manipulation functionality is provided via a client application that may be run from within a web browser. As previously mentioned, examples of client applications that allow data manipulation functionality to be provided in such a way include the client applications made available by the Google® Docs and Microsoft® Office Web Apps systems. These browser-based client applications may be in the form of client-side scripting which may be retrieved from a host website by a web browser for subsequent local execution on the client device. A system 300 is depicted wherein a client device 310 is in contact with a cloud-based data manipulation system 330, over a network 301. Web browser software 315 may be run from within client device 310, and may be used to navigate to a location on the World Wide Web where a client application 312 that offers data manipulation functionality may be accessed and retrieved. The web browser software 315 also houses a bespoke plug-in 379 that is configured to ensure that all manipulated data transmitted from the client device 310 are transmitted in encrypted form. The client application 312 may have a connection 311 to the cloud-based data manipulation system 330, the connection preferably being over a request/response transaction protocol, which in a preferred embodiment is HTTPS. This connection will hereafter be referred to as the "primary channel". The client application 312 may relay/receive the requests/responses through the web browser software 315. In the embodiment of the invention depicted in Fig 3, the bespoke plug-in 379 maintains an independent connection 314 with the cloud-based data manipulation system 330. This connection 314 may also be over a request/response transaction protocol, which in a preferred embodiment is HTTPS.

This connection will hereafter be referred to as the "secondary channel". This second connection is necessary for reasons that will be discussed in more detail below.

Figure 4 depicts the embodiment of the invention in which data manipulation functionality is provided via bespoke functionality built as an extension to the productivity/office software residing on the client device. The interrelationship between the extension and the productivity/office software is shown. As previously mentioned, an example of such a productivity/office software suite is Microsoft Office 2010. A system 400 is depicted wherein a client device 410 is in contact with a cloud-based data manipulation system 430, over a network 401. Client device 410 hosts productivity/office software 452 that provides data manipulation functionality and also hosts a bespoke extension 459 that ensures such data are encrypted before they are relayed to data manipulation system 430. The productivity/office software 452 may have a connection 411 to the cloud-based data manipulation system 430, the connection preferably being over a request/response transaction protocol, which in a preferred embodiment is HTTPS. This connection will hereafter be referred to as the "primary channel", as it performs essentially the same functions as the primary channel described in the embodiment of the invention referenced with respect to Figure 3. In the embodiment of the invention depicted in Fig 4, the bespoke extension 459 maintains an independent connection 414 with the cloud-based data manipulation system 430. This connection 414 may also be over a request/response transaction protocol, which in a preferred embodiment is HTTPS. This connection will hereafter be referred to as the "secondary channel", as it performs essentially the same functions as the secondary channel described in the embodiment of the invention referenced with respect to Figure 3. As with the embodiment of the invention depicted in Figure 3, the second connection in the embodiment of the invention depicted in Figure 4 is necessary for reasons that will be discussed in more detail below.

Figure 5 illustrates how the embodiment of the invention depicted in Figure 3 may ensure that all manipulated data transmitted from the client device 310 to the cloud-based data manipulation system 330 are in encrypted form. The user of client device 310 may be presented with a data file via web browser 315, and may be afforded the ability to manipulate the data in the data file via client application 312. Referring now to Figure 5, when the user manipulates 501 the data in such a data file, it may be regarded as a data manipulation event. The changes to the data file embodied by the data manipulation event are recorded 502 by the client application 312 and encoded as a "mutation". Data manipulation events that may be recorded as mutations include, but are not limited to, discrete data insertion operations (comprising the insertion into the data file content of a contiguous string of data) and discrete data deletion operations (comprising the deletion from the data file content of a contiguous string of data). Data manipulation events that comprise a chronologically successive set of such discrete insertion and deletion operations may also be recorded in a single mutation. The amount of data manipulation that may be recorded in a single mutation is a matter of preference, and it will be appreciated that the client application 312 may therefore be configured to record mutations according to such preferences. The point at which a discrete mutation is recorded may be determined as a function of one or more variables, such as the duration of the data manipulation event to date, the extent of change that has taken place to the data file over the course of the data manipulation event to date, the idle time since the last action by the user and/or the receipt of certain prompts from the cloud based data manipulation system 330. It will thus be appreciated that extensive data manipulation sessions may be recorded as a series of data manipulation events.
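By way of illustration only, the encoding of a data manipulation event as a mutation, and the decision as to when a mutation is recorded, might be sketched as follows. The class names, the function `should_record` and all threshold values are hypothetical and are not taken from the described embodiment; zero-based positions are likewise an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Insert:
    position: int  # where in the data file content the string is inserted (0-based, assumed)
    text: str      # the contiguous string of data being inserted


@dataclass
class Delete:
    position: int  # where in the data file content the deletion starts (0-based, assumed)
    length: int    # number of contiguous characters deleted


@dataclass
class Mutation:
    # A chronologically ordered set of discrete insertion/deletion operations.
    operations: List[Union[Insert, Delete]] = field(default_factory=list)


def should_record(event_seconds: float, changed_chars: int, idle_seconds: float,
                  server_prompt: bool = False, max_seconds: float = 10.0,
                  max_chars: int = 200, max_idle: float = 2.0) -> bool:
    """Decide whether the pending operations should be recorded as a mutation.

    The inputs mirror the variables named in the text (duration of the event,
    extent of change, idle time, prompts from the system); the thresholds are
    illustrative assumptions only.
    """
    return (server_prompt
            or event_seconds >= max_seconds
            or changed_chars >= max_chars
            or idle_seconds >= max_idle)


pending = Mutation([Insert(0, "Hello"), Delete(0, 1)])
```

In such a sketch, a client application could call `should_record` after each user action and, when it returns true, hand the pending `Mutation` on for embedding in a request.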

Once the mutation has been encoded at step 502, the client application 312 may prepare a request for transmitting the mutation to the cloud based data manipulation system 330, and may embed 503 the mutation within the prepared request. The prepared request may, for example, comprise an HTTPS request. The client application 312 may then pass 504 the prepared request to the web browser 315 for transmission.

Prior to transmission of the prepared request from the web browser 315 to the cloud based data manipulation system 330, the bespoke plug-in 379 may capture 505 the prepared request. The bespoke plug-in 379 may then process 506 the mutation embedded in the prepared request, encrypting the data manipulation event recorded within the mutation, and thereby ensuring that all manipulated data transmitted to the cloud based data manipulation system 330 are transmitted in encrypted format. When a data manipulation event is encrypted in this way, the set of individual operations comprising the data manipulation event will be encrypted individually. Operations that involve the addition of new content may have that content encrypted.

Accordingly, the content of insertion operations will be encrypted. It may be possible to also encrypt the information relating to a deletion operation, such as where in a data file the deletion is to be made, the size of the deletion, etc. However, because they do not entail the addition of any new content, it is not strictly necessary to encrypt deletion operations. The manner in which individual operations may be encrypted will be described in greater detail below. Once the content of all insertion operations in a mutation has been encrypted, the mutation has been processed.

Once the mutation embedded within a prepared request has been encrypted, the prepared request is then transmitted 507 to the cloud based data manipulation system 330 by the web browser 315 so that the mutation may be committed to the data file stored thereon. The mutation may be committed to the stored data file in a number of ways. In one embodiment of the invention each operation in each mutation is stored individually on the cloud based data manipulation system 330 in a chronological history of such operations. The full history of such operations is representative of the data file in its up-to-date state. This embodiment is described further with reference to Figures 8-12 below, where each mutation is referred to as a "revision element". In this embodiment of the invention, the cloud-based data manipulation system 330 may subsequently transmit a confirmation to the client device confirming that the mutation has been received and that the set of operations contained therein has been stored, and informing the client device of each operation's chronological position within the chronological history of stored operations. Alternatively, the mutation, once received by the cloud based data manipulation system 330, may be directly applied to the data file, and the data file itself stored in an up-to-date format.

Figure 6 illustrates how the embodiment of the invention depicted in Figure 4 may ensure that all manipulated data transmitted from the client device 410 to the cloud-based data manipulation system 430 are in encrypted form. Similar to the previous embodiment, the user of client device 410 may be presented with a data file via productivity/office software 452, which may also afford the user the ability to manipulate the data in the data file. Referring now to Figure 6, when the user manipulates 601 the data in such a data file, it may be regarded as a data manipulation event. The changes to the data file embodied by the data manipulation event are recorded 602 by the productivity/office software 452 and encoded as a "mutation", as previously described with respect to Figure 5. As with Figure 5, the amount of data manipulation that may be recorded in a single mutation is a matter of preference. With respect to the embodiment depicted in Figure 6, it will be appreciated that the productivity/office software 452 may be configured to record mutations according to such preferences. Examples of considerations that may be taken into account when determining when to record a mutation are described with reference to Figure 5.

Once the mutation has been encoded at step 602, the productivity/office software 452 may prepare a request for transmitting the mutation to the cloud based data manipulation system 430, and may embed 603 the mutation within the prepared request. The prepared request may, for example, comprise an HTTPS request.

Similar to the previous embodiment, prior to the transmission of the prepared request from the productivity/office software 452 to the cloud based data manipulation system 430, the bespoke extension 459 may capture 605 the prepared request. The bespoke extension 459 may then process 606 the mutation embedded in the prepared request, encrypting the data manipulation event recorded within the mutation, and thereby ensuring that all manipulated data transmitted to the cloud based data manipulation system 430 is transmitted in encrypted format. The manner in which the mutation is processed proceeds in a fashion analogous to that described with reference to step 506 of Figure 5.

Once the mutation embedded within a prepared request has been encrypted, the prepared request is then transmitted 607 to the cloud based data manipulation system 430 so that the mutation may be committed to the data file stored thereon. The mutation may be committed to the stored data file as described above, with reference to step 507 of Figure 5.

While Figures 5 and 6 are described in the context of data manipulation events related to an existing data file, it will be appreciated that the creation of a new data file and the first insertion of data into said data file in itself also constitutes a data manipulation event. Accordingly, in the embodiment of the invention described with respect to step 507 of Figure 5, where data manipulation events may be stored on the cloud based data manipulation system as a history of mutations, the creation and initial insertion of content into a data file may be stored as the first mutation in such a history. Accordingly, in such an embodiment, the history of mutations comprises a self-contained representation of the data file. In one embodiment of the invention, mutations may be transmitted to the cloud based data manipulation system over the primary channel.

In order to ensure that the manipulated data are successfully and efficiently encrypted prior to their being relayed to the cloud-based data manipulation system, it is necessary to first ensure that the data to be encrypted do not contain characters that will cause a problem during the encryption process. In one embodiment, the data being manipulated may be in the form of a text document comprising UTF-8 characters. However, it will be readily appreciated that other data and/or character formats may also be used. In the embodiment where the text document comprises UTF-8 characters, it is necessary to ensure prior to encryption that the document does not contain any characters that might raise an error when handled by the data manipulation functionality, as this could cause problems during the encryption process. Figure 7 depicts the universal set of all UTF-8 characters 701, which comprises the set of all printable characters 702 and the set of control characters 703.

The set of control characters 703 in turn comprises characters 704 that might raise an error when handled by the data manipulation functionality such as End of Transmission ("EOT"), Bell ("BEL"), Synchronous Idle ("SYN") and Acknowledge ("ACK"), and typographical control characters 705 such as "space", "newline" or "tab". In this embodiment, any error-raising characters in the manipulated data may be stripped out by projecting all characters in the data onto an abridged subset of UTF-8 characters that excludes error-raising characters, thereby allowing for error-free encryption.
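One possible way of projecting the manipulated data onto an abridged character subset is sketched below. This is a minimal illustration that assumes the error-raising characters 704 can be approximated by the Unicode "Cc" (control) category, and that newline and tab are the only typographical control characters to be retained; the function name `strip_error_raising` is hypothetical.

```python
import unicodedata

# Typographical control characters that are kept (an assumed, minimal set).
# The space character is not in the "Cc" category and is therefore kept anyway.
ALLOWED_CONTROLS = {"\n", "\t"}


def strip_error_raising(text: str) -> str:
    """Drop control characters such as EOT, BEL, SYN and ACK from the text."""
    return "".join(
        ch for ch in text
        if ch in ALLOWED_CONTROLS or unicodedata.category(ch) != "Cc"
    )


# \x04 = EOT, \x07 = BEL, \x16 = SYN, \x06 = ACK
sample = "Hello\x04 wor\x07ld\x16\x06\n\tdone"
cleaned = strip_error_raising(sample)
```

Here `cleaned` retains the printable characters, the space, the newline and the tab, while the four error-raising characters are removed.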

In order to obviate the need to re-encrypt an entire data file whenever its content is manipulated, each data file may be represented by a series of discrete elements which, when taken together, are collectively representative of the complete data file. These data file elements may, for example, each represent a disjoint, contiguous portion of the data file content, which may be updated when a change is made to the data file content. By way of alternative example, each data file element may represent a discrete change to the data file content, with each new change giving rise to a new corresponding data file element. Using this model, when changes are made to the data file, it is sufficient to only encrypt the data file element representative of the change and relay it to the cloud-based data manipulation system for storage. As will be described in greater detail below, each data file element may be encrypted on the basis of a secret key and its own unique seed string. Using a different, unique seed string for each separate data file element effectively eliminates the threat that the data file may be compromised via a re-use attack if a stream cipher encryption scheme is used to perform the encryption.

There are a number of ways in which a data file may be divided into data file elements in order to allow efficient and secure encryption. In one embodiment of the invention (hereafter termed the "revision" embodiment), the data file elements may each represent a discrete, chronologically successive data file content manipulation operation. Data file content manipulation operations include discrete data insertion operations (comprising the insertion into the data file content of a contiguous string of data) and also include discrete data deletion operations (comprising the deletion from the data file content of a contiguous string of data). Data file elements in the form of data file content manipulation operations will be referred to as "operation elements".

One or more chronologically successive operation elements may be regarded as a "data manipulation event". In this embodiment of the invention, data manipulation events may be recorded and then applied to the data file content. Data manipulation events recorded in this way will hereafter be referred to as "revision elements".

Therefore, a revision element may comprise a set of one or more successive operation elements. The data file may be stored on the cloud-based data manipulation system in this manner, as a history of successive encrypted operation elements, each belonging to a specific revision element (this history will be referred to as a "revision history").

As such, revision elements may be synonymous with the mutations described in reference to Figures 5 and 6, and operation elements may correspond to individual operations within the mutations of Figures 5 and 6. Each operation element within each of these revision elements may be associated with a unique combination of relevant metadata. Such metadata may include a unique session identifier relating to a particular session established between a client device and the cloud-based data manipulation system; a user identifier that identifies the user responsible for the data manipulation event; a timestamp; the chronological position of the operation element within the revision history; the position of the data manipulation event within the data file content; and the length of the data string being manipulated (the length of the data string being inserted into the data file content in the case of a data insertion operation, or the length of the data string being deleted from the data file content in the case of a data deletion operation). Because each operation element may be stored in a chronological history in this way, it may be considered to have timeline-specific properties.

It will be appreciated that when a data manipulation event has occurred, and it is intended to commit it to the data file on the cloud-based data manipulation system as a corresponding newly created revision element, the revision element may first be encrypted before it is transmitted over the network. As a new revision element may comprise a succession of newly created discrete operation elements, each operation element may be encrypted in turn, as appropriate. As mentioned above, a unique seed string may be used in the encryption of each operation element. In an embodiment of the invention, each seed string may comprise a unique combination of metadata associated with each operation element, and may be used as the input of a hashing function to produce a message digest that may then be used in the encryption process.

For example, the unique seed string may comprise a concatenation of a session ID, a user ID, and the chronological position of the operation element in the chronology of all operation elements that have been recorded in the revision history to date.
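The use of a per-element seed string might be sketched as follows. This is an illustrative assumption rather than the encryption scheme of the invention: SHA-256 is assumed as the hashing function, the `|` separator is an assumption, and the HMAC-based keystream with XOR is a toy stand-in for a stream cipher that should not be treated as a vetted construction. All function names are hypothetical.

```python
import hashlib
import hmac


def seed_string(session_id: str, user_id: str, position: int) -> str:
    """Concatenate session ID, user ID and chronological position (separator assumed)."""
    return f"{session_id}|{user_id}|{position}"


def seed_digest(seed: str) -> bytes:
    """Hash the seed string to a message digest (SHA-256 is an assumption)."""
    return hashlib.sha256(seed.encode("utf-8")).digest()


def keystream(secret_key: bytes, digest: bytes, length: int) -> bytes:
    """Toy keystream: HMAC over the digest and a block counter (illustrative only)."""
    out = b""
    counter = 0
    while len(out) < length:
        block = hmac.new(secret_key, digest + counter.to_bytes(4, "big"),
                         hashlib.sha256).digest()
        out += block
        counter += 1
    return out[:length]


def xor_crypt(secret_key: bytes, seed: str, data: bytes) -> bytes:
    """XOR the data with the keystream; applying it twice restores the data."""
    ks = keystream(secret_key, seed_digest(seed), len(data))
    return bytes(a ^ b for a, b in zip(data, ks))


key = b"shared secret key"
plaintext = "Hello".encode("utf-8")
c1 = xor_crypt(key, seed_string("s1", "alice", 1), plaintext)
c2 = xor_crypt(key, seed_string("s1", "alice", 2), plaintext)
```

Because the two operation elements occupy different chronological positions, their seed strings differ, and so the same plaintext is combined with a different keystream in each case. This is the property that guards against keystream re-use.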

Encrypting only the revision element is efficient, because only the necessary data (i.e. the manipulated data) are encrypted and sent, rather than the entire data file.

Consequently, resources are not wasted encrypting and sending parts of the data file that have not undergone any modification during the data manipulation event in question. The revision history may thus comprise a history of successive encrypted revision elements, thereby ensuring that the data file is stored on the cloud-based data manipulation system in a secure fashion. The manner in which the revision elements may be encrypted is described in greater detail below.

When a user wishes to recall an encrypted data file from storage on the cloud based data manipulation system in the "revision" embodiment of the invention, the data file may be reconstituted from the history of encrypted revision elements. One manner of doing so is by constructing a locally stored data architecture that is representative of the data file. The data architecture may be constructed in stepwise fashion, processing each revision element in turn by individually retrieving them (beginning with the first revision element) and applying their corresponding data manipulation event to the data architecture. If a data manipulation event represents more than one discrete operation element, then these operation elements are applied chronologically. This construction may continue until all revision elements have been applied and the data architecture is fully constructed and thus fully representative of the data file as recorded in the retrieved revision history. Use of a data architecture to reconstruct a representation of the data file may assist in efficient processing of the revision history.

In one embodiment of the invention, the revision history may be retrieved on the secondary channel.

In one embodiment, the data architecture may comprise a directory data structure and an associated set of "data file piece" data structures. In this embodiment, each data file piece may store a number of values that allow it to reference a specific string of data content. The piece may store the source of the referenced data content string, and further mitigating values to isolate the referenced data content string from the source if the source comprises a larger string of data. Such mitigating values may include an offset value and a string length. The data content strings referenced by a complete set of data file pieces, when taken together, may collectively make up the complete data file content as embodied in the revision history. To aid in the assembly of the data content strings, the pieces may be listed in the directory in accordance with where their referenced data content strings are to be positioned within the data file content.
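A data file piece of the kind described above might be sketched as follows. The class name `Piece`, its field names and the use of a plain ordered list in place of the directory are assumptions made for brevity; decryption of the referenced strings is omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Piece:
    source: str  # the larger string of data from which the content is drawn
    offset: int  # where the referenced data content string starts in the source
    length: int  # length of the referenced data content string

    def text(self) -> str:
        """Isolate the referenced data content string from its source."""
        return self.source[self.offset:self.offset + self.length]


def assemble(directory: List[Piece]) -> str:
    """Join the referenced strings in the order in which the directory lists them."""
    return "".join(piece.text() for piece in directory)


first_source = "Hello world"
second_source = ", cloud"
# The first piece references only the first five characters of its source.
directory = [Piece(first_source, 0, 5), Piece(second_source, 0, 7)]
content = assemble(directory)
```

In this sketch the first piece references only part of its source, which shows how the offset and length values allow a piece to survive partial deletion of the string it originally referenced.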

For the purposes of explaining this process further, data content strings will be referred to as data file strings once they have been inserted into the data file.

As the constituent operation elements of revision elements are applied to these data structures, new pieces may be added, existing pieces may have their content references modified, existing pieces may have the position of their referenced content within the data file content modified, and/or existing pieces may be deleted. In each case, the directory is updated accordingly. In this way, each operation element within a revision element may be applied in turn (and each revision element may then be applied in turn) to the directory and associated set of data file pieces until all operation elements have been applied in chronological order, and the directory and associated set of pieces are fully representative of the data file as recorded in the retrieved revision history.

A directory data structure, if used, may be of any suitable type, for example, a self-balancing binary search tree. A self-balancing binary search tree, as will be readily understood by the skilled person, is a node-based data structure where each node has a value and is connected to no more than two child nodes. Each node may also be connected to a single parent node. Conventionally, child nodes on the left subtree of a given node all have values less than that of the given node, whereas child nodes on the right subtree of the given node all have values more than that of the given node.

As additional nodes are added to the tree, the nodes in the tree may be rearranged to keep the tree height (the number of "generations" of nodes) to a minimum, hence it is self-balancing. In the context of this embodiment of the invention, each node in the self-balancing binary search tree relates to one of the data file piece data structures, and the value of each node is the position of the data content string (referenced by the piece) within the data file content.
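The ordering property of such a tree might be illustrated as follows. For brevity this sketch omits the rebalancing step entirely, so it is a plain binary search tree rather than a self-balancing one, and each piece is represented simply by the string it references. The names `Node`, `insert` and `in_order` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    value: int  # content position of the referenced data content string
    piece: str  # stand-in for the associated data file piece
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root: Optional[Node], value: int, piece: str) -> Node:
    """Insert a node keyed by content position (no rebalancing in this sketch)."""
    if root is None:
        return Node(value, piece)
    if value < root.value:
        root.left = insert(root.left, value, piece)
    else:
        root.right = insert(root.right, value, piece)
    return root


def in_order(root: Optional[Node]) -> List[str]:
    """Visit nodes in ascending content position."""
    if root is None:
        return []
    return in_order(root.left) + [root.piece] + in_order(root.right)


root = None
# Nodes are added out of order; the tree still yields them by content position.
for position, piece in [(7, "world"), (1, "Hello "), (12, "!")]:
    root = insert(root, position, piece)
ordered = in_order(root)
```

An in-order traversal returns the pieces sorted by content position regardless of the order in which they were added, which is what allows the directory to drive assembly of the data file.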

The data file may be assembled for viewing from the fully constructed data architecture. In the embodiment where the data architecture comprises a directory and associated set of pieces, the data content strings referenced by the pieces may be amalgamated in accordance with their location within the data file, as dictated by the directory. The data content strings may be decrypted individually prior to assembly, or the assembled data file content (comprising a contiguous set of data file strings) may be decrypted en bloc. Preferably, each data content string is decrypted individually, as will be described in further detail below.

Figure 8 illustrates a method by which the data architecture may be constructed. A client device (as illustrated by 110, 120, 210 and 220 of Figures 1 and 2) prompted to recall an encrypted data file from the cloud based data manipulation system (130, 230 of Figures 1 and 2) in accordance with the "revision" embodiment of the invention may first request and load 801 a revision history from the system 130, 230. The device may subsequently initialize a data architecture representative of the data file, to which the modifications may be made as revision elements from the loaded revision history are applied. In the described embodiment, after step 801, a directory data structure in the form of a self-balancing binary search tree data structure is initialized 803, and provision is made for a set of "data file piece" data structures. It will be appreciated that at this stage, before any revision element is applied, no data file pieces will yet exist in the set of data file pieces (because no data file content has yet been obtained from the loaded revision history), and accordingly, the tree will be blank.

Subsequent to the initialization of the binary search tree at 803, the device at step 804 may check whether there are any remaining revision elements yet to be processed. It will be appreciated that in the event that no revision elements have yet been processed, this step will result in the processing of the first revision element in the loaded revision history. In the event all revision elements have been processed, the search tree and associated set of pieces may be stored 809 for use in subsequent assembly for viewing and/or modification of the data file. In the event revision elements exist that have yet to be processed, the device will then set about applying the data manipulation event embodied in the next revision element to the search tree and associated set of data file pieces by proceeding to step 805.

As discussed above, a data manipulation event embodied in a revision element may comprise a plurality of operation elements, and so processing of the revision element may entail the sequential application of these operation elements to the search tree and associated set of data file pieces. Accordingly, after step 804, the device may then check in step 805 whether the revision element currently being processed comprises any outstanding discrete operation elements that have not yet been applied to the search tree and associated set of data file pieces. If all operation elements have been applied, it can be concluded that the revision element in question has been fully processed, and the device returns to step 804. However, if there is at least one outstanding operation element that must be applied, the device then checks, at step 806, whether the next operation element represents a discrete data string insertion or a discrete data string deletion. In the event an insertion is detected, the insertion is applied in step 807 to the search tree and associated set of data file pieces in the manner described below with reference to Figure 9. Likewise, in the event a deletion is detected, the deletion is applied in step 808 to the search tree and associated set of data file pieces in the manner described below with reference to Figure 10. Once the insertion or deletion has been applied, the device then returns to step 805.
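The control flow of steps 803 to 809 might be sketched as follows. To keep the illustration short, a plain string stands in for the search tree and associated set of data file pieces, the operation elements are assumed to be already decrypted, and positions are zero-based; the dictionary keys and the function name `rebuild` are hypothetical.

```python
from typing import Dict, List


def rebuild(revision_history: List[List[Dict]]) -> str:
    """Replay a revision history, following the loop structure of Figure 8."""
    content = ""  # step 803: initialise an empty representation of the data file
    for revision in revision_history:      # step 804: any revision elements left?
        for op in revision:                # step 805: any operation elements left?
            pos = op["position"]
            if op["type"] == "insert":     # step 806: insertion or deletion?
                # step 807: apply the insertion
                content = content[:pos] + op["text"] + content[pos:]
            elif op["type"] == "delete":
                # step 808: apply the deletion
                content = content[:pos] + content[pos + op["length"]:]
            else:
                raise ValueError(f"unknown operation type: {op['type']}")
    return content  # step 809: the finished representation is kept for assembly


history = [
    [{"type": "insert", "position": 0, "text": "Hello world"}],
    [{"type": "delete", "position": 5, "length": 6},
     {"type": "insert", "position": 5, "text": ", cloud"}],
]
document = rebuild(history)
```

The first revision element creates the initial content, and the second comprises two operation elements that are applied in chronological order, yielding `"Hello, cloud"`.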

The application of discrete insertion or deletion operation elements will now be described in the context of the embodiment of the invention used in reference to Figure 8.

When the construction of a search tree and associated set of pieces is first initiated, the tree is empty, and no pieces yet exist. The first time an operation element comprising a data insertion operation is applied to the empty tree, a first piece is generated and the data content string it references is the content of this first insertion operation element. The piece also sets its mitigating values to indicate that the full content of the insertion operation element is being referenced, for example by setting the offset=0 and the length=n, where n is the length of the inserted string. A corresponding node will be generated in the tree, with an established relationship to this piece. The content position of this referenced data content string within the data file content will be recorded in the search tree as the node's value. Because this is the first insertion operation in the history of the data file's construction, it will be the first bit of content in the data file. Accordingly, this data content string, when inserted, is to be positioned at the start of the data file content; the "first" position within the data file content. Therefore, the newly created node will be assigned a value corresponding to this first position.

Figure 9a is a visualization of a search tree 901 and associated set of data file pieces 903 that have only undergone a single insertion operation, such as that described in the previous paragraph. As such, the search tree 901 only comprises a single node 902, and the set of data file pieces 903 only comprises a single piece 904. As is represented by the dashed line, the single node 902 is related to the single piece 904.

In accordance with the preceding paragraph, the single piece 904 references the content of the first insertion operation element as the source of the data content string, and sets its mitigating values to length=n, and offset=0. The data content string referenced by the piece 904 is at this point the only content in the data file represented by tree 901 and associated set of data file pieces 903. Accordingly, this string will be positioned at the start of the data file content, and so node 902 which relates to piece 904 will be assigned the value "1" (i.e. node value=content position=1).

Figure 9b illustrates a visualization of the data file 911 as represented by the search tree and associated set of data file pieces of Figure 9a in the event the data file was to be assembled. As can be seen, the assembled data file is comprised of a data file string 914 corresponding to the data content string referenced by the single piece 904 of Figure 9a. The data file string 914 therefore also has a length=n, and is located at content position=1 within the data file. In the event that the revision history only comprises a single revision element comprising a single operation element comprising this single insertion operation, then Figure 9b would represent the data file assembled from the fully constructed tree and associated set of pieces. In the event the revision history comprised further operation elements, then these operation elements would also have to be applied before reaching the data file as assembled from the fully constructed tree and associated set of pieces.
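The state after a first insertion, and the assembly of the data file from it, might be sketched as below. The `Piece` fields, the `sources` list of inserted strings, and the use of a dictionary keyed by node value in place of a search tree are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Piece:
    source: int  # index of the insertion operation element holding the text
    offset: int  # mitigating value: 0-based start within the source string
    length: int  # mitigating value: number of characters referenced


def first_insertion(text: str) -> Dict[int, Piece]:
    """Create the single piece and its node (node value = content position 1)."""
    piece = Piece(source=0, offset=0, length=len(text))
    return {1: piece}  # a dict stands in for the search tree here


def assemble(nodes: Dict[int, Piece], sources: List[str]) -> str:
    """Concatenate the referenced strings in order of node value."""
    parts = []
    for _, piece in sorted(nodes.items()):
        text = sources[piece.source]
        parts.append(text[piece.offset:piece.offset + piece.length])
    return "".join(parts)
```

With a single piece, assembling simply returns the inserted string, which corresponds to data file string 914.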

Subsequent insertion operations during the construction of a search tree and associated set of data file pieces will now be discussed.

Figure 9c illustrates the visualized data file of Figure 9b where a subsequent insertion operation element is to be applied. While in practice, this operation element will be applied to the tree and corresponding set of data file pieces, the insertion is shown here with reference to the assembled data file for illustrating the procedure on a conceptual level. In Figure 9c, the subsequent insertion operation element is to result in the insertion of a new data content string 945. The insertion of this new data content string 945 will result in a new data file string 925 having a length=x positioned at content position=k within the data file content. This new data content string 945 is to be inserted into the middle of the existing data file string 914.

Accordingly, the existing data file string 914 is split into two separate data file strings, 928 and 929, and the inserted new data content string 945 becomes new data file string 925 positioned between data file strings 928 and 929. Thus, data file string 928 now starts at content position=1 in the data file and has a length=(k-1); data file string 925 starts at content position=k in the data file and has a length=x; and data file string 929 starts at content position=(k+x) in the data file and has a length=(n-k+1).

In practice, such an operation element may be applied to the data structures representative of the data file. Figure 9d illustrates how the conceptual example of Figure 9c may be applied in practice to the search tree 901 and associated set of data file pieces 903 depicted in Figure 9a. A new data file piece 905 will be created and added to the set of pieces 903. The new piece 905 will be configured to reference the data content string corresponding to new data file string 925 in a manner analogous to the way piece 904 is configured to reference the data content string corresponding to data file string 914 as described above (i.e. it will indicate the content of this subsequent operation element of the revision history as the source of the data content string, and will set offset=0 and length=x). A new node 906 is created in the tree 901 that is related to new piece 905, and because the data content string referenced by piece 905 is to be inserted at content position=k in the data file content, the node value of node 906 is in turn set to k.

As this new insertion necessitates the splitting of the existing data file string 914 into two strings 928 and 929, as discussed in the preceding paragraph, existing piece 904 that references the data content string corresponding to data file string 914 is substituted for replacement pieces 908 and 909. Existing piece 904 may be deleted and pieces 908 and 909 may be newly generated and added to the set of pieces 903.

Alternatively, piece 904 may be modified to become either one of 908 or 909, in which case only a single additional piece is generated and added to the set of pieces 903 (this one additional piece becoming the other of the two replacement pieces).

Replacement pieces 908 and 909 will be configured to reference data content strings corresponding to data file strings 928 and 929 respectively. Both pieces 908 and 909 will store a reference to the content of the first operation element of the file revision history that comprises an insertion operation as the source of their referenced data content strings. However, the mitigating values of each of these pieces will be configured to only refer to the relevant portions of this source string. As such, piece 908 will have the mitigating values offset=0 and length=(k-1), and piece 909 will have the mitigating values offset=(k-1) and length=(n-k+1). In this way, while both pieces refer to a data content string from the same source, the two strings are in fact different.

A relationship will also be established between each of these replacement pieces 908, 909 and a node in the tree 901, such that there is a one-to-one relationship between nodes in the tree and pieces in the set of pieces. In this example, piece 908 is related to node 902 and piece 909 is related to node 907. Because the data content string 928 referenced by piece 908 is to be inserted at content position=1 within the data file, the value of related node 902 is set =1. In a similar fashion, the values of nodes 906 and 907 are set =k and =(k+x), respectively.
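The split described for Figure 9d can be illustrated with a short sketch. The `Piece` structure and the function name are hypothetical, and a dictionary keyed by node value again stands in for the tree.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Piece:
    source: int  # index of the insertion operation element holding the text
    offset: int  # 0-based start within the source string
    length: int  # number of characters referenced


def insert_in_middle(piece: Piece, new_source: int, k: int, x: int) -> Dict[int, Piece]:
    """Split a piece located at content position 1 around an insertion.

    k is the 1-based content position of the insertion and x its length.
    Returns a mapping of node value -> piece for the three resulting pieces.
    """
    n = piece.length
    if not 1 < k <= n:
        raise ValueError("k must fall strictly inside the existing string")
    left = Piece(piece.source, piece.offset, k - 1)                  # cf. piece 908
    new = Piece(new_source, 0, x)                                    # cf. piece 905
    right = Piece(piece.source, piece.offset + k - 1, n - k + 1)     # cf. piece 909
    return {1: left, k: new, k + x: right}
```

For n=10, k=4 and x=3 this yields node values 1, 4 and 7, with the two replacement pieces referencing characters 0 to 2 and 3 to 9 of the original source string.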

As a result of the above process, the tree 901 now has three nodes 902, 906 and 907, and their interrelationship may potentially be represented in a number of ways.

However, in accordance with the previously described self-balancing properties of the self-balancing binary search tree implemented in the described embodiment, the tree 901 will rearrange the nodes so that the parent node is node 906 because it is related to the piece 905 that references the data string positioned in the middle of the data file content. As such, node 906 may have one left child node 902 (having a value less than the parent node), and one right child node 907 (having a value greater than the parent node). This results in a tree of minimum height (i.e. a single "generation" of nodes, where other arrangements might have resulted in two "generations").
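The shape of the resulting tree can be illustrated by building a minimum-height binary search tree from sorted node values. This sketch is not a self-balancing implementation (no rotations are performed); the `Node` class and both helper functions are assumptions that only show the arrangement a balanced tree would reach for three nodes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    value: int                      # content position of the related piece
    left: Optional["Node"] = None   # subtree of smaller values
    right: Optional["Node"] = None  # subtree of larger values


def build_balanced(values: List[int]) -> Optional[Node]:
    """Build a minimum-height tree by rooting each subtree at its median."""
    if not values:
        return None
    mid = len(values) // 2
    return Node(values[mid],
                build_balanced(values[:mid]),
                build_balanced(values[mid + 1:]))


def levels(node: Optional[Node]) -> int:
    """Number of levels in the tree, counting the root as one level."""
    if node is None:
        return 0
    return 1 + max(levels(node.left), levels(node.right))
```

For node values 1, k and k+x, the median k becomes the parent, matching the arrangement in which node 906 has children 902 and 907.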

In contrast to the insertion operation described with respect to Figures 9c and 9d, if an insertion operation is to be performed on the data structures represented by Figure 9a and visualised in Figure 9b where the data content string is instead to be inserted into the very beginning or the very end of the data file, the process would be less complex.

In either event (insertion at beginning or end), it would not be necessary to replace the existing piece 904 with two pieces 908 and 909, each respectively referencing new data file strings 928 and 929 in place of the existing data file string 914. It would be readily understood by the skilled person that in the event the new data content string is to be inserted at the end of the data file, the existing piece 904 and related node 902 may remain unmodified as the new piece and related new node are added. In the event the new data content string is to be inserted at the beginning of the data file it would be sufficient to modify the value of existing node 902 to account for the shift in position of data file string 914 within the newly-modified data file content.

It will be readily appreciated that the insertion processes described in the preceding paragraphs with respect to Figures 9c and 9d may equally apply where the tree comprises a plurality of nodes and the set of data file pieces comprises a corresponding plurality of pieces. In the event a plurality of data file strings already exists, the only other consideration is that the position of some of these existing data file strings may have to be moved. In such circumstances, the process will proceed as described above, but will also update the node values relating to the pieces having data content strings that correspond to such data file strings.
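A generalised insertion over several pieces, covering insertion at the beginning, in the middle, and at the end, might look like the sketch below. It assumes an ordered list of pieces in place of the tree and recomputes node values from piece lengths rather than updating them in place; all names are illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Piece:
    source: int  # index of the insertion operation element holding the text
    offset: int  # 0-based start within the source string
    length: int  # number of characters referenced


def insert(pieces: List[Piece], position: int, new_piece: Piece) -> List[Piece]:
    """Insert new_piece at a 1-based content position, splitting if needed."""
    result, start, placed = [], 1, False
    for piece in pieces:
        end = start + piece.length            # first position after this piece
        if not placed and position == start:  # lands on a piece boundary
            result += [new_piece, piece]
            placed = True
        elif not placed and start < position < end:  # lands inside this piece
            cut = position - start
            result += [Piece(piece.source, piece.offset, cut),
                       new_piece,
                       Piece(piece.source, piece.offset + cut, piece.length - cut)]
            placed = True
        else:
            result.append(piece)
        start = end
    if not placed:
        if position != start:
            raise ValueError("position lies beyond the end of the content")
        result.append(new_piece)              # insertion at the very end
    return result


def node_values(pieces: List[Piece]) -> List[int]:
    """Content position of each piece, i.e. the value of its related node."""
    values, start = [], 1
    for piece in pieces:
        values.append(start)
        start += piece.length
    return values
```

Inserting at the start shifts every later node value by the inserted length, while inserting at the end leaves the existing values unchanged.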

Figures 10-12 illustrate how a deletion operation may be applied to a data architecture that is representative of the data file in accordance with the embodiment of the invention where the data architecture comprises a self-balancing binary search tree and an associated set of data file pieces.

Figure 10a is similar to Figure 9a in that it depicts a search tree 1001 and associated set of data file pieces 1003 wherein the tree 1001 comprises a single node 1002 and the set of data file pieces 1003 comprises a single piece 1004 related to said node 1002. Only one insertion operation element from the revision history (inserting content of length=n) has been so far applied to these data structures. The piece 1004 references a data content string the source of which is the applied insertion operation element. In this example, no subsequent data deletion operation elements have been applied to either end of the data file, and so the data content string referenced by the piece will be the full content of the insertion operation element. Thus, the mitigating values stored in the piece 1004 may be offset=0 and length=n.

Figure 10b illustrates a visualization of the data file 1011 as represented by the search tree 1001 and associated set of data file pieces 1003 of Figure 10a in the event the data file was to be assembled from these data structures. As can be seen, the assembled data file is comprised of a data file string 1014 corresponding to the data content string referenced by the single piece 1004 of Figure 10a. The data file string 1014 therefore also has a length=n, and is located at content position=1 within the data file content.

Figure 10c illustrates the visualized data file 1011 of Figure 10b where a deletion operation element is then to be applied. While in practice this deletion will be applied to the tree 1001 and corresponding set of data file pieces 1003 as depicted in Figure 10a, the deletion is shown here with reference to the assembled data file 1011 for illustrating the procedure on a conceptual level. In Figure 10c, the deletion operation element comprises deletion of a portion 1025 of the existing data file string 1014. The portion to be deleted 1025 begins at content position=h of the data file content and has a length of (x), thereby extending from content position=h for x positions up to and including content position=(h+x). The next undeleted content position=(h+x+1) will be referred to as (k) for brevity. Therefore, the last deleted content position=(h+x) may also be written as (k-1). Because the portion to be deleted 1025 is to be deleted from the middle of the existing data file string 1014, the existing data file string 1014 is split into two separate data file strings, 1028 and 1029. Thus, data file string 1028 now starts at content position=1 in the data file content and has a length=(h-1), and data file string 1029 starts at content position=h in the data file content and has a length=(n-k).

In practice, such a deletion operation element may be applied to the data structures representative of the data file. Figure 10d illustrates the result of applying the deletion operation element depicted in the conceptual example of Figure 10c to the search tree 1001 and associated set of data file pieces 1003 depicted in Figure 10a. As this new deletion necessitates the splitting of the existing data file string 1014 into two strings 1028 and 1029, as discussed in the preceding paragraph, existing piece 1004 that references the data content string corresponding to data file string 1014 is substituted for replacement pieces 1008 and 1009. Existing piece 1004 may be deleted and pieces 1008 and 1009 may be newly generated and added to the set of pieces 1003.

Alternatively, piece 1004 may be modified to become either one of 1008 or 1009, in which case only a single additional piece is generated and added to the set of pieces 1003 (this one additional piece becoming the other of the two replacement pieces).

Replacement pieces 1008 and 1009 will be configured to reference data content strings corresponding to data file strings 1028 and 1029 respectively. Both pieces 1008 and 1009 will store a reference to the content of the first insertion operation element of the file revision history as the source of their referenced data content strings. However, the mitigating values of each of these pieces will be configured to only refer to the relevant portions of this source string. As such, piece 1008 will have the mitigating values offset=0 and length=(h-1), and piece 1009 will have the mitigating values offset=(k-1) and length=(n+1-k). In this way, while both pieces refer to a data content string from the same source, the two strings are in fact different.
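The mitigating values of the two replacement pieces can be checked with a small sketch. It assumes the deleted span runs from content position h up to, but not including, the next undeleted position k, which is the convention under which offset=(k-1) and length=(n+1-k) select the trailing text. The names used are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Piece:
    source: int  # index of the insertion operation element holding the text
    offset: int  # 0-based start within the source string
    length: int  # number of characters referenced


def delete_from_middle(piece: Piece, h: int, k: int) -> Dict[int, Piece]:
    """Split a piece located at content position 1 around a deleted span.

    h is the first deleted content position and k the next undeleted one
    (both 1-based). Returns node value -> replacement piece.
    """
    n = piece.length
    if not 1 < h < k <= n:
        raise ValueError("the deleted span must lie strictly inside the string")
    left = Piece(piece.source, piece.offset, h - 1)                # cf. piece 1008
    right = Piece(piece.source, piece.offset + k - 1, n + 1 - k)   # cf. piece 1009
    return {1: left, h: right}


def text_of(piece: Piece, source_text: str) -> str:
    """Resolve a piece against the text of its source insertion."""
    return source_text[piece.offset:piece.offset + piece.length]
```

For the source string "abcdefghij" with h=4 and k=7, the left piece resolves to "abc" and the right piece to "ghij", so "def" is no longer referenced although the source string itself is unchanged.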

In contrast to the deletion operation described with respect to Figures 10c and 10d, if a deletion operation element is to be applied to the data structures represented by Figure 10a and visualised in Figure 10b where the portion of the data content to be deleted is at the very beginning or the very end of the data file, the process would be less complex. In either event (deletion at beginning or end), it would not be necessary to replace the existing piece 1004 with two pieces 1008 and 1009, each respectively referencing new data file strings 1028 and 1029 in place of the existing data file string 1014. In either event, it would be sufficient to merely modify the mitigating values of existing piece 1004, modifying the length value in the event of a deletion at the end of the data file and modifying both the length and the offset in the event of a deletion at the start of the data file.

It will be appreciated that, for a deletion operation such as that described in the preceding paragraphs with reference to Figures 10c and 10d, the process will fundamentally be the same regardless of how many data file strings are in the data file (and hence, how many pieces are in the data file piece set). In the event a plurality of data file strings already exists, the only other consideration is that the position of some of these data file strings may have to be moved. It is therefore sufficient, in addition to carrying out the procedure described in the preceding paragraphs, to update the node values relating to the pieces having data content strings that correspond to such data file strings.

Figure 11a depicts a search tree 1101 and associated set of data file pieces 1103 wherein the tree 1101 comprises two nodes 1102, 1105, and the set of data file pieces 1103 comprises two pieces 1107, 1109 related respectively to said nodes 1102, 1105.

In the present example this configuration of data structures is the result of two successive insertion operations, and as such the pieces 1107, 1109 reference data content strings from different insertion operation elements, the mitigating data being set accordingly. It will be appreciated that a configuration of data structures very similar to this could also be the result of an insertion operation, followed by a deletion operation at content position=j in the data file content. However, in that case, both pieces 1107, 1109 would reference data content strings from the same insertion operation element, and the mitigating values of the pieces would be different to those depicted in Figure 11c in order for each piece to identify the relevant part of the content of the insertion operation element. In this example, no subsequent data deletion operation elements have been applied to the data structures, and so the data content strings referenced by the pieces will correspond to the full content of the respectively referenced insertion operation elements. As such, the mitigating values stored in piece 1107 may be offset=0 and length=(j-1), and the mitigating values stored in piece 1109 may be offset=0 and length=(n-j).

Figure 11b illustrates a visualization of the data file 1111 as represented by the search tree 1101 and associated set of data file pieces 1103 of Figure 11a in the event the data file was to be assembled from these data structures. As can be seen, the assembled data file is comprised of two data file strings 1117 and 1119 corresponding respectively to the data content strings referenced by the pieces 1107 and 1109 of Figure 11a. The data file strings 1117 and 1119 therefore also have length=(j-1) and length=(n-j) respectively, and are respectively located at content position=1 and content position=j within the data file content.

Figure 11c illustrates the visualized data file 1111 of Figure 11b where a deletion operation element is then to be applied. While in practice, this deletion will be applied to the tree 1101 and corresponding set of data file pieces 1103 as depicted in Figure 11a, the deletion is shown here with reference to the assembled data file 1111 for illustrating the procedure on a conceptual level. In Figure 11c, the deletion operation element comprises deletion of a portion 1125 of the existing data file content. The portion to be deleted 1125 begins at content position=h of the data file content and has a length of (x), thereby extending from content position=h for x positions up to and including content position=(h+x). The next undeleted content position, (h+x+1), will be referred to as (k) for brevity. Therefore, the last deleted content position, (h+x), may also be written as (k-1). As can be seen, therefore, this portion to be deleted 1125 is directed to the trailing end of data file string 1117 and to the leading end of data file string 1119.

Because the portion to be deleted 1125 is to be deleted from the ends of two existing data file strings 1117, 1119, it is not necessary to split these strings. It is sufficient merely to truncate both data file strings in accordance with the deletion operation, and to modify the content position in the data file content of data file string 1119. Thus, while data file string 1117 still starts at content position=1, it now has a length=(h-1).

Data file string 1119 now starts at content position=h in the data file and has a length=(n-k).

In practice, such a deletion operation may be applied to the data structures representative of the data file. Figure 11d illustrates the result of applying the deletion operation element depicted in the conceptual example of Figure 11c to the search tree 1101 and associated set of data file pieces 1103 depicted in Figure 11a. This new deletion does not necessitate the further splitting of the existing data file strings 1117 and 1119, so it is not necessary to create new nodes or pieces. Rather, it is merely sufficient to modify the mitigating values in the pieces 1107, 1109 corresponding to data file strings 1117 and 1119 to account for their truncation, and to modify the node value of node 1105 that is related to piece 1109 to account for its change in content position within the data file. As such, piece 1107 will have the mitigating values offset=0 and length=(h-1), and piece 1109 will have the mitigating values offset=(k-j) and length=(n+1-k). In this way, while both pieces still refer to data content strings from the respective sources they had previously referred to, the exact content within these sources has now changed.
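Truncating two adjacent pieces without creating new ones might be sketched as follows. The function assumes the first piece starts at content position 1, the second at content position j, and that the deleted span runs from h up to, but not including, k with h < j < k. Lengths are computed from the pieces themselves rather than from n, and all names are illustrative.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Piece:
    source: int  # index of the insertion operation element holding the text
    offset: int  # 0-based start within the source string
    length: int  # number of characters referenced


def truncate_across_boundary(first: Piece, second: Piece,
                             j: int, h: int, k: int) -> Dict[int, Piece]:
    """Trim the tail of `first` and the head of `second`.

    Returns node value -> piece; the second node's value moves from j to h.
    """
    if not (1 < h < j < k < j + second.length):
        raise ValueError("span must cut into both pieces without consuming either")
    new_first = Piece(first.source, first.offset, h - 1)
    removed = k - j                                   # characters cut from `second`
    new_second = Piece(second.source, second.offset + removed,
                       second.length - removed)
    return {1: new_first, h: new_second}
```

With sources "abcde" and "fghij" (j=6), deleting from h=4 to k=8 leaves pieces resolving to "abc" and "hij", and the second node's value changes from 6 to 4.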

It will be appreciated that, for a deletion operation such as that depicted in Figures 11c and 11d where the portion of data file content to be deleted extends over the trailing ends of two contiguous data file strings, the process will be the same regardless of how many data file strings are in the data file (and hence, how many pieces are in the data file piece set). In the event more than two data file strings already exist, the only other consideration is that the content position of some of the data file strings other than those being modified may have to be moved. It is therefore sufficient, in addition to carrying out the procedure described in the preceding paragraphs, to update the node values relating to the pieces having data content strings that correspond to such data file strings. It will be appreciated that in the event that values of nodes are altered in this way, the search tree may rearrange itself in accordance with the self-balancing principles already described.

Figure 12a depicts a search tree 1201 and associated set of data file pieces 1203 wherein the tree 1201 comprises three nodes 1204, 1205, 1206, and the set of data file pieces 1203 comprises three pieces 1207, 1208, 1209 related respectively to said nodes 1204, 1205, 1206. In the present example this configuration of data structures is the result of three successive insertion operations, where the second and third insertion operations were each at the end of the data file. As such, the pieces 1207, 1208, 1209 reference data content strings from different insertion operation elements, the mitigating data being set accordingly. It will be appreciated that a configuration of data structures very similar to this could also be the result of a number of other combinations of insertion and deletion operations. For example, a first insertion operation, followed by a second insertion operation at content position=i in the data file would result in this configuration, as would a first insertion operation, followed by a second insertion operation at the end of the data file, followed subsequently by a deletion operation at either content position=i or content position=j of the data file.

However, in that case, pieces 1207, 1208, 1209 may reference data content strings from the same insertion operation elements, and the mitigating values of the pieces would be different to those depicted in Figure 12c in order for each piece to identify the relevant part of the content of the relevant insertion operation element. In this example, no subsequent data deletion operation elements have been applied to the data structures, and so the data content strings referenced by the pieces will correspond to the full content of the respectively referenced insertion operation elements. As such, the mitigating values stored in piece 1207 may be offset=0 and length=(i-1); the mitigating values stored in piece 1208 may be offset=0 and length=(j-i); and the mitigating values stored in piece 1209 may be offset=0 and length=(n-j).

Figure 12b illustrates a visualization of the data file 1211 as represented by the search tree 1201 and associated set of data file pieces 1203 of Figure 12a in the event the data file was to be assembled from these data structures. As can be seen, the assembled data file is comprised of three data file strings 1217, 1218 and 1219, corresponding respectively to the data content strings referenced by the pieces 1207, 1208 and 1209 of Figure 12a. The data file strings 1217, 1218 and 1219 therefore also have length=(i-1), length=(j-i) and length=(n-j) respectively, and are respectively located at content position=1, content position=i, and content position=j within the data file content.

Figure 12c illustrates the visualized data file 1211 of Figure 12b where a deletion operation element is then to be applied. While in practice, this deletion will be applied to the tree 1201 and corresponding set of data file pieces 1203 as depicted in Figure 12a, the deletion is shown here with reference to the assembled data file 1211 for illustrating the procedure on a conceptual level. In Figure 12c, the deletion operation element comprises deletion of a portion 1225 of the existing data file content. The portion to be deleted 1225 begins at content position=h of the data file content and has a length of (x), thereby extending from content position=h for x positions up to and including content position=(h+x). The next undeleted content position, (h+x+1), will be referred to as (k) for brevity. Therefore, the last deleted content position=(h+x) will also be written as (k-1). As can be seen, therefore, this deletion portion 1225 is directed to the trailing end of data file string 1217, to the entirety of data file string 1218, and to the leading end of data file string 1219.

Because the ends of two existing data file strings 1217, 1219 are to be deleted it is not necessary to split these data file strings. Furthermore, because data file string 1218 is to be deleted in its entirety, this data file string 1218 may simply be removed en bloc.

Therefore, it is sufficient merely to remove data file string 1218, and to truncate both data file strings 1217 and 1219 in accordance with the deletion operation, then to modify the content position in the data file of data file string 1219. Thus, while data file string 1217 still starts at content position=1, it now has a length=(h-1). Data file string 1219 now starts at content position=h in the data file content and has a length=(n-k). As can be seen, data file string 1218 has been removed.

In practice, such a deletion operation element may be applied to the data structures representative of the data file. Figure 12d illustrates the result of applying the deletion operation element depicted in the conceptual example of Figure 12c to the search tree 1201 and associated set of data file pieces 1203 depicted in Figure 12a. This new deletion does not necessitate the further splitting of the existing data file strings 1217 and 1219, and necessitates the removal of existing data file string 1218. Therefore it is not necessary to create new nodes or pieces. Rather, it is merely sufficient to modify the mitigating values in the pieces 1207 and 1209 corresponding to data file strings 1217 and 1219 to account for their truncation, to modify the node value of node 1206 that is related to piece 1209 to account for its change in content position within the data file, and to delete piece 1208 from the set of data file pieces 1203 along with deleting its related node 1205 from the search tree 1201. As such, piece 1207 will have the mitigating values offset=0 and length=(h-1), and piece 1209 will have the mitigating values offset=(k-j) and length=(n+1-k). In this way, while both pieces still refer to data content strings from the respective sources they had previously referred to, the exact content within these sources has now changed.

It will be appreciated that, for a deletion operation such as that depicted in Figures 12c and 12d where the portion of data file content to be deleted extends over the entire length of at least one data file string, the process will be the same regardless of how many entire data file strings are to be deleted; it will simply be a matter of deleting the data piece and related node corresponding to every data file string deleted in this way. Furthermore, the process will proceed analogously, regardless of whether the portion of deleted file content extends over the trailing end of a data file string, or whether it merely ends at a data file string boundary. If the deleted portion ends at a data file string boundary it merely means that it will not be necessary to amend the mitigating values of the piece corresponding to the data file string bounding the deleted portion. In addition, this process will proceed in a similar fashion, regardless of how many data file strings are in the data file (and hence, how many pieces are in the data file piece set). In the event more than three data file strings already exist, the only other consideration is that the content position of some of the data file strings other than those being modified may have to be moved. It is therefore sufficient, in addition to carrying out the procedure described in the preceding paragraphs, to update the node values relating to the pieces having data content strings that correspond to such data file strings.
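The three deletion cases (splitting a piece, truncating a piece, and removing a piece entirely) can be handled by one routine, sketched below under the assumption that an ordered list of pieces stands in for the tree and that the deletion is given as a first deleted position h and a count of deleted characters. The names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Piece:
    source: int  # index of the insertion operation element holding the text
    offset: int  # 0-based start within the source string
    length: int  # number of characters referenced


def delete(pieces: List[Piece], h: int, count: int) -> List[Piece]:
    """Remove `count` characters starting at 1-based content position h."""
    k = h + count                      # next undeleted content position
    result, start = [], 1
    for piece in pieces:
        end = start + piece.length     # first position after this piece
        keep_left = max(0, min(h, end) - start)    # characters before the span
        keep_right = max(0, end - max(k, start))   # characters after the span
        if keep_left:
            result.append(Piece(piece.source, piece.offset, keep_left))
        if keep_right:
            result.append(Piece(piece.source,
                                piece.offset + piece.length - keep_right,
                                keep_right))
        start = end                    # a piece with nothing kept is dropped
    return result


def assemble(pieces: List[Piece], sources: List[str]) -> str:
    """Concatenate the text referenced by each piece, in order."""
    return "".join(sources[p.source][p.offset:p.offset + p.length] for p in pieces)
```

A piece that lies wholly inside the deleted span keeps nothing on either side and is therefore omitted from the result, which corresponds to deleting the piece and its related node.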

It will be appreciated that in the event that values of nodes are altered as a result of any of the versions of the operations described with reference to Figures 9-12 or any combination of operations deriving therefrom, the search tree may rearrange itself in accordance with the self-balancing principles already described.

When the data structures as referred to in Figures 8-12 have been fully constructed from the revision history, the data file may then be decrypted as will be described in further detail below, and then assembled on the client device for viewing and/or further modification. Referring back to Figure 2, the data file may be assembled from the fully constructed data structures by client applications 212, 222 and passed to respective web browsers 215, 225 for presentation to the users of respective client devices 210, 220 for viewing purposes. Alternatively, the data file may be assembled from the fully constructed data structures by productivity/office software 213, 223, which also presents the data file to the users of respective client devices 210, 220 for viewing purposes. The presented data file may then be modified by the user via the data manipulation functionality provided by client applications 212, 222 or by productivity/office software 213, 223 as described above with reference to Figures 5 and 6.

As previously mentioned, one of the advantages of a cloud-based data manipulation system is that it allows multiple users to work on a file concurrently. It will be appreciated however, that in the event there are multiple users collaborating via a plurality of client devices and are working on the data file at the same time, it is desirable to dynamically update the data file viewed by each user whenever a new data file element is committed to the data file. In the revision embodiment of the invention, this may be achieved by configuring the cloud based data manipulation system to relay a new revision element to all collaborating client devices once the revision element has been stored in the data revision history. In this way, the collaborating client devices may update their search tree and data file piece structures to account for the new data manipulation event embodied in the new revision element.

The collaborating client devices may then also update the data file accordingly as it is being viewed by each user. In an alternative embodiment, it may be preferable for the client devices to be configured to periodically request any newly committed revision elements from the cloud-based data manipulation system, rather than for the cloud-based data manipulation system to transmit the revision elements of its own volition.

In an aspect of the invention described with reference to Figures 8-12, the regular revision elements in a data file's revision history may optionally be interspersed with "snapshot" revision elements. Snapshot revision elements may contain the entire content of the data file as it was when the snapshot was created. As such, a snapshot revision element may comprise the aggregate of all preceding revision elements. Such snapshot revision elements may be used as a shortcut when reconstituting a file from the revision history. In this embodiment of the invention, a device that is reconstituting a data file from a revision history may begin at the most recent snapshot revision element rather than beginning at the very first operation element in the very first revision element in the revision history. Accordingly, the processing and decryption of all revision elements chronologically preceding the selected snapshot revision element may be deemed unnecessary, and processing resources are conserved as a result. Snapshot revision elements may comprise an identifier to allow the device reconstituting the data file to recognize them when the data file history has been retrieved, in order for them to be used in this way.
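Selecting the starting point for reconstitution might be sketched as follows. The `RevisionElement` class and its `is_snapshot` flag are assumptions standing in for whatever identifier a snapshot revision element carries.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RevisionElement:
    label: str                 # placeholder for the element's operation elements
    is_snapshot: bool = False  # hypothetical identifier marking a snapshot


def elements_to_process(history: List[RevisionElement]) -> List[RevisionElement]:
    """Return the suffix of the history starting at the most recent snapshot.

    If the history contains no snapshot, the whole history is returned.
    """
    for index in range(len(history) - 1, -1, -1):  # scan newest to oldest
        if history[index].is_snapshot:
            return history[index:]
    return list(history)
```

Elements that precede the chosen snapshot are never returned, so in this sketch they would not need to be decrypted or applied.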

Snapshot revision elements may be generated by client application 312, depending on the embodiment of the invention. The generation of a snapshot revision element may be triggered and performed by the application 312 without the need for user input. In one embodiment, the trigger may be in the form of a response from the cloud-based data manipulation system 330 confirming that a previous revision element transmitted by the application 312 was successfully stored in the data file revision history stored thereon. The application 312 may be configured such that the response only triggers the generation of a snapshot revision element in the event the response meets certain criteria. For example, the cloud-based data manipulation system 330 may transmit a response to the application 312 that comprises a value corresponding to the chronological position within the revision history of the newly stored operation element comprised in the revision element. In such a case, the triggering criteria may be set such that a trigger only occurs if the value is a multiple of a predetermined fixed-value integer. It will be appreciated that while the above example is discussed in the context of the embodiment of the invention set out in Figure 3, the snapshot revision element feature may equally be implemented in the context of the embodiment of the invention depicted in Figure 4, in which case productivity/office software 452 fulfils the role of client application 312 and the cloud-based data manipulation system is referenced by numeral 430. As has previously been mentioned, revision elements may correspond to the mutations referred to in Figures 5 and 6. Therefore, regular revision elements may be transmitted to the cloud-based data manipulation system using the primary channel. Snapshot revision elements, by contrast, may instead be transmitted over the secondary channel.
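By way of illustration only, the trigger criterion described above (a stored position that is a multiple of a predetermined integer) might be sketched as follows. The name `should_generate_snapshot`, the interval of 50 and the exclusion of position zero are assumptions made for the example and are not taken from the specification.

```python
SNAPSHOT_INTERVAL = 50  # hypothetical predetermined fixed-value integer


def should_generate_snapshot(position: int, interval: int = SNAPSHOT_INTERVAL) -> bool:
    """Decide whether a server response should trigger a snapshot.

    `position` is the chronological position reported by the server for the
    newly stored operation element. Position zero is excluded here as an
    assumption, since an empty history has nothing to aggregate.
    """
    return position > 0 and position % interval == 0
```

With the assumed interval, a response reporting position 100 would trigger a snapshot, whereas a response reporting position 101 would not.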

When a data file is retrieved at the start of a communication session between a client device and the cloud-based data manipulation system, it will be understood that it would be possible to commence construction of the search tree and set of pieces from the most recent snapshot revision element because it contains all the data file content up to the point that the snapshot was recorded. However, where multiple users are collaborating via a plurality of client devices and working on the data file at the same time, it is desirable to dynamically update the data file viewed by each user whenever a new revision element is stored as described above. This is equally the case when a new snapshot revision element is generated. In the event a snapshot revision element is generated by a client device and it is desirable to update the search trees and data file piece sets of all other collaborating client devices in real time with the newly-created snapshot revision element, it is necessary to ensure that all existing nodes in the search trees and all corresponding existing data file pieces in the data file piece sets of all other collaborating client devices are purged. In one example, this may be achieved by encoding a snapshot revision element as a pair of operation elements: an initial deletion of the entire contents of the data file; and a subsequent insertion of the entire contents of the data file. In the embodiment described with respect to Figures 8-12 above, this would entail deleting all nodes from the search tree and all the related data file pieces from the piece set, followed by inserting a single node and related data file piece which are an aggregate of all the deleted nodes and related data file pieces.
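The encoding of a snapshot as a deletion followed by an insertion may be sketched as below. This is a simplified illustration only: the piece set is modelled as a plain list of strings, the search tree is omitted, and the dictionary layout and function names are hypothetical.

```python
def encode_snapshot(pieces):
    """Encode the current content as two operation elements:
    delete everything, then insert everything as a single string."""
    content = "".join(pieces)
    return [
        {"op": "delete", "position": 0, "length": len(content)},
        {"op": "insert", "position": 0, "content": content},
    ]


def apply_snapshot(pieces, snapshot):
    """Purge a collaborator's existing pieces and replace them with one
    aggregate piece taken from the snapshot's insertion element."""
    delete_op, insert_op = snapshot
    if delete_op["op"] != "delete" or insert_op["op"] != "insert":
        raise ValueError("snapshot must be a delete followed by an insert")
    pieces.clear()                       # deletion of the entire contents
    pieces.append(insert_op["content"])  # insertion of the aggregate piece
    return pieces
```

A collaborator holding the pieces `["Hel", "lo ", "world"]` would, after applying the snapshot, hold the single piece `["Hello world"]`.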

As stated above, the revision embodiment of the invention relies on a unique combination of metadata particular to each operation element comprised within a revision element to generate a seed string for use in the encryption process.

Consequently, in order to successfully decrypt content encrypted in such a way, it must be possible to successfully identify the metadata used in the encryption process.

One example of a seed string for use in encryption of an operation element as given above is a concatenation of a session ID, user ID and the predicted chronological position of the operation element in the chronology of all operation elements that have been recorded in the revision history to date (hereafter referred to as "historical operation number"). However, in the event multiple users are working on a data file at the same time, it is possible that at any single time, two collaborators may both incorporate the most up-to-date historical operation number into the seed string used in the encryption of the operation element they each respectively transmit to the cloud-based data manipulation system, resulting in a "collision". It follows that while one collaborator's operation element will be assigned the predicted historical operation number, the other collaborator's revision will be assigned the subsequent historical operation number. As such, a revision will have been stored in the system having an operation element encrypted using historical operation number "n", whereas attempts to decrypt this operation element will be carried out using historical operation number "n+1". This would clearly lead to an incorrect decryption, and therefore presents a problem.

One solution to this problem would be to use a different seed string. Instead of using the historical operation number, it could be possible to use the chronological position of the operation element in the chronology of all operation elements that have been recorded with that session ID (hereafter referred to as the "session operation number"), and to instead concatenate this value with the session ID and user ID. This obviates the danger (as outlined above) of using an incorrect value in the encryption process. As the session operation number can only be updated via submissions from the client device associated with the session in question, it would not be possible to make a mistaken assumption about the next session operation number that is to be ascribed to an operation element. For decryption purposes, it will be possible to derive the session operation number by counting the number of pre-existing operation elements having a given session ID in the revision history. In terms of the cryptographic robustness of this approach, an actively attacking cloud-based data manipulation system may subvert uniqueness either by issuing non-unique session IDs or by selectively omitting certain members of the revision history as a means to taint the session operation number. However, diligent client devices may monitor session IDs and the revision history to guard against such an eventuality.
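The derivation of the session operation number by counting may be illustrated as follows. The sketch assumes that the revision history is available as an ordered list of operation elements, each carrying `session_id` and `user_id` fields, and that the seed string fields are joined with a `|` delimiter; the field names and the delimiter are hypothetical.

```python
def session_operation_number(history, index):
    """Count the pre-existing operation elements in the history that share
    the session ID of the element at `index`."""
    session_id = history[index]["session_id"]
    return sum(1 for op in history[:index] if op["session_id"] == session_id)


def session_seed_string(history, index):
    """Concatenate session ID, user ID and session operation number."""
    op = history[index]
    number = session_operation_number(history, index)
    return f'{op["session_id"]}|{op["user_id"]}|{number}'
```

Because the count depends only on elements bearing the same session ID, an interleaved submission from another collaborator does not alter the value used for decryption.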

While the approach in the above paragraph solves the collision problem, it effectively precludes the use of snapshots, because complete revision histories are integral to its functionality. In order to avoid collisions but to allow for the use of snapshots, the historical operation number of the most recent previous operation element recorded using that session ID could be used, and this value concatenated with the session ID and the user ID. This technique allows collisions to be avoided while also avoiding the need for the full revision history, thereby allowing snapshots to be used.
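A minimal sketch of this snapshot-compatible seed string is given below. It assumes that each retrieved operation element carries its own historical operation number in a `position` field, and that a fixed marker is used for the first operation element of a session; neither detail is specified in the description, and the field names and `|` delimiter are hypothetical.

```python
def snapshot_safe_seed(elements, index, first_marker="start"):
    """Build a seed string from the session ID, the user ID and the
    historical operation number of the most recent previous operation
    element recorded under the same session ID."""
    current = elements[index]
    previous = first_marker  # assumed value when the session has no earlier element
    for earlier in reversed(elements[:index]):
        if earlier["session_id"] == current["session_id"]:
            previous = earlier["position"]
            break
    return f'{current["session_id"]}|{current["user_id"]}|{previous}'
```

The value is already fixed when the client encrypts its next operation element, since it refers only to an element that the client itself has previously had stored.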

There are other ways in which a data file may be divided into data file elements in order to allow efficient and secure encryption. In another embodiment of the invention (hereafter termed the "chunk" embodiment), the data file elements may each represent a contiguous, variable-length constituent of the current data file content with a limit imposed on its maximum length. The data file content constituents of such data file elements are disjoint in nature (i.e. they have no data file content constituents in common). Such a data file element will hereafter be referred to as a "chunk element". In contrast to the revision elements described above which may refer to historical content that is no longer present in the current data file content, chunk elements will only reference content that is present in the up-to-date version of the data file. Thus, the chunk elements, when all taken together, may collectively represent the complete content of the data file in its current state. This manner of representing the data contained in the data file may be applied to the data file for encryption and decryption purposes, regardless of how the data file is actually stored on the cloud-based data manipulation system. However, in order to ensure that all current and future users share a common view of the data content in terms of constituent disjoint, contiguous chunk elements, it is necessary to reserve a portion of the data file content for the storage of metadata and state information. The metadata and state information may detail how the content of the data file has in practice been divided into chunk elements, such that each chunk element references a disjoint, contiguous portion of the data file content. The portion of the data file containing the metadata and state information will hereafter be referred to as the "header". All updates to the header may take place over the secondary channel. 
In one embodiment of the invention, the header may comprise a table, the table comprising fixed-size records which each describe one of the chunk elements. In an embodiment, the header may comprise the length of each chunk element, the order of the chunk elements relative to one another as they apply to the data file content, and an initialization vector for each chunk element. The initialization vector is a randomly or pseudorandomly generated value that is probabilistically unique, employed in the encryption of the associated chunk.
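One possible layout for such a header table is sketched below, purely as an assumption: each fixed-size record holds a 4-byte big-endian chunk length followed by a 16-byte initialization vector, and the relative order of the chunk elements is implied by the order of the records. The record layout and the function names are hypothetical.

```python
import os
import struct

# Assumed record layout: 4-byte unsigned length, then a 16-byte IV (20 bytes total).
RECORD = struct.Struct(">I16s")


def new_record(length):
    """Describe a chunk element with a freshly generated random IV."""
    return (length, os.urandom(16))


def pack_header(records):
    """Serialise the records in chunk order into a header byte string."""
    return b"".join(RECORD.pack(length, iv) for length, iv in records)


def unpack_header(header):
    """Recover the ordered (length, iv) records from a header byte string."""
    return [RECORD.unpack_from(header, offset)
            for offset in range(0, len(header), RECORD.size)]
```

Because the records are of fixed size, the record for the n-th chunk element can be located directly at byte offset `n * RECORD.size`.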

In the event the data file content is manipulated, it is necessary to adjust one or more chunk elements in order to account for this change. It may also be necessary to add new chunk elements or remove existing chunk elements. If a contiguous string is inserted at a point between two existing chunk elements (i.e. on a chunk element boundary), then a new chunk is created to reference this string, and it is positioned between the existing chunk elements. If a data string is inserted at a point in the data file content such that it falls within an existing chunk, the string is regarded as forming part of that existing chunk, and the length of the existing chunk is increased.

In the event that the length increase results in the existing chunk exceeding the maximum chunk element length limit, a new succeeding chunk element is created and references the spillover content portion formerly belonging to the existing chunk. If a contiguous string is deleted, chunk elements may have their length truncated, or complete chunk elements may be removed, as appropriate.
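The insertion rules set out above may be sketched as follows, covering insertion only. Chunk elements are modelled as a list of plaintext strings, and the maximum length of 8 characters is an arbitrary value chosen for the example. Where an inserted string or a spillover portion itself exceeds the maximum length, the sketch splits it across several new chunks; that behaviour is an assumption.

```python
MAX_CHUNK = 8  # hypothetical maximum chunk element length


def split(text, limit):
    """Cut text into consecutive pieces of at most `limit` characters."""
    return [text[i:i + limit] for i in range(0, len(text), limit)]


def insert(chunks, position, text, limit=MAX_CHUNK):
    """Return a new chunk list with `text` inserted at absolute `position`."""
    offset = 0
    for index, chunk in enumerate(chunks):
        if position == offset:
            # On a chunk boundary: new chunk(s) placed between existing chunks.
            return chunks[:index] + split(text, limit) + chunks[index:]
        if offset < position < offset + len(chunk):
            # Within a chunk: the chunk grows, and any content beyond the
            # limit spills over into new succeeding chunk(s).
            local = position - offset
            grown = chunk[:local] + text + chunk[local:]
            return chunks[:index] + split(grown, limit) + chunks[index + 1:]
        offset += len(chunk)
    if position == offset:
        return chunks + split(text, limit)  # appended at the end of the file
    raise IndexError("position lies beyond the end of the data file content")
```

For example, inserting `"WXYZ"` at position 2 of `["abcdef", "gh"]` grows the first chunk to ten characters, so the last two characters spill into a new succeeding chunk, giving `["abWXYZcd", "ef", "gh"]`.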

It will be appreciated that in the event of an insertion or a deletion operation, the length of chunks that have been subject to manipulation may change, and these changes will have to be recorded in the data file header. In the event chunk elements are created or deleted, these changes will also have to be recorded in the data file header. Changes to the absolute positioning of each chunk element within the data file may be recorded implicitly as a function of the recorded relative order of the chunk elements as they apply to the data file content and the recorded length of each chunk element.
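The implicit recording of absolute positions can be illustrated with a running sum over the recorded lengths, taken in the recorded order. The function name is hypothetical.

```python
def chunk_offsets(lengths):
    """Derive the absolute start position of each chunk element from the
    ordered list of chunk lengths held in the header."""
    offsets, position = [], 0
    for length in lengths:
        offsets.append(position)
        position += length
    return offsets
```

If the lengths `[5, 3, 8]` become `[5, 6, 8]` after an insertion into the second chunk, the derived offsets change from `[0, 5, 8]` to `[0, 5, 11]` without any offset having been stored explicitly.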

If data file content referenced by a chunk element is in any way modified (for example, content deleted or content inserted), a new initialization vector must be produced for the chunk element, and the entire content of the chunk must be re-encrypted and submitted to the cloud-based data manipulation system, completely replacing the corresponding chunk already stored thereon. Because a new initialization vector is generated whenever a chunk element is modified, the initialization vector may be regarded as timeline-specific. This is necessary in order to ensure the encrypted data is not susceptible to reuse attacks in the event that a stream cipher encryption scheme is used. Therefore, in the context of Figures 5 and 6, mutations stored in the cloud-based data manipulation system in the chunk embodiment will always comprise portions of encrypted data content corresponding to complete chunk elements. In other words, in the chunk embodiment, mutations comprise a set of one or more chunk elements. Furthermore, whenever a new initialization vector is generated for a given chunk element, the data file header must be updated to reflect the new chunk element in question.
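The reuse attack that a fresh initialization vector guards against can be demonstrated with a short worked example. If two versions of a chunk were encrypted with the same keystream, combining the two ciphertexts would cancel the keystream and expose the XOR of the two plaintexts to anyone holding both ciphertexts, without any knowledge of the key. The keystream below is a fixed placeholder used only for the demonstration.

```python
def xor(a, b):
    """Combine two equal-length byte strings with a bitwise XOR."""
    return bytes(x ^ y for x, y in zip(a, b))


keystream = bytes(range(16))       # placeholder keystream, reused in error
version_1 = b"attack at dawn!!"    # chunk content before modification
version_2 = b"retreat at dusk!"    # chunk content after modification

cipher_1 = xor(version_1, keystream)
cipher_2 = xor(version_2, keystream)

# The keystream cancels out: the attacker learns version_1 XOR version_2.
leaked = xor(cipher_1, cipher_2)
```

Here `leaked` equals `xor(version_1, version_2)`, which is why each modified chunk is re-encrypted under a newly generated initialization vector.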

As with the revision embodiment of the invention, in the event there are multiple users collaborating via a plurality of client devices and are working on the data file at the same time using the chunk embodiment of the invention, it is desirable to dynamically update the data file viewed by each user whenever a new revision element is stored in the cloud-based data manipulation system. This may be achieved by configuring the cloud-based data manipulation system to notify all collaborating client devices that the content of the data file and therefore the structure of the chunk elements has changed. As a result, the collaborating client devices may obtain the up-to-date data and then update the data file accordingly. In an alternative embodiment, it may be preferable for the client devices to be configured to periodically request an updated data file from the cloud-based data manipulation system, rather than for the cloud-based data manipulation system to transmit the revision elements of its own volition.

As with the revision embodiment, potential collisions are also a possibility with the chunk embodiment of the invention where there are multiple collaborators simultaneously working on a data file. In order to avoid this problem, it may be necessary to impose constraints on collaborators; in particular, a requirement that only one collaborator is editing a chunk at any one time. Since the maximum length of a chunk may be configured to be comparatively small, fine-grained collaboration is not particularly impeded.

Figure 13 illustrates the process of generating a keystream cipher for use in encrypting the data in accordance with an embodiment of the invention. As described above, when data within a data file is manipulated, the data file element containing the manipulated data (the "target" data file element) is encrypted and relayed to the cloud-based data manipulation system for storage. The data contained in the target data file element (the "target plaintext") may be encrypted using a keystream cipher.

In a preferred embodiment, the keystream cipher may be generated from a seed string comprising metadata unique to the data file element that is to be encrypted, by using a hashing algorithm on the seed string to produce a message digest, and running a block cipher encryption algorithm on the digest to produce a pseudorandom keystream that may be combined with the target plaintext. This keystream cipher may be referred to as a keystream "block". As discussed above, such metadata may include, but is not limited to: a unique session identifier relating to a particular session established between a client device and the cloud-based data manipulation system; a user identifier that identifies the user responsible for the data manipulation event; a timestamp; the chronological position of the data file element; the position of the data manipulation event within the data file content; the length of the data string being manipulated (the length of the data string being inserted into the data file content in the case of a data insertion operation, or the length of the data string being deleted from the data file content in the case of a data deletion operation).

In an alternative embodiment, the keystream cipher may be generated from a block cipher encryption algorithm that has been run based on a seed string comprising an initialization vector that is unique to the data file element to be encrypted. The initialization vector may be a randomly or pseudorandomly generated sequence that is probabilistically unique.

In a preferred embodiment, the keystream block and target plaintext are combined by way of an XOR operation to produce an encrypted form of the plaintext, termed the ciphertext. In the event that the target plaintext is longer than the keystream block that is generated in the way described above, encryption may be achieved by running successive iterations of hashing and block encryption functions to produce successive keystream blocks, and encrypting successive portions of the target plaintext of corresponding length with the successive keystream blocks until the entire target plaintext has been encrypted. In a preferred embodiment, this succession of actions may be performed in what is known as counter (CTR) mode encryption, but it will be appreciated that other methods may be employed to ensure full encryption of the target plaintext.

In CTR mode encryption, it is first determined how many successive keystream blocks will be required to allow the full target plaintext to be encrypted by comparing the length of the target plaintext with the length of a keystream block. Then, as depicted in Figure 13, in step 1301 metadata unique to the target data file element associated with the target plaintext is used as a seed string, and a variable representing the number of blocks is initialized at zero. It is then determined in step 1302 whether sufficient blocks have already been generated to fully encrypt the plaintext. This is determined by referring to the block number variable, which can be compared to the known number of blocks required. In the event more blocks are needed, the seed string is concatenated with the block number variable, and the result is input into a hashing function in step 1303 to produce a message digest. In a preferred embodiment the hashing function may be SHA-256, but it will be appreciated that other cryptographic hashing functions may be used. The digest may then be fed into a suitable block encryption algorithm in order to generate a keystream block (step 1304). It will be appreciated that a block encryption algorithm requires the use of a secret key, and therefore it would not be possible to reproduce the keystream blocks without knowledge of this key. In a preferred embodiment, the block encryption algorithm is AES-256, but it will be appreciated that other block encryption algorithms may also be used. Once a keystream block has been produced, it is combined in step 1305 with the next available unencrypted portion of the target plaintext to produce an encrypted form of this portion of the target plaintext (referred to as "ciphertext"), the ciphertext having the same length as the unencrypted portion of plaintext. The portion of target plaintext in question is then replaced by the newly produced ciphertext and the block number variable is incremented by one (step 1306).
At this point, the method loops back to step 1302, where it is once more determined whether sufficient keystream blocks have been generated. However, this time the determination is made using the newly incremented block number variable. In the event that it is still the case that insufficient keystream blocks have been produced, steps 1303-1306 are repeated to encrypt the next portion of target plaintext. In the event that enough keystream blocks have been generated, the ciphertext that is the result of complete encryption of the target plaintext, is applied in step 1307 to the target data file element, replacing the target plaintext contained therein. The target data file element may then be relayed to the cloud-based data manipulation system for storage, along with any necessary associated metadata, including the data required to produce the unique seed string used in the encryption process. In one embodiment, the necessary associated metadata may be transmitted along with the target data file element over the primary channel to the cloud-based data manipulation system. In alternative embodiments, at least some of the necessary associated metadata must be transmitted over the secondary channel to the cloud-based data manipulation system.
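The loop of steps 1301 to 1307 may be sketched as follows. This is an illustration under stated assumptions rather than the claimed implementation: so that the example runs with the Python standard library alone, the keyed block encryption of step 1304 (AES-256 in the preferred embodiment) is replaced by HMAC-SHA256 as a stand-in keyed function, giving a 32-byte keystream block. The `:` delimiter between seed string and block number, and all function names, are likewise hypothetical.

```python
import hashlib
import hmac

BLOCK_SIZE = 32  # length in bytes of one keystream block in this sketch


def keystream_block(seed, block_number, secret_key):
    """Produce one keystream block from the seed string and block number."""
    # Step 1303: hash the seed string concatenated with the block number.
    digest = hashlib.sha256(f"{seed}:{block_number}".encode("utf-8")).digest()
    # Step 1304 (stand-in): a keyed function applied to the digest. The
    # description uses a block cipher such as AES-256 at this point.
    return hmac.new(secret_key, digest, hashlib.sha256).digest()


def encrypt(plaintext, seed, secret_key):
    """XOR the plaintext with successive keystream blocks."""
    blocks_needed = -(-len(plaintext) // BLOCK_SIZE)  # ceiling division
    ciphertext = bytearray()
    block_number = 0                                   # step 1301
    while block_number < blocks_needed:                # step 1302
        block = keystream_block(seed, block_number, secret_key)
        start = block_number * BLOCK_SIZE
        portion = plaintext[start:start + BLOCK_SIZE]
        ciphertext += bytes(p ^ k for p, k in zip(portion, block))  # step 1305
        block_number += 1                              # step 1306
    return bytes(ciphertext)                           # step 1307


decrypt = encrypt  # XOR with the same keystream restores the plaintext
```

Because the combination is an XOR, applying the same function to the ciphertext with the same seed string and secret key returns the original plaintext, and the ciphertext has the same length as the plaintext.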

It will be appreciated that the process of generating the keystream as described above may equally be applied when decrypting a data file. The roles of bespoke plug-in 379 and bespoke extension 459 in the decryption process are analogous to their roles in the encryption process. When it is desired to decrypt a data file, it (or its representative data architecture) will be passed to the bespoke plug-in 379 or extension 459 to perform the decryption. Depending on the embodiment of the invention, the data file elements themselves, or the data architecture generated from the data file elements may be decrypted.

As a data file is being decrypted in the chunk embodiment of the invention, it will be necessary to decrypt each chunk element in turn. When it is desired to decrypt an individual chunk element, the unique seed string for the chunk element will be generated from the initialization vector associated with the chunk element in question, and the keystream will be generated using this seed string along with the secret key known to the user. The keystream may then be applied to the encrypted content of the data file referred to by the chunk element in question to produce the decrypted plaintext version of that portion of the data file. Attention may then turn to the next portion of data file content corresponding to the next chunk element. As the next chunk element will have been encrypted using a different unique seed string, the decryption process must begin afresh on this chunk element.

With respect to the revision embodiment of the invention, decryption may take place after the data architecture has been fully constructed from the revision elements, as the data file is being assembled from the data architecture. In this revision embodiment, each data file piece may be decrypted in turn. Each data file piece refers to a data content string sourced from the content of a specific insertion operation element. In order to decrypt the data content string of a data file piece, the insertion operation element to which the data content string refers is identified, and the unique metadata associated with said insertion operation element is obtained. The unique seed string associated with said insertion operation element is then generated, and the keystream used to encrypt said insertion operation element is thus obtained, using the seed string and the shared secret key. Because the data content string of the piece being decrypted may only refer to a portion of the content of the insertion operation element, it is then necessary to identify the corresponding relevant portion of the keystream. The mitigating values of the piece in question are used to do this, in one example using offset and length values. Matching pieces of ciphertext and keystream are thus isolated and applied to one another to retrieve the plaintext version of the data content string for that piece; a string corresponding to the related data file string constituent of the data file content. As the data content string of the next data file piece will have been encrypted using a different unique seed string, the decryption process must begin afresh on this next piece.
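The isolation of matching portions of ciphertext and keystream by offset and length may be sketched as follows. As an assumption made so that the example is self-contained, the keystream is produced by a simplified counter construction using HMAC-SHA256 in place of the hash and block cipher steps of Figure 13; the function names and the `:` delimiter are hypothetical.

```python
import hashlib
import hmac


def keystream(seed, secret_key, length):
    """Stand-in keystream generator: keyed blocks over seed and counter."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        message = f"{seed}:{counter}".encode("utf-8")
        out += hmac.new(secret_key, message, hashlib.sha256).digest()
        counter += 1
    return bytes(out[:length])


def xor(a, b):
    """Combine two equal-length byte strings with a bitwise XOR."""
    return bytes(x ^ y for x, y in zip(a, b))


def decrypt_piece(element_ciphertext, offset, length, seed, secret_key):
    """Decrypt only the portion of an insertion operation element that a
    piece refers to, using the same slice of the element's keystream."""
    stream = keystream(seed, secret_key, offset + length)
    return xor(element_ciphertext[offset:offset + length],
               stream[offset:offset + length])
```

For an insertion operation element whose plaintext was `b"The quick brown fox"`, a piece with offset 4 and length 5 decrypts to `b"quick"` without the remainder of the element being decrypted.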

In order to assert integrity of the data file and to protect against tampering by a malicious cloud-based data manipulation system, or an equally empowered intruder, Message Authentication Codes keyed with the shared secret key may be periodically added to the data file using the secondary channel. In this way, users can be assured of the authenticity of modifications to the data file. In the revision embodiment, a Message Authentication Code may be recorded as a revision element (hereafter referred to as a MAC revision element), and the periodic addition may comprise transmitting MAC revision elements and standard elements for storage in an interleaved fashion. A MAC revision element comprising a valid MAC that follows a standard revision element confirms the authenticity of the standard revision element.
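A MAC revision element of this kind might be computed and checked as sketched below. The description does not specify the MAC algorithm or exactly which elements a given MAC covers; the sketch assumes HMAC-SHA256 over the serialised standard revision elements that precede the MAC revision element, each prefixed with its length so that element boundaries are unambiguous. The function names are hypothetical.

```python
import hashlib
import hmac


def mac_for_elements(elements, secret_key):
    """Compute a MAC, keyed with the shared secret key, over a sequence of
    serialised revision elements (each given as bytes)."""
    mac = hmac.new(secret_key, digestmod=hashlib.sha256)
    for element in elements:
        mac.update(len(element).to_bytes(4, "big"))  # length prefix
        mac.update(element)
    return mac.digest()


def verify_elements(elements, mac_value, secret_key):
    """Check a retrieved MAC revision element against the elements it covers."""
    expected = mac_for_elements(elements, secret_key)
    return hmac.compare_digest(expected, mac_value)
```

A store that alters, reorders or omits a covered revision element cannot produce a matching MAC without the shared secret key, so verification fails on the client device.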

Another threat scenario may also exist with respect to the embodiment of the invention set out in Figure 3, where the retrieved client application has been surreptitiously compromised, allowing the plaintext content from the client device to be extracted and stealthily relayed. A countermeasure would entail the bespoke plug-in being configured to provide known trusted code and ignore server-supplied code, or instead, to depend on a dedicated application that replicates the cloud-based data manipulation system's client-side functionalities.

The embodiments in the invention described with reference to the drawings comprise a computer apparatus and/or processes performed in a computer apparatus. However, the invention also extends to computer programs, particularly computer programs stored on or in a carrier adapted to bring the invention into practice. The program may be in the form of source code, object code, or a code intermediate source and object code, such as in partially compiled form or in any other form suitable for use in the implementation of the method according to the invention. The carrier may comprise a storage medium such as ROM, e.g. CD ROM, or magnetic recording medium, e.g. a floppy disk or hard disk. The carrier may be an electrical or optical signal, which may be transmitted via an electrical or an optical cable or by radio or other means.

The invention is not limited to the embodiments hereinbefore described but may be varied in both construction and detail.

Claims

<claim-text>1. A method of recording a modification made to the content of a data file stored on a data file store, wherein the data file comprises one or more data file elements, the one or more data file elements collectively representing the whole content of the data file, and wherein each data file element is associated with a unique set of metadata, the method comprising: identifying a target data file element in which the modification is to be recorded, wherein the target data file element is either: an existing data file element, identified based on its associated unique set of metadata; or a new data file element; recording the modification in the identified target data file element; determining whether the target data file element is to be encrypted; if it is determined that the target data file element is to be encrypted, encrypting the target data file element using its associated unique set of metadata and a secret key; and transmitting the target data file element to the data file store for storage thereon.</claim-text> <claim-text>2. The method of claim 1 wherein the step of identifying further comprises: if the target data file element is identified as a new data file element, creating the new data file element.</claim-text> <claim-text>3. The method of any preceding claim wherein the unique set of metadata comprises at least one timeline-specific value.</claim-text> <claim-text>4. The method of claim 3 wherein the timeline-specific value is a probabilistically unique random or pseudorandom sequence generated at a specific time.</claim-text> <claim-text>5. The method of any preceding claim wherein the step of recording further comprises generating a new unique set of metadata for association with the target element, and the step of transmitting further comprises transmitting the associated new unique set of metadata for storage.</claim-text> <claim-text>6. 
The method of claim 5 wherein each data file element comprises a disjoint, contiguous, variable-length constituent of the content of the data file having a predefined maximum length.</claim-text> <claim-text>7. The method of claim 6, wherein the associated unique set of metadata further comprises the length of the data file element.</claim-text> <claim-text>8. The method of any of claims 5 to 7, wherein in the event a first modification is to be recorded in an existing data file element, no other modifications may be recorded in the existing data file element until all the steps of the method have been performed with respect to the first modification.</claim-text> <claim-text>9. The method of claim 3 wherein the data file elements are stored in a chronological history of stored data file elements on the data file store.</claim-text> <claim-text>10. The method of claim 9 wherein the at least one timeline-specific value is the chronological position of the associated data file element in the chronological history of stored data file elements.</claim-text> <claim-text>11. The method of claim 10 wherein the unique set of metadata further comprises a user identifier.</claim-text> <claim-text>12. The method of claim 10 or 11 wherein the unique set of metadata further comprises a session identifier.</claim-text> <claim-text>13. The method of claim 12 wherein the unique set of metadata further comprises the chronological position of the most recent previous operation element recorded using the session identifier comprised in the unique set of metadata.</claim-text> <claim-text>14. 
The method of any of claims 9 to 13, wherein the identifying step further comprises identifying a new data file element as the target file element; wherein each data file element records either an insertion of a data string into the content of the data file or a deletion of a data string from the content of the data file; and wherein the determining step further comprises determining only to encrypt data file elements that record an insertion of a data string.</claim-text> <claim-text>15. The method of any of claims 9 to 14, wherein periodically in the chronological history, one or more snapshot data file elements are created and transmitted to the data file store, the snapshot data file elements comprising the aggregate of all existing data file elements.</claim-text> <claim-text>16. The method of claim 15 wherein the snapshot data file elements comprise a first data file element recording the deletion of the entire content of the data file, and a second data file element recording the insertion of the entire content of the data file.</claim-text> <claim-text>17. The method of any preceding claim, wherein the encrypting step comprises encrypting the data file element using a stream cipher.</claim-text> <claim-text>18. The method of claim 17, wherein the keystream for use in the stream cipher encryption is generated using a seed string derived from the unique set of metadata associated with the data file element and a secret key.</claim-text> <claim-text>19. The method of claim 18, wherein the keystream is generated from the seed string by running an iterative block cipher encryption algorithm directly on the seed string.</claim-text> <claim-text>20. The method of claim 18, wherein a message digest is produced by running a hashing algorithm on the seed string, and the keystream is then generated from the message digest by running an iterative block cipher encryption algorithm on the message digest.</claim-text> <claim-text>21. 
The method of any of claims 9 to 16 wherein, periodically in the chronological history, a Message Authentication Code data file element comprising a Message Authentication Code keyed with the secret key is created and transmitted to the data file store in order to confirm the authenticity of the other data file elements.</claim-text> <claim-text>22. The method of any preceding claim wherein the data file store is located remotely from where the method is performed, and is accessed over a network.</claim-text> <claim-text>23. The method of claim 22 wherein the network comprises the Internet.</claim-text> <claim-text>24. The method of claim 22 or 23 wherein the modification is made to the content of the data file via a client application retrieved over the network from a remote server, and executed from within a web browser, and wherein the step of encrypting is performed via a plug-in embedded within the web browser.</claim-text> <claim-text>25. The method of claim 22 or 23 wherein the modification is made to the content of the data file through the use of software, and wherein the step of encrypting is performed via an extension to the software or via a separate application that communicates with both the software and the data file store.</claim-text> <claim-text>26. The method of any preceding claim: wherein the method further comprises the initial step of determining whether the modification is to be recorded as a plurality of parts in a plurality of data file elements, each part being recorded in a separate corresponding data file element; wherein the steps of identifying, recording and encrypting are performed for each of the plurality of data file elements; and wherein the step of transmitting comprises transmitting the plurality of data file elements together as a set of data file elements.</claim-text> <claim-text>27. 
A method of decrypting a data file that has been encrypted in accordance with any of claims 5 to 8, comprising: retrieving the data file from the data file store, along with the unique sets of metadata associated with each data file element; dividing the data file into the data file elements based on the unique sets of metadata; decrypting each data file element using the associated unique set of metadata and the secret key.</claim-text> <claim-text>28. A method of decrypting a data file that has been encrypted in accordance with any of claims 9 to 16, comprising: retrieving all the data file elements in the chronological history; constructing a data architecture from the data file elements by applying each data file element to the data architecture in turn, in accordance with their chronological order, wherein the constructed data architecture comprises one or more pieces, each piece referencing at least a portion of a data file element; and decrypting each piece of the data architecture using the referenced portion of the data file element, the unique set of metadata associated with the data file element, and the secret key.</claim-text> <claim-text>29. The method in accordance with any preceding claim wherein collaborating devices in disparate locales may access the content of the data file concurrently, each device having a separate connection to the data file store.</claim-text> <claim-text>30. The method of claim 29, further comprising the subsequent step of relaying the transmitted target data file element from the data file store to all collaborating devices.</claim-text> <claim-text>31. A computer readable storage medium carrying a computer program stored thereon, said program comprising computer executable instructions adapted to perform the method steps of any of Claims 1 to 30 when executed by a processing module.</claim-text> <claim-text>32. 
A device for recording a modification made to the content of a data file stored on a data file store, wherein the data file comprises one or more data file elements, the one or more data file elements collectively representing the whole content of the data file, and wherein each data file element is associated with a unique set of metadata, the device comprising: means for identifying a target data file element in which the modification is to be recorded, wherein the target data file element is either: an existing data file element, identified based on its associated unique set of metadata; or a new data file element; means for recording the modification in the identified target data file element; means for determining whether the target data file element is to be encrypted; means for encrypting the target data file element using its associated unique set of metadata and a secret key if it is determined that the target data file element is to be encrypted; and means for transmitting the target data file element to the data file store for storage thereon.</claim-text> <claim-text>33. The device of claim 32 wherein the means for identifying further comprises: means for creating the new data file element if the target data file element is identified as a new data file element.</claim-text> <claim-text>34. The device of claim 32 or 33 wherein the unique set of metadata comprises at least one timeline-specific value.</claim-text> <claim-text>35. The device of claim 34 wherein the timeline-specific value is a probabilistically unique random or pseudorandom sequence generated at a specific time.</claim-text> <claim-text>36. The device of any of claims 32 to 35 wherein the means for recording further comprises means for generating a new unique set of metadata for association with the target element, and the means for transmitting further comprises means for transmitting the associated new unique set of metadata for storage.</claim-text> <claim-text>37. 
The device of claim 36 wherein each data file element comprises a disjoint, contiguous, variable-length constituent of the content of the data file having a predefined maximum length.</claim-text> <claim-text>38. The device of claim 37, wherein the associated unique set of metadata further comprises the length of the data file element.</claim-text> <claim-text>39. The device of any of claims 36 to 38, wherein in the event a first modification is to be recorded in an existing data file element, no other modifications may be recorded in the existing data file element until all the steps of the method have been performed with respect to the first modification.</claim-text> <claim-text>40. The device of claim 34 wherein the data file elements are stored in a chronological history of stored data file elements on the data file store.</claim-text> <claim-text>41. The device of claim 40 wherein the at least one timeline-specific value is the chronological position of the associated data file element in the chronological history of stored data file elements.</claim-text> <claim-text>42. The device of claim 41 wherein the unique set of metadata further comprises a user identifier.</claim-text> <claim-text>43. The device of claim 41 or 42 wherein the unique set of metadata further comprises a session identifier.</claim-text> <claim-text>44. The device of claim 43 wherein the unique set of metadata further comprises the chronological position of the most recent previous operation element recorded using the session identifier comprised in the unique set of metadata.</claim-text> <claim-text>45. 
The device of any of claims 34 to 44, wherein the means for identifying further comprises means for identifying a new data file element as the target file element; wherein each data file element records either an insertion of a data string into the content of the data file or a deletion of a data string from the content of the data file; and wherein the means for determining further comprises means for determining only to encrypt data file elements that record an insertion of a data string.</claim-text> <claim-text>46. The device of any of claims 34 to 45, further comprising means for creating one or more snapshot data file elements periodically in the chronological history, the snapshot data file elements comprising the aggregate of all existing data file elements.</claim-text> <claim-text>47. The device of claim 46 wherein the snapshot data file elements comprise a first data file element recording the deletion of the entire content of the data file, and a second data file element recording the insertion of the entire content of the data file.</claim-text> <claim-text>48. The device of any of claims 32 to 47, wherein the means for encrypting comprises means for encrypting the data file element using a stream cipher.</claim-text> <claim-text>49. The device of claim 48, wherein the keystream for use in the stream cipher encryption is generated using a seed string derived from the unique set of metadata associated with the data file element and a secret key.</claim-text> <claim-text>50. The device of claim 49, wherein the keystream is generated from the seed string by running an iterative block cipher encryption algorithm directly on the seed string.</claim-text> <claim-text>51. The device of claim 49, wherein a message digest is produced by running a hashing algorithm on the seed string, and the keystream is then generated from the message digest by running an iterative block cipher encryption algorithm on the message digest.</claim-text> <claim-text>52. 
The device of any of claims 40 to 47 further comprising means for creating, periodically in the chronological history, a Message Authentication Code data file element comprising a Message Authentication Code keyed with the secret key, and means for transmitting the Message Authentication Code data file element to the data file store in order to confirm the authenticity of the other data file elements.</claim-text> <claim-text>53. The device of any of claims 32 to 52 wherein the data file store is located remotely from the device and is accessed by the device over a network.</claim-text> <claim-text>54. The device of claim 53 wherein the network comprises the Internet.</claim-text> <claim-text>55. The device of claim 53 or 54 comprising a web browser and a plug-in embedded in the web browser, wherein the modification is made to the content of the data file via a client application retrieved over the network from a remote server, and executed from within the web browser, and wherein the means for encrypting is performed via the plug-in embedded within the web browser.</claim-text> <claim-text>56. The device of claim 53 or 54, comprising software and either an extension to the software or a separate application that communicates with both the locally stored software and the data file store, wherein the modification is made to the content of the data file through use of the software, and wherein the means for encrypting is performed either via the extension to the software or via the separate application.</claim-text> <claim-text>57. 
The device of any of claims 32 to 56: wherein the device further comprises means for initially determining whether the modification is to be recorded as a plurality of parts in a plurality of data file elements, each part being recorded in a separate corresponding data file element; and wherein the means for transmitting comprises means for transmitting the plurality of data file elements together as a set of data file elements.</claim-text> <claim-text>58. A device for decrypting a data file that has been encrypted in accordance with any of claims 5 to 8, wherein the device is connected to the data file store over a network, the device comprising: means for retrieving the data file from the data file store, along with the unique sets of metadata associated with each data file element; means for dividing the data file into the data file elements based on the unique sets of metadata; and means for decrypting each data file element using the associated unique set of metadata and the secret key.</claim-text> <claim-text>59. A device for decrypting a data file that has been encrypted in accordance with any of claims 40 to 47, wherein the device is connected to the data file store over a network, the device comprising: means for retrieving all the data file elements in the chronological history; means for constructing a data architecture from the data file elements by applying each data file element to the data architecture in turn, in accordance with their chronological order, wherein the constructed data architecture comprises one or more pieces, each piece referencing at least a portion of a data file element; and means for decrypting each piece of the data architecture using the referenced portion of the data file element, the unique set of metadata associated with the data file element, and the secret key.</claim-text> <claim-text>60. 
The device in accordance with any of claims 32 to 59 wherein the device is one of a plurality of collaborating devices in disparate locales that may access the content of the data file concurrently, each device having a separate connection to the data file store.</claim-text> <claim-text>61. The device of claim 60, further comprising means for receiving relayed data from the data file store, the relayed data comprising one or more target data file elements previously transmitted to the data file store by other collaborating devices.</claim-text> <claim-text>62. A method as substantially described herein, with reference to or as illustrated in accompanying figures 2 to 13.</claim-text> <claim-text>63. An apparatus as substantially described herein, with reference to or as illustrated in accompanying figures 2 to 13.</claim-text>
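By way of illustration only, the keystream derivation recited in claims 18 and 20 (and correspondingly claims 49 and 51) may be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the metadata field names (`position`, `user`, `session`), the seed serialisation, and the function names are all hypothetical. SHA-256 is assumed as the hashing algorithm, and because the Python standard library provides no block cipher, HMAC-SHA-256 run in counter mode stands in for the iterative block cipher encryption algorithm.

```python
import hashlib
import hmac


def make_seed(metadata: dict, secret_key: bytes) -> bytes:
    """Build a seed string from an element's unique metadata and the secret key.

    The serialisation format here is an assumption made for illustration.
    """
    # Sort the field names so identical metadata always yields identical bytes.
    fields = "|".join(f"{name}={metadata[name]}" for name in sorted(metadata))
    return fields.encode("utf-8") + b"|" + secret_key


def keystream(seed: bytes, length: int) -> bytes:
    """Hash the seed, then expand the digest into `length` keystream bytes.

    HMAC-SHA-256 over an incrementing counter is a stand-in for the
    iterative block cipher step; it is not a block cipher.
    """
    digest = hashlib.sha256(seed).digest()
    out = bytearray()
    counter = 0
    while len(out) < length:
        block = hmac.new(digest, counter.to_bytes(8, "big"), hashlib.sha256)
        out += block.digest()
        counter += 1
    return bytes(out[:length])


def xor_element(data: bytes, metadata: dict, secret_key: bytes) -> bytes:
    """Encrypt or decrypt an element by XOR with its metadata-derived keystream."""
    ks = keystream(make_seed(metadata, secret_key), len(data))
    return bytes(a ^ b for a, b in zip(data, ks))


if __name__ == "__main__":
    key = b"example secret key"
    meta = {"position": 7, "user": "alice", "session": "s1"}
    ciphertext = xor_element(b"inserted text", meta, key)
    # Applying the same keystream a second time recovers the plaintext.
    print(xor_element(ciphertext, meta, key))
```

Because each data file element carries a unique set of metadata, each element in this sketch receives a distinct keystream under the same secret key, which is the property a stream cipher needs in order to reuse that key safely across elements.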