WO2014118560A1 - Method and system for data storage - Google Patents
- Publication number
- WO2014118560A1 (application PCT/GB2014/050269)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- domain
- store
- decomposed
- data
- implemented method
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/174—Redundancy elimination performed by the file system
- G06F16/1748—De-duplication implemented within the file system, e.g. based on file segments
- G06F16/1756—De-duplication implemented within the file system, e.g. based on file segments based on delta files
Definitions
- The invention generally relates to fields of computing concerned with structured data storage and retrieval.
BACKGROUND OF THE INVENTION
- Parsing (decomposing) of data aims to break a data block into smaller chunks by following a set of rules, so that it is more easily interpreted, managed, or transmitted by a computer. These rules are typically aimed at matching strings or patterns in stored source data.
- Many tools for structured data storage exist but, typically, require a model to be provided before data can be housed.
- Some data storage tools have data import and export facilities, and may even allow a model to be derived from source during import.
- These import and/or export facilities do not cater for data sources as complex as, for example, a vmdk file.
- A vmdk file may store data associated with a virtual disk component of a VMware virtual computer or virtual session.
- The term virtual is used to mean that the disk or computer does not correspond directly to the underlying physical hardware of the disk or computer, which may be shared amongst multiple virtual computers or disks.
- The virtual disk may store an operating system, program files and user data for the virtual computer, which is managed by an administrator who, for example, may control the software programs supported on each virtual computer; these may be different for different groups of users.
- The management of data stores such as virtual disks may be a time-consuming activity, especially where multiple data stores are being managed.
- Data storage tools may not be especially suited to storage of the types of relationships found in complex data stores, such as vmdk files, nor do data storage tools expressly provide a historical perspective to data, unless the data itself is supplemented with historical information.
- Data storage tools often have associated reporting capabilities and are often heavily dependent on the user expressing not only the required data, but also how that data should be retrieved in terms of the relationships between different aspects of the data. Again, these reporting systems tend not to provide a historical perspective, and there is a tendency to surmise periods and suggest trends, often without reference to actual events on a time line.
- Data analytics aims to identify correlations and patterns in data, often using complex algorithms, but has yet to be applied with success on data sources as large and complex as a vmdk file.
- A computer implemented method of storing data comprising: accessing a domain specification for the data comprising structural metadata indicative of a structure of the data and property metadata indicative of properties of at least one entity type; selecting an instance of the entity type from the data using the structural metadata; determining whether a decomposed version of said instance of the entity type exists in a domain store; and selectively storing the decomposed version in the domain store according to the determination of whether the decomposed version already exists in the domain store.
- the data resides in a data store, and wherein the determining is based on a comparison of at least one value of at least one of the properties of the decomposed version and the property metadata.
- the decomposed version is selectively stored in the domain store such that the domain store is a de-duplicated version of a data store.
- The method stores the decomposed version in the domain store and adds a tracking entry to tracking information in a tracking store associated with the domain store; or, if the decomposed version exists in the domain store, the method determines whether a tracking entry for the decomposed version exists and, if not, adds a tracking entry for the decomposed version to the tracking store.
- the tracking entry for the decomposed version exists in a preceding decomposition of the data store.
- the method may further perform determining a key; and determining whether the decomposed version exists in the domain store based on one or more properties of the instance of the entity type and a key definition of the key provided by the structural metadata.
- the key definition may comprise a key strategy for combining the plurality of properties to form the key.
- The method may also provide for comparing the key value of the key against one or more keys each associated with decomposed versions of instances in the domain store to determine whether the decomposed version exists in the domain store.
- one or more property values associated with the instance of the entity type and one or more property values of a decomposed version in the domain store have the same key value.
- The method may also include adding the decomposed version to the domain store and a tracking entry for the decomposed version to the tracking store when the properties associated with the decomposed version differ from the properties of the one or more decomposed versions in the domain store having the same key value.
- the properties associated with the instance of the entity type match the properties of the decomposed versions in the domain store, determining whether a tracking entry exists in the tracking store that is associated with a preceding decomposition for the decomposed version in the domain store.
- The method adds a tracking entry for the decomposed version to the tracking store.
- the tracking information identifies that each said decomposed version is a version having the same key value.
- Adding the tracking entry comprises determining an identifier for the decomposed version and adding the identifier to the tracking information.
- The identifier is an AutoID value.
- the tracking information is structured as a tree based on inheritance information provided by the structural metadata.
- the tree may be a graph, and optionally a structure of the graph is based on inheritance information provided by the structural metadata.
- the inheritance information may identify a parentage for a plurality of entity types.
- the tracking information comprises an identifier of one or more parent entity types.
- the method may also include determining an identifier associated with the decomposition and storing the identifier in a log.
- The identifier is typically a NodeID.
- the tracking entry for each decomposed version may comprise the identifier for the decomposition.
- the log is structured as a tree and the identifier is inserted in the decomposition log associated with a parent decomposition. This parent decomposition may be identified based upon context information provided by a user.
- the method may also include determining a domain type associated with the data store and selecting the domain specification from amongst a plurality of domain specifications based on the determined domain type.
- A computer implemented method of decomposing a data store comprising: accessing a domain specification for the data store comprising structural metadata indicative of a structure of the data store and property metadata indicative of properties of at least one entity type; selecting an instance of the entity type from the data using the structural metadata; determining whether a decomposed version of said instance of the entity type exists in a domain store; and selectively storing the decomposed version in the domain store according to the determination of whether the decomposed version already exists in the domain store.
- a computer implemented method of forming a composite image of at least part of a data store comprising: selecting decomposed instances of entity types in a domain store; determining locations in the data store of the decomposed instances of entity types from the tracking entries associated with the domain store; and storing instances of the entity types in the data store at locations determined by the tracking information.
- Figure 1 illustrates a decomposition system according to an embodiment of the invention
- Figure 2 illustrates a schematic representation of an exemplary structure of a virtual disk
- Figure 3 illustrates a computer implemented method of storing data according to an embodiment of the invention
- Figure 4 illustrates decomposed instances of entity types and associated tracking information relationships according to an embodiment of the invention.
- Figure 5 illustrates decomposed instances of entity types and tracking information after decomposition of a plurality of virtual disks according to an embodiment of the invention
- Figure 6 illustrates a computer implemented method of composing a data store according to an embodiment of the invention
- Figure 7 illustrates a tree structure of lists of NodeIDs according to an embodiment of the invention
- Figure 8 illustrates an embodiment of a computer implemented method of forming a composite image according to an embodiment of the invention
- Figure 9 shows a computer implemented method of storing data, in accordance with another embodiment of the present invention.
- FIG. 1 illustrates a decomposition system 100 according to an embodiment of the invention.
- the decomposition system 100 is arranged to store data or decompose a data store 150 that is typically, but not necessarily, a virtual disk.
- the decomposition system 100 comprises a decomposition engine 110 and a domain library 120.
- the decomposition engine 110 receives context information 155 from the user regarding the decomposition of the data store 150, or alternatively the context information 155 can relate to the decomposition of a block of data to be stored.
- the decomposition system 100 decomposes the data store 150, or processes decomposed data, the results of which are stored in a domain store 160 along with tracking information that is stored in a tracking store 170 and a decomposition log 180.
- the data store 150 is a store of data having a particular format.
- the data store 150 may be any type of structured data, such as medical records, genome sequence information, financial market data, system audit or logging information or a physical disk i.e. a hard drive or optical drive storing one or more files.
- the data store 150 is a virtual disk.
- the virtual disk emulates a data storage medium, such as a hard drive or optical drive, storing one or more files such as binary or text files.
- An example of a virtual disk is a file stored in a Virtual Machine Disk (VMDK) format developed by VMware, Inc.
- other file formats may be used for virtual disk storage.
- the decomposition system 100 is arranged external to a virtual environment associated with the data store 150.
- the decomposition system 100 may not reside or execute within a virtual machine or computer which accesses the data store 150 which can be a virtual disk.
- the decomposition system 100 may be arranged to operate on the data store 150 when it is in a steady state i.e. the data is at rest such as when the virtual environment is not accessing or operating on the data store 150.
- The decomposition system 100 may be arranged within a virtual environment before decomposition operations are performed, with the decomposed data stored in the domain store 160 and tracking store 170, which may be located externally to the virtual environment being decomposed.
- the decomposition system 100 comprises the domain library 120.
- the domain library 120 provides information relating to the structure and properties of the data store 150, such that the decomposition is performed based on the knowledge of the structure and properties of the data store 150, rather than agnostically.
- the domain library 120 comprises one or more domain specifications 121, 122, 123.
- Although the domain library 120 is shown in Figure 1 as comprising three domain specifications 121, 122, 123, it will be realised that this is merely exemplary.
- the domain specification 121 will be described as relating to a New Technology File System (NTFS), developed by Microsoft, Inc.
- the additional domain specifications 122, 123 may each relate to a different file system type.
- Each domain specification 121, 122, 123 stores metadata providing information associated with the data store 150.
- the metadata includes structural metadata which provides information indicative of the structure of the data store 150, and property metadata which provides information on the properties of entities of the data store 150.
- Such metadata may also include expressions intended to advise processing such that processing may remain as flexible as possible. This is especially true for processing of text based stores where metadata may indicate how to: detect opening of a start tag, detect closing of start tag, detect an end tag, collect a tag name, collect attributes.
- Valid property delimiter character and entity delimiter character advice may also be found, along with indication of presence of property names on the initial row of a text store.
- Metadata may include many indicators that allow a domain specification to be correctly associated with a data store, for example, the expected file name of the data store, whether the data store is binary or text based.
- metadata might include details of "signature bytes" to be found within the file.
- metadata might include the type of text store, for example, a tag based type such as XML or HTML, or fixed width or delimited types.
- Metadata related to text based stores may also include many sets of expected, perhaps initial, characters, and/or expected root tag.
- a domain specification might also be a default domain, that is, a domain absent of structural or property metadata or associated entities, instead merely serving as a default set of indications as to how to process a particular file or set of files such that a more specific domain specification might be dynamically created. Consequently, domain specifications may be indicated as being static, or dynamically created. Hashes or checksums of property names may also be used in identifying domains, by comparing a hash or checksum of property names in the data store with those given in the domain specifications.
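The comparison of property-name hashes mentioned above might look like the following minimal Python sketch. The function names, the choice of SHA-256, the separator character and the `domain_hashes` mapping are all illustrative assumptions; the text does not prescribe a particular hash or layout.

```python
import hashlib


def property_names_hash(names):
    """Hash an ordered list of property names (hypothetical helper)."""
    # Join with a unit separator so ["ab", "c"] and ["a", "bc"] hash differently.
    joined = "\x1f".join(names)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def match_domain(header_row, delimiter, domain_hashes):
    """Return the domain whose stored hash matches the header row, else None.

    `domain_hashes` maps a property-name hash to a domain name and stands in
    for the hashes held in the domain specifications (an assumption).
    """
    names = [name.strip() for name in header_row.split(delimiter)]
    return domain_hashes.get(property_names_hash(names))


# A delimited text store whose first row carries the property names.
known = {property_names_hash(["id", "name", "price"]): "products"}
domain = match_domain("id, name, price", ",", known)
```

When no hash matches, the function returns `None`, which is the point at which a default domain could be used to create a more specific domain specification dynamically.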
- the structural metadata defines entity types forming the structure of data store 150.
- Entity types, or entities, are essentially identifiable items of data, code or data structures; for the remainder of this specification, "entity" and "entity type(s)" are used interchangeably with identifiable items of data, code or data structures.
- the structural metadata may define each entity type which may form the data store 150.
- some embodiments of the structural metadata define one or more of: an identifier (ID); an entity type; whether an entity type is singular or plural i.e.
- a child entity type has one or more properties based upon properties of the parent entity type, such as a start position of the child entity type. It should be noted that where structural metadata relates to a dynamically created domain, the entity type ID is typically in a separate range to that of the entities of pre-defined domains. This allows well known domains to have fixed ID values wherever they are used in contrast to dynamically created domains that are likely to be different at different points of use.
- the structural metadata also defines whether the entity type has one or more keys.
- a key is a value determined from the entity type which is used for de-duplication of the data store 150.
- An entity type may have zero, one or more keys. For example, if an entity type is plural (there may be multiple instances of the entity type within the data store) then a key for the entity type may be an index value for the entity type, for example indicative of an element in an array holding the plural entity types. If there are multiple keys for an entity type then the structural metadata may also define a key strategy for the entity type. The key strategy provides a strategy for combining the multiple keys to form a single data value. For example, in some embodiments a key value for an entity type may be defined as an int64 value.
- the key strategy may define an amount of bit shifting for one or more keys such that the keys may be combined to form the int64 value.
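A bit-shifting key strategy of this kind can be sketched in Python as follows. The function name `combine_keys`, the `shifts` parameter and the 32/32 split in the example are assumptions made for illustration; the text only states that keys may be shifted and combined into an int64 value.

```python
def combine_keys(keys, shifts):
    """Combine non-negative integer keys into a single signed 64-bit value.

    `shifts` gives the left shift (in bits) applied to each key; it stands in
    for a key strategy read from the structural metadata (hypothetical).
    """
    if len(keys) != len(shifts):
        raise ValueError("one shift per key is required")
    combined = 0
    for key, shift in zip(keys, shifts):
        if key < 0:
            raise ValueError("keys must be non-negative")
        combined |= key << shift
    if combined >= 1 << 63:
        raise OverflowError("combined key does not fit in a signed int64")
    return combined


# Example: one key in the high 32 bits, a second key in the low 32 bits.
key = combine_keys([3, 17], [32, 0])
```

The caller is responsible for choosing shifts that do not make the keys overlap; the sketch only checks that the result fits in 64 bits.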
- a checksum or hash representing many values may be used as property values, index values, or otherwise.
- the key for an entity type is used to determine whether an instance of an entity type is a duplicate. For dynamically created domains, it is necessary to decide upon a scheme for determination of a key as properties are not known in advance.
- Such schemes may include, but are not limited to, one or more of: use of an index representing order of appearance of the entity type within its parent; a hash or checksum of all properties; the value of the first numeric property containing key or id in its property name; a hash of the value of the first string property containing key or ID; and the value of the first numeric property; or even the value of just the first property.
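For a dynamically created domain, some of the schemes listed above could be tried in turn, as in this Python sketch. The order of preference, the use of CRC-32 as the hash and the function names are assumptions; the text lists the schemes without ranking them.

```python
import zlib


def _mentions_key_or_id(name):
    lowered = name.lower()
    return "key" in lowered or "id" in lowered


def derive_key(index_in_parent, properties):
    """Choose a key for an entity whose properties are not known in advance.

    `properties` maps property names to values in order of appearance.
    """
    # Scheme: value of the first numeric property with key/id in its name.
    for name, value in properties.items():
        if (_mentions_key_or_id(name) and isinstance(value, int)
                and not isinstance(value, bool)):
            return value
    # Scheme: hash of the first string property with key/id in its name.
    for name, value in properties.items():
        if _mentions_key_or_id(name) and isinstance(value, str):
            return zlib.crc32(value.encode("utf-8"))
    # Scheme: index representing order of appearance within the parent.
    return index_in_parent


numeric = derive_key(4, {"name": "a", "user_id": 42})
hashed = derive_key(4, {"OrderKey": "abc"})
fallback = derive_key(4, {"name": "a"})
```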
- domain store 160 for the compatible entity types occurs in just one of the domains even though the entity types are from two different domains.
- domain store 160 for the entity type in the "local" domain will be empty, whilst the domain store 160 for the compatible entity type in the other ("foreign") domain will likely increase slightly in size.
- the compatible entity types are free to extract their data differently so long as their properties are in the same order, or can be mapped by name, and are of the same property types.
- the entity type in the "local" domain need not express a full set of properties as long as all of its properties can be found in the entity type of the "foreign" domain (where the storage actually occurs).
- the tracking store 170 is used in the same way as for only one "local" domain and can be found for the entity types of both domains. Many domains can express "foreign" storage in this way, each domain benefitting the other with further data that may beneficially cause further de-duplication.
- the property metadata defines, for each entity type, one or more properties of the entity type.
- the property metadata may define one or more of: an ID number for the property; a name of the property; a data type of the property i.e. a data type required to hold the property's value; a data format of the property such as length and whether the property has any decimal places; whether the property is allowed to be null; and a location of the property value within the entity data i.e.
- the domain specification 121 is associated with the NTFS file system.
- the NTFS domain specification comprises structural and property metadata associated with, amongst other entities, a Dos Partition Entry (DPE), boot sector and Master File Table (MFT) entities associated with NTFS.
- the domain specification 121 comprises structural metadata for the DPE which defines the structure of the NTFS DPE.
- the ID of the DPE entity type may be defined as 1, although it will be realised that other values may be used for the DPE entity type and other entity types may assume values other than the exemplary values provided.
- the structural metadata may indicate that the DPE is singular i.e. occurs only once in the data store 150 and a start position of the DPE within the data store 150.
- the DPE entity type may be plural and, in some embodiments, one of the plural DPE entities may be selected.
- the start position may be indicated in bytes i.e. byte 446.
- the domain specification 121 further comprises property metadata for the DPE.
- the DPE may be associated with one or more properties, such as bootable flag, start and end Cylinder-Head-Sector (CHS) addresses, partition type, start Lowest Bit address and size in sectors.
- some operating systems do not use CHS addresses and therefore some embodiments may omit the start and end CHS addresses.
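Reading the DPE properties from the start position given above can be sketched with Python's `struct` module. The 16-byte layout used here is the conventional DOS/MBR partition entry (flag, start CHS, type, end CHS, 32-bit start address, 32-bit size), which is an assumption since the text names the properties but not their offsets; the function and dictionary key names are hypothetical.

```python
import struct

DPE_OFFSET = 446           # start position of the DPE, as given in the text
DPE_FORMAT = "<B3sB3sII"   # little-endian, no padding (assumed layout)
DPE_SIZE = struct.calcsize(DPE_FORMAT)


def read_dpe(disk_bytes):
    """Return the DPE properties found at DPE_OFFSET as a dictionary."""
    flag, chs_start, ptype, chs_end, start, sectors = struct.unpack_from(
        DPE_FORMAT, disk_bytes, DPE_OFFSET)
    return {
        "bootable": flag == 0x80,
        "chs_start": chs_start,
        "partition_type": ptype,
        "chs_end": chs_end,
        "start_address": start,
        "size_in_sectors": sectors,
    }


# Build a 512-byte sector with one illustrative entry and read it back.
sector = bytearray(512)
struct.pack_into(DPE_FORMAT, sector, DPE_OFFSET,
                 0x80, b"\x01\x01\x00", 0x07, b"\xfe\xff\xff", 2048, 204800)
dpe = read_dpe(bytes(sector))
```

An embodiment that omits the CHS addresses could simply drop the two `chs_*` entries from the returned dictionary.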
- the MFT structural metadata defines an ID of the MFT entity type to be, for example, a value of 5.
- the MFT structural metadata comprises data indicative of an end term of the MFT, since the MFT is plural i.e. forms an array.
- the MFT structural metadata further comprises a key in order to allow for de-duplication of the data store 150.
- the key may be defined as an index value of the MFT within the virtual disk, wherein the index value indicates an order of appearance of the MFT within the virtual disk.
- the domain store 160 typically stores decomposed data from the data store 150.
- the domain store 160 stores decomposed instances of entity types which are without structure. By decomposed it is meant that, should the data store 150 include a plurality of copies of an instance of an entity type, a decomposed instance of the entity type will only reside in one copy in the domain store 160. In the remainder of this specification the terms "decomposed instance of the entity type" and "decomposed version" will be used interchangeably. If multiple data stores are decomposed comprising the same entity type, the decomposed instance of the entity type will only be stored in the domain store 160 once.
- the tracking store 170 stores a structure of the data store 150 i.e.
- the decomposition log 180 stores information about each data store 150 that is decomposed.
- the decomposition log 180 stores an identifier (NodeID) for each decomposed data store and information indicative of the relationship of that data store to other decomposed data stores.
- the decomposition log 180 stores information identifying whether the decomposed data store is a child of a previously decomposed data store, or a sibling of a previously decomposed data store.
- the information in the decomposition log 180 may be determined, at least in part, from the context information 155 provided by the user.
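The parent, child and sibling relationships held in the decomposition log can be pictured as a small tree of NodeIDs. The following Python sketch is one possible in-memory representation; the class name, its methods and the example NodeIDs are assumptions, not the structure used by the described system.

```python
class DecompositionLog:
    """Tree of NodeIDs; each decomposition may record a parent decomposition."""

    def __init__(self):
        self._parent = {}  # NodeID -> parent NodeID (None for a root)

    def add(self, node_id, parent_id=None):
        if node_id in self._parent:
            raise ValueError(f"NodeID {node_id} already logged")
        if parent_id is not None and parent_id not in self._parent:
            raise KeyError(f"unknown parent NodeID {parent_id}")
        self._parent[node_id] = parent_id

    def ancestors(self, node_id):
        """Return the parent, grandparent, ... of node_id, nearest first."""
        chain = []
        current = self._parent[node_id]
        while current is not None:
            chain.append(current)
            current = self._parent[current]
        return chain

    def are_siblings(self, a, b):
        """True when two distinct decompositions share the same parent."""
        return a != b and self._parent[a] == self._parent[b]


log = DecompositionLog()
log.add(1)               # e.g. a first decomposition
log.add(2, parent_id=1)  # children of decomposition 1
log.add(3, parent_id=1)
log.add(4, parent_id=2)
```

In this sketch the parent would be chosen from the context information supplied by the user; two root decompositions are also reported as siblings, which is a simplification.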
- FIG. 2 is a schematic representation of an exemplary structure of an NTFS virtual disk 200. It will be realised that the structure of the virtual disk 200 is only a portion of the NTFS virtual disk and is intended as a simplified representation for the purpose of explanation.
- the virtual disk 200 is an embodiment of the data store 150, and comprises a DPE 210 which is followed by a boot sector 220 and a plurality of MFTs (MFT0-3) 230, 240, 250, 260.
- a computer implemented method 300 of storing data according to an embodiment of the invention and illustrated in Figure 3 will now be explained.
- the method 300 may be performed by the system 100 shown in Figure 1 and is suitable for decomposing and storing data in the data store 150 which by way of example is the virtual disk 200.
- Prior to the method 300 being executed, the virtual disk 200 is provided.
- the virtual disk 200 is structured according to NTFS as previously described.
- a user of the method provides the context information 155 to assist the decomposition.
- The context information may indicate, for example, that the decomposition is of a virtual disk holding a Golden Image of an installation.
- a Golden Image is a "template" installation i.e. comprising an operating system and a basic set of software and/or files.
- A file in the data store 150 (the virtual disk 200) is assessed to determine whether it is a binary or a text-based file; in the latter case, the encoding used for the file may also be determined.
- a domain of the virtual disk 200 is determined.
- the domain indicates the file system used by the virtual disk 200.
- the domain is determined by analysis of one or more attributes of the virtual disk 200.
- the one or more attributes may be compared against the one or more domain specifications 121, 122, 123 stored in the domain library 120.
- the decomposition engine 110 accesses each domain specification 121, 122, 123 in turn and then compares one or more attributes of the virtual disk 200 against one or more reference attributes specified in the respective domain specification 121, 122, 123.
- the NTFS domain specification 121 may indicate a predetermined byte position within the virtual disk 200 which is to be compared against a reference attribute indicative of the NTFS file system.
- the position may be byte 450 in the VMDK file and the reference attribute may be 7, which is indicative of the VMDK file storing the NTFS virtual disk 200.
- Other byte positions and reference values may be associated with other types of file system, as will be appreciated.
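The comparison of a byte position against a reference attribute could be sketched in Python as below. The byte position 450 and reference value 7 come from the text; the `DOMAIN_SIGNATURES` table and the function name are hypothetical stand-ins for the domain library.

```python
# Domain name -> (byte position, reference value). Only the NTFS entry is
# taken from the text; further entries would come from other specifications.
DOMAIN_SIGNATURES = {
    "NTFS": (450, 0x07),
}


def detect_domain(disk_bytes, signatures=DOMAIN_SIGNATURES):
    """Return the first domain whose reference attribute matches, else None."""
    for name, (position, reference) in signatures.items():
        if position < len(disk_bytes) and disk_bytes[position] == reference:
            return name
    return None


image = bytearray(512)
image[450] = 0x07
domain = detect_domain(bytes(image))
```

A real domain specification may compare several attributes (file name, signature bytes, text encoding) rather than a single byte.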
- the method 300 has accessed a domain specification for the data store 150, wherein the domain specification comprises structural metadata indicative of a structure of the data store 150 and property metadata indicative of properties of at least one entity type in the data store 150.
- the domain store 160 is created.
- the domain store 160 is created to store a decomposed version of the virtual disk 200. If the decomposition is to be added to an existing domain store 160 then step 315 may be omitted or may comprise selecting the existing domain store 160.
- an identifier is created to identify the decomposition of the virtual disk 200 within the domain store 160.
- the identifier uniquely identifies the decomposition of the virtual disk 200.
- the identifier may be referred to as a NodeID.
- the NodeID may be assigned on a random or pseudo-random basis. However, for the purpose of illustration a NodeID of 1 for a first decomposition will be utilised, although it will be realised that this is merely exemplary.
- an instance of the entity type is selected from the virtual disk 200.
- the structural metadata from the domain specification is used in order to, for example, select a first instance of the entity type present within the virtual disk 200 during a first iteration of step 325.
- the DPE 210 may be selected.
- Step 325 may comprise determining the location of the instance of the entity type within the virtual disk 200 i.e. determining the location of the start of the instance of the entity type within the binary file (or other file) forming the virtual disk 200.
- the structural metadata may indicate that the DPE has no parent and thus its start location is determined without reference to any other instance of the entity types within the virtual disk 200.
- the structural metadata may indicate an absolute value of start position for the DPE 210 within the virtual disk 200, such as at byte 446, although this is merely an exemplary value.
- the method 300 has selected and accessed the instance of the entity type from the virtual disk 200 using the structural metadata.
- one or more properties of the instance of the entity type selected in step 325 are determined.
- the properties of the instance of the entity type may be determined by moving to an appropriate position within the virtual disk 200 and reading one or more values indicative of the properties from the virtual disk 200.
- Information about the properties to be determined is obtained from the property metadata of the domain specification 121.
- the information utilised to determine the properties of the instance of the entity type may particularly include the property metadata defining the data format of the property and the location of the property value within the data of the instance of the entity type, as discussed above.
- a temporary instance of the entity type currently selected is created, such as by being held temporarily in memory of the decomposition engine 110. Also, the properties of the temporary instance of the entity type are populated with values corresponding to those of the selected instance of the entity type within the virtual disk 200. In this way, a copy of the selected instance of the entity type within the virtual disk 200 is created within the decomposition engine 110.
- In step 340, it is determined whether a key is defined for the instance of the entity type. As noted above, for some entity types one or more keys are defined within the structural metadata. If no keys are defined for the currently selected instance of the entity type, the method moves to step 355, discussed below. However, if one or more keys are defined, the method moves to step 345.
- In step 345, a key is prepared for the instance of the entity type.
- the key is used to determine whether a decomposed instance of the entity type already exists in the domain store 160.
- preparation of the key may comprise selecting a property of the decomposed instance of the entity type to form the key where one key of the decomposed instance of the entity type is defined in the structural metadata.
- step 345 comprises preparing a combined key from the two or more properties defined as keys in the structural metadata.
- the structural metadata comprises a key strategy to define how to combine the multiple keys to form a single data value.
- step 345 may comprise bit shifting of one or more keys, such that the combined key is prepared. It will be realised that the use of bit shifting is merely exemplary and that other key strategies are envisaged.
- checksums or hashes may be prepared once a key value is prepared. Typically, a decomposed version may be represented via a checksum, or hash, of all Key and Property values, optionally including a Root Node ID (referred to as the Key and Property checksum). Similarly, key usage may be represented via a checksum, or hash, of the key and Node ID (referred to as the Key and Node checksum). Further to the above, it will be appreciated that methods other than checksums could be used for reducing several values to a single value.
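The two checksums named above can be sketched with Python's `hashlib`. SHA-256, the length-prefixed encoding and the function names are assumptions; the text leaves the checksum or hash algorithm open.

```python
import hashlib


def _digest(tag, values):
    """Reduce several values to one hex digest (hypothetical helper)."""
    h = hashlib.sha256(tag.encode("utf-8"))
    for value in values:
        data = repr(value).encode("utf-8")
        # Length prefix keeps adjacent values from running together.
        h.update(len(data).to_bytes(4, "big"))
        h.update(data)
    return h.hexdigest()


def key_and_property_checksum(key, properties, root_node_id=None):
    """Represent a decomposed version by its key and all property values."""
    values = [key, *properties]
    if root_node_id is not None:
        values.append(("root", root_node_id))  # optional Root Node ID
    return _digest("key+properties", values)


def key_and_node_checksum(key, node_id):
    """Represent the use of a key within one decomposition (Node ID)."""
    return _digest("key+node", [key, node_id])


version = key_and_property_checksum(5, ["MFT0", 1024])
usage = key_and_node_checksum(5, 1)
```

Two instances with the same key and property values produce the same Key and Property checksum, so an exact duplicate can be detected with a single lookup.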
- In step 350, the domain store 160 is checked for a key matching that prepared in step 345. If a matching key is found in the domain store in step 350, this indicates that the instance of the entity type currently selected from the virtual disk 200 has already been decomposed into the domain store 160, and the method moves to step 355. However, if no matching key is found, this indicates that the instance of the entity type has not already been decomposed and is thus not already present in the domain store, and the method moves to step 365.
- The steps 340, 345, 350 and 355 perform a process of determining whether a decomposed instance of the entity type exists in the domain store 160. This determining is based on at least one of the values of the properties of the decomposed instance of the entity type and the property metadata.
- The steps 340, 345, 350 and 355 may also be reproduced using a Key and Property checksum, and a Key and Node checksum.
- The steps 350 and 355 are performed first in a single step by comparing the Key and Property checksum already prepared for the instance with those in the domain store 160 (or tracking store 170). If an exact match is found, it represents an exact duplicate. Typically, this is further confirmed by checking that the matched version exists in a decomposition that is parental to the current decomposition. If the version does not exist in a parental decomposition, and is thus "foreign to this branch", the method would proceed to step 370, as it is now known that a tracking entry must be added to represent the version in the branch of the current decomposition.
- a step 365 is arrived at in the case that either no key in the domain store 160 matches that prepared in step 345, or a key matches but the properties of the decomposed instance of the entity type in the domain store 160 do not match that of the entity from the virtual disk 200.
- This therefore indicates that the current decomposed instance of an entity type either does not exist in the domain store 160, or is a version not already existing in the domain store 160, i.e. its key matches but its properties do not. As a result, the decomposed instance of the entity type must be added to the domain store 160 as a new decomposed instance.
- the decomposed instance of the entity type is therefore selectively stored in the domain store 160, for later use, according to the determination of whether the decomposed instance of the entity type already exists in the domain store 160.
- the decomposed instance of the entity type is selectively stored in the domain store 160 such that the domain store 160 is a de-duplicated version of the data store 150.
- a tracking entry is added to the tracking information store 170.
- the tracking store 170 stores information identifying the entity occurrences present in the decomposition and their location within the decomposition.
- the tracking information store 170 can be contrasted with the domain store 160: the domain store 160 provides data, whereas the tracking information store 170 provides content and structure information.
- the tracking store 170 identifies, for each decomposition, which instances of an entity type that virtual disk 200 comprises in addition to their structure.
- the tracking store 170 comprises tracking information associated with each decomposed instance of an entity type in the domain store 160 and includes the NodeID identifying the decomposition with which it is associated.
- the decomposition log stores information associated with the decomposition, such as the NodeID, a relationship of the decomposition to any other decompositions i.e. child, sibling etc.
- the tracking information in the tracking store 170 comprises an identifier for each decomposed instance of an entity type.
- the identifier may be referred to as an AutoID.
- the AutoID may be assigned on a random or pseudo-random basis to each entity. However, in some embodiments of the invention the AutoID is assigned sequentially, i.e. in increments of one, although other increments may be chosen. AutoIDs are assigned on a per-entity-type basis. In other words, decomposed instances of different entity types may be allocated the same AutoID.
- the tracking store 170 also stores parent information for each decomposed instance of an entity type.
- the parent information may be the AutoID of the parent entity instance.
- the parent information represents a graph in which each child may have multiple parents.
- the tracking information in store 170 may also include checksums and change types.
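- A minimal sketch of a tracking entry, assuming sequential AutoIDs allocated per entity type, is given below. The class names `TrackingEntry` and `TrackingStore`, the field names, and the in-memory list are hypothetical; they merely illustrate the items of tracking information listed above (AutoID, NodeID, parent information, checksum and change type).

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass
class TrackingEntry:
    entity_type: str
    auto_id: int                      # unique only within its entity type
    node_id: str                      # decomposition this entry belongs to
    parent_auto_ids: tuple = ()       # more than one parent gives a graph
    change_type: str = "add"
    checksum: Optional[str] = None


class TrackingStore:
    def __init__(self):
        # One independent sequential counter per entity type (assumed strategy).
        self._counters = defaultdict(lambda: count(1))
        self.entries = []

    def add(self, entity_type, node_id, parent_auto_ids=(),
            change_type="add", checksum=None):
        auto_id = next(self._counters[entity_type])
        entry = TrackingEntry(entity_type, auto_id, node_id,
                              tuple(parent_auto_ids), change_type, checksum)
        self.entries.append(entry)
        return entry
```

- Because the counters are kept per entity type, a boot sector and an MFT record may both receive an AutoID of 1; in such a scheme the parent's entity type would be resolved from the structural metadata rather than from the AutoID alone.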
- In step 365, when boot sector 220 is selected, the structural metadata of the domain store 121 defines the boot sector as a child of the DPE 410. Therefore the boot sector 420 is added to the domain store 160 and identified in the tracking store 170 as the child of DPE 410.
- In step 365, MFT0 230 and MFT1 240 from the virtual disk 200 are added as MFT0 430 and MFT1 431. This process of adding new decomposed instances of entity types repeats, thereby gradually forming the tree or graph according to the parental relationships defined in the structural metadata.
- In step 355, one or more properties of the currently selected decomposed instance of the entity type are compared against the properties of either all of the decomposed instances in the domain store 160, or any instances in the domain store 160 having matching keys. Where one or more decomposed instances in the domain store 160 have a matching key, step 355 determines whether the current decomposed instance of the entity type is a version of the existing one or more decomposed instances of entity types in the domain store 160. A version has a matching key but one or more properties that differ.
- the new version of the decomposed instance of the entity type is added to the domain store 160 in step 365 and tracking information in store 170 is added in step 370.
- the new version of MFT0 430 is added as a further child of the boot sector 420.
- In step 360, it is determined whether the decomposed instance of the entity type found in the domain store 160 having matching properties is tracked in a preceding decomposition.
- a preceding decomposition is a "parent" decomposition on which the virtual disk 200 was based. For example, a virtual disk representing an installation on a computer system having one or more programs installed based on a "clean" golden image decomposition for the computer system will be expected to be highly similar to the golden image decomposition.
- Step 360 thereby provides for avoiding re-listing so that only changes in the decomposition are tracked.
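- The decision flow of steps 350 to 370 might be sketched as below. The in-memory structures (a dictionary of property versions per key, and a set of tracking tuples), the function name `store_instance` and the returned labels are assumptions chosen for brevity, not the stores 160 and 170 themselves.

```python
def store_instance(key, properties, node_id, domain_store, tracking,
                   parental_nodes):
    """Selectively store one decomposed instance and return what happened.

    domain_store: dict mapping key -> list of property dicts (versions)
    tracking: set of (key, version_index, node_id) tuples
    parental_nodes: Node IDs of decompositions parental to node_id
    """
    versions = domain_store.setdefault(key, [])
    if properties in versions:
        # Steps 350/355: key and properties both match an existing version.
        version = versions.index(properties)
        tracked_in = {n for (k, v, n) in tracking
                      if k == key and v == version}
        if tracked_in & set(parental_nodes):
            # Step 360: already tracked by a parent, so it is not re-listed.
            return "unchanged"
        # Duplicate data that is foreign to this branch: track it only.
        tracking.add((key, version, node_id))
        return "tracked_only"
    # Step 365: no matching key, or key matches but properties differ.
    outcome = "new_version" if versions else "new_instance"
    versions.append(dict(properties))
    tracking.add((key, len(versions) - 1, node_id))  # step 370
    return outcome
```

- In this sketch only changed or newly encountered instances produce tracking entries, so a decomposition based on a golden image records just its differences from that image.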
- tracking can be used to discover changes to a particular decomposed instance of the entity type and, collectively, across all decomposed instances.
- In step 375, it is determined whether the virtual disk 200 comprises any further instances of entity types.
- Step 375 may firstly determine whether the virtual disk 200 comprises any further instances of entity types of the same type as currently being considered, such as MFTs. In this case, step 375 comprises determining whether a terminator value for a last of plural instances of entity types of the same type has been found. Secondly, step 375 may comprise determining whether the virtual disk 200 comprises any further instances of entity types of different types. If the virtual disk 200 either comprises further instances of the same type or instances of a different type, then the method returns to step 325 where the next instance of an entity type within the virtual disk 200 is selected. In one embodiment the method considers in step 375 whether there are child instances of entity types and returns to step 325 to process the child before selecting further instances of the same type.
- Figure 4 illustrates decomposed instances of entity types and associated tracking information relationships according to an embodiment of the present invention.
- the domain store 160 is a repository for decomposed instances of entity types.
- the domain store 160 does not contain information regarding a hierarchy or structure of decomposed instances of entity types.
- the domain store 160 merely comprises decomposed instances of entity types which have been extracted from decomposed virtual disks 200.
- the tracking store 170 stores content information for the virtual disk 200 i.e. what instances of entity types form the virtual disk 200 from those in the domain store 160 and a relationship of those entities such as parent and child.
- the instances of entity types forming the virtual disk 200 may be presented as being structured as a tree using the tracking information in the store 170.
- the decomposed instances of entity types may be structured as a graph to allow the decomposed instances of entity types in the domain store 160 to have more than one parent.
- Figure 4 presents the tracking information in the store 170 on the left having a structure, whereas the decomposed instances in the domain store 160 are presented on the right.
- An ordering of decomposed entity types in the domain store 160 is insignificant since, as noted above, the domain store 160 does not have a hierarchy or structure, being merely a repository or store of decomposed instances of entity types.
- Figure 5 illustrates decomposed instances of entity types and tracking information after decomposition of a plurality of data stores 150 such as more than one virtual disk 200.
- the first virtual disk 200 is decomposed and allocated a NodeID of X as shown in Figure 4.
- a further virtual disk 200 is decomposed and allocated a NodeID of Y.
- Tracking information in store 170 is also created for the new decomposed instances of entity types including an AutoID for each new decomposed instance, as described above.
- the tracking information in the tracking store 170 defines the structure of the domain data 160.
- the tracking information may be visualised in a tree having a root and one or more branches.
- the tracking information in store 170 branches according to the parent child relationships defined between the entity types but manifested with real data during actual decomposition and forms sibling entities with new domain key occurrences within a plural entity type.
- the specific manifestation of one of a range of possible child entity types during decomposition is production of a dynamic sub-typing arrangement. Sub typing has historically been a difficult arrangement to manage using standard database design and SQL techniques.
- the methods of decomposition described herein are applicable to text and binary data entity types in which the text may be in different formats. It can be further appreciated that after execution of the decomposition process on a data store (such as a virtual disk), using an initial domain, the decomposition process can be re-executed on a constituent, or constituents, of the initial domain. This will therefore allow for the production of further structuring using different domains as dictated by the domain determination techniques described earlier with reference to step 310. The re-executions of the decomposition process may share the same Node ID, which allows the successively detailed structuring of the data store to be managed as a single unit. Often the presence of an initial structure is necessary for subsequent structuring.
- One example of this, having already decomposed an NTFS disk, is to further decompose one or more of the file data associated with the MFT records.
- With a suitable "registry" domain definition in place in the domain library 121, 122, 123, it is possible to produce a single decomposition containing all NTFS related changes and registry changes by further decomposition of selected file data emanating from the first NTFS decomposition.
- This technique, of further structuring an initial structure all under one manageable unit using multiple decompositions, may be termed "introspection".
- From the foregoing, it can be determined from the tracking information in store 170 for a decomposition of a virtual disk those entities which differ from a previous decomposition of a virtual disk 200.
- added or changed entities in a subsequent decomposition can be identified. It can also be determined whether any entities have been deleted from the virtual disk 200 based on one or more previous decompositions.
- the preceding decompositions may be all decompositions from the root to the decomposition immediately preceding the current decomposition.
- the decomposed instances of entity types encountered in the current decomposition may then be compared against a union from the previous decomposition(s), as will be explained.
- Embodiments of the invention utilise a "delta" of tracking information in store 170.
- the term delta is utilised to mean a change. Therefore the delta of tracking information in the store 170 is intended to mean any changes, such as added, changed or deleted, which are recorded as properties within the tracking information.
- a union is formed based on a plurality of decompositions, such as the one or more preceding decompositions referred to above. The union may be formed from only deltas within the same branch of tracking information. If a Node ID from outside a branch is desired in a union i.e. a sibling, then a conflict check may be performed prior to the union. If there are no conflicts the sibling can be included within the union.
- tracking information is added to the union.
- the amount of tracking information added to the union may depend upon the purpose for which the union is being formed. For example, to determine whether a decomposed instance of an entity type is present in the union, and for deletion analysis, only the inclusion of the AutoID for each decomposed instance of an entity type may be required, although parental information will be needed if the deletion analysis is for a child entity type that has multiple parents, i.e. appears in a graph structure.
- the union may include additional tracking information such as parental information and/or NodeID for each entity.
- a parental decomposition is selected.
- the parental decomposition may be the first decomposition at the root of the branch.
- all tracking information in store 170 required, such as AutoID, for entities in that decomposition is added to the union.
- the tracking information of store 170 only for entities which are either added or changed in the decomposition may be added to the union.
- the tracking information for subsequent decompositions in the branch is added to the union.
- Decompositions may be added to a union in one of two ways according to the intended usage of the union. In one formation methodology everything encountered is added into the union whether this is an Add, Edit or Delete, making it convenient to access historical changes, as in historical reporting.
- In the other formation methodology, the union can be formed so that only one version exists for each domain key in the union. Where the union begins from root, this second type of union reproduces a set of entities equivalent to the totality of the binary file as was decomposed in the decomposition occurring last in the union.
- each visited decomposed instance in the union may be marked as deleted or removed from the union. Once all decomposed instances have been marked or removed, those decomposed instances unmarked or remaining in the union indicate those decomposed instances which have been deleted in the virtual disk 200.
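- The single-version form of union and the deletion analysis described above might be sketched as follows. Each delta is assumed, for the purposes of the sketch only, to be a dictionary mapping a domain key to a (change type, AutoID) pair, with deltas supplied in order from the root of the branch; `build_union` and `find_deleted` are hypothetical names.

```python
def build_union(deltas):
    """Form a union holding one version per domain key.

    deltas: list of dicts, root first, mapping
            domain key -> (change_type, auto_id)
    """
    union = {}
    for delta in deltas:
        for key, (change_type, auto_id) in delta.items():
            if change_type == "delete":
                union.pop(key, None)      # no longer present on this branch
            else:
                union[key] = auto_id      # later versions replace earlier ones
    return union


def find_deleted(union, current_keys):
    """Return the domain keys present in the union but not encountered
    in the current decomposition."""
    remaining = dict(union)
    for key in current_keys:
        remaining.pop(key, None)          # visited, so remove from the copy
    return set(remaining)
```

- Whatever remains after every instance of the current decomposition has been visited corresponds to instances deleted from the virtual disk relative to the preceding decompositions.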
- the Key and Node checksum previously described can be utilised to find previous usage of a given key.
- a Key and Node checksum may be prepared for each parental decomposition and a seek function can be performed against Key and Node checksums already stored within the tracking store 170.
- the determination of an add function or edit function is then possible by comparison against the type of change associated with the previous usage of the key, i.e. was the earlier version, if it exists, an add, edit or delete. Where an earlier version is absent, the new version is clearly an add function.
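- One possible reading of this add or edit determination is sketched below. The lookup table keyed by a (key, Node ID) pair stands in for a seek against stored Key and Node checksums, and the rule that a key last seen as a delete counts as a fresh add is an assumption of the sketch rather than a statement of the method.

```python
def classify_change(key, parental_node_ids, key_usage):
    """Decide whether a new version of `key` is an add or an edit.

    parental_node_ids: Node IDs ordered from root to nearest parent
    key_usage: dict mapping (key, node_id) -> "add" | "edit" | "delete"
               (hypothetical stand-in for Key and Node checksum lookups)
    """
    # Search from the nearest parent back towards the root.
    for node_id in reversed(parental_node_ids):
        previous = key_usage.get((key, node_id))
        if previous is not None:
            # Assumed rule: re-appearance after a delete is a new add.
            return "add" if previous == "delete" else "edit"
    # No earlier version exists anywhere on the branch.
    return "add"
```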
- a delta or collection of decomposed instances differing between a current decomposition of a virtual disk and one or more previous decompositions may be determined.
- the delta may represent those decomposed instances having changed (addition, modification or deletion) between decompositions of virtual disks.
- AttrH is the child decomposed instance.
- MFT0 430 is the parent decomposed instance.
- earlier versions of the parent are determined and children of those earlier versions are found. Some embodiments of determining the parent are based on the keys discussed above. The parent may be determined by searching for other parents whose keys are based on the same parent entity instance, but with an earlier NodeID.
- a union may be utilised to determine the parent, such as the parent of AttrH 540 or the parent of MFT0 430.
- the union may be formed by including the AutoID, domain key and parent information of all decomposed instances. NodeID may also be included in some embodiments as this may be useful to record the last decomposition on which the union is formed.
- the union is then utilised to reconcile the child with its parent using any of the embodiments described earlier but with the benefit of having all, and only, pertinent data immediately to hand in one place. This gives both performance benefits as well as convenience of implementation.
- the present invention may employ a traversal function, which is the ability to move from a given parent entity instance to a child entity instance with corresponding data, even with apparently orphaned entities which are a side effect of de-duplication.
- traversal is used in data retrieval activities, notably reporting, and often begins with a domain key i.e. a version agnostic representation of some desired data.
- the first task in traversal is to determine available entity versions related to that key. This typically restricts these to decompositions occurring on the branch of interest in the data retrieval and prior to the point, usually given as a Node ID, representing the decomposition of interest.
- the result is a list of AutoIDs that represent the available versions.
- Structural metadata already indicates which entity types are child to the current entity type, and each of these child entity types may be searched for entities where the Parent AutoID matches one of those in the list of versions (of the parent) already attained. This is done on a per Parent AutoID basis such that child entities are correctly associated with their parent. It may be appreciated that repeating this process of searching for child entities at each level quickly yields a "network" of possible related parent to child versions. All that remains is to select appropriate versions from the "network". Selection is achieved by choosing a version that, ideally, has a Node ID that matches the decomposition of interest in the data retrieval.
- Failing that, the highest preceding Node ID should be used, which will then yield children whose own Node ID will be equal to this earlier Node ID or higher.
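- The version selection just described might be sketched as follows. Ordering is taken here from the position of each Node ID along the branch of interest, which is an assumption of the sketch; `select_version` and its argument shapes are hypothetical.

```python
def select_version(versions, node_of_interest, branch):
    """Pick the AutoID of the version to use for one domain key.

    versions: list of (auto_id, node_id) pairs available for the key
    branch: Node IDs from the root down to (at least) node_of_interest
    Returns None when no version exists at or before node_of_interest.
    """
    position = {node_id: i for i, node_id in enumerate(branch)}
    limit = position[node_of_interest]
    # Keep only versions on this branch at or before the point of interest.
    candidates = [(position[node_id], auto_id)
                  for auto_id, node_id in versions
                  if node_id in position and position[node_id] <= limit]
    if not candidates:
        return None
    # An exact match has the highest position; otherwise the nearest
    # preceding decomposition wins.
    return max(candidates)[1]
```

- Versions tracked in decompositions outside the branch, or after the decomposition of interest, are ignored by this selection.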
- the present invention may also employ the use of packages that facilitate the distribution of customised environments, for example updates to existing virtual systems such as one or more additional software programs, operating system changes, system administrator changes, user profiles etc.
- a package is a collection of data which allows a data store to be created or updated.
- a package may be provided for allowing an update to a virtual disk to be made.
- a package may comprise one or more decompositions and allows those decompositions to be introduced to the virtual disk.
- each delta comprises domain data in the form of one or more decomposed instances of entity types, and tracking information in store 170 indicating the structure of the decomposed instances, as previously discussed.
- the package data comprises information identifying one or more previous decompositions on which the package is based.
- the information may be the NodeID of the associated virtual disk decompositions.
- a package comprises a list of decompositions forming the package and each decomposition may be identified by its respective NodeID. However, the decompositions may be identified by other, either alternative or additional, identification information for distribution of the package.
- the other identification information may be a PermanentID value or an Enterprise Unique Permanent ID (EUPID).
- NodeID entries within the package may be renumbered, along with any data in the package that uses the Node ID, when the package is moved.
- the package may also comprise one or more of: domain information identifying the domain specifications used in the package, which may also be accompanied in some embodiments by associated domain specifications 121, 122, 123; dependent package information identifying packages on which the current package is dependent i.e. are also required; incompatible package information identifying any packages which are incompatible with the current package; compatible or "allowed" packages which are suitable for use with the present package; and retrieval information which identifies one or more ways in which additional data can be obtained.
- the retrieval information may define one or more ways in which: any necessary domain specifications may be obtained; tracking and domain data may be obtained if not contained in the package (a package comprising all necessary information and data may be known as a freestanding package); and data blocks may be obtained. Certain entities such as files have associated data, i.e. the content of the file itself, which may be stored as associated data blocks. The data may be obtained from within the package (in the case of a freestanding package), from a file system local to the package, or from a server. Various combinations of the retrieval mechanism may be envisaged, such as domain and tracking data from within the package and other data from a server.
- the system 100 comprises a reporting engine.
- the reporting engine is provided to generate reports identifying changes to the virtual disk 200.
- the reporting engine may produce a report indicative of a change to a virtual disk by installation or execution of software, such as intentionally installed software or malicious software such as a virus.
- the reporting engine may be arranged to produce a report identifying changes between versions of a virtual disk i.e. pre- and post- installation/execution of the software.
- the report may identify, based on the tracking information of store 170, entities within the virtual disk which have been modified, such as deleted, added or changed, by the installation/execution.
- the reporting engine may be provided with one, or many, NodeIDs representing decompositions of interest in the report. Where the NodeIDs are within a predetermined number of branches, which may be one branch, a historical report may be produced, and the keys in the associated decompositions can be considered as consistent. If an entity present in one decomposition has the same key(s) as other decomposition(s), such as a parent decomposition, the key(s) indicate whether the entity is the same as a potential version in the other decomposition(s). AutoIDs may be disregarded in this type of report as (evolutionary) versioning is provided by NodeID.
- the tracking information in store 170 associated with the deepest NodeID in the branch provides a source of entities of different entity types that have a common parent entity type, considered as the root entity type of interest in the report.
- MFTs are often the root entity type for reports about virtual disks containing an NTFS structure.
- MFT is the shallowest plural entity type in the NTFS structure.
- a root domain key can be established through the child's parental lineage.
- When root domain keys are established in this way for the deepest NodeID given, this produces a list of domain keys that can be applied to the shallowest and subsequent NodeIDs, providing the potential for a report that is focussed on the deepest Node but that can show associated evolution through the preceding nodes.
- the process of establishing a root key list may be called matching, and filtration may be applied to the matching process. It should be noted that this root key list of domain keys represents a version agnostic list of data of interest in the historical report.
- an expansion process may begin to provide related information for each of the root keys listed. Usually this starts at the shallowest NodeID or, rather, a union that stands in for the shallowest NodeID and may start with the absolute root of the branch and finish at the deepest NodeID. Expansion retrieves information associated with each root domain key and then uses the traversal techniques described above, and may utilise the union, to provide information about children in such a way that they are related to the root domain key that is their parent. Use of the union guarantees that a complete initial set of information can be provided for each root key. Typically, for subsequent NodeIDs, only partial information need be displayed as this represents the evolution that is of interest and that is inherently partial in each Node given that changes are unlikely to have occurred in ALL entity types.
- a conflict may arise between two decompositions where a decomposition includes a change to a decomposed instance of an entity type which is at least partly contradictory, or incompatible with, the other decomposition(s). Where provided NodeIDs are siblings, comparison reporting is implied, as is the case where multiple branches containing multiple NodeIDs are provided. Where the NodeIDs have a common parent NodeID the keys within the decompositions may be considered consistent.
- where a decomposed instance of an entity type present in one decomposition has the same key(s) as other decomposition(s), such as a parent decomposition or a sibling decomposition, the key(s) indicate whether the entity might have a potential version, or match, in the other decomposition(s).
- otherwise, keys may have different meanings, and disparate mapping techniques (described elsewhere) must first be employed to achieve key consistency.
- a comparison report identifies conflicts according to the conflict check rules given below. These are the same rules as are used to identify conflicts between sibling decompositions prior to their inclusion in a composition.
- comparison reports may include: an overlap report identifying same and conflicting entries; a similarity report as before but without conflicting entries; and a non-overlapping report showing entries that are not shared, but are discrete to each node in the comparison.
- Additional historical reports may include: a forward from point report showing forward progression of changes that are present in the chosen node (as opposed to standard historical which shows previous changes based on those in the last node); an all history report which includes prior changes outside of those given in the last node, as well as those in the last node; and a first and last, or timeline report, which shows the preceding and following changes based on a chosen mid-way node.
- Metrics reporting is another technique related to historical reporting.
- a Metric takes the form of a Boolean expression that is applied to a delta or delta set.
- the metrics expression itself may often be picked from a library of such metrics expressions that would typically accompany the domain specification.
- the entities of the delta, or delta set are each tested to see if the metrics expression is true or false. Subsequently a count of entities returning true is performed for each metric and a count returned. It is efficient to calculate many such metrics simultaneously provided that the metrics in question have a common aggregate root: that is, a parent entity type that is common to all of the child entity types represented by the informational and expression properties used in the metrics report. Each metric count may be returned independently and used as a scalar result.
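- A minimal sketch of evaluating several metrics in a single pass over a delta is given below. Entities are modelled as dictionaries of properties and each metric as a named Boolean predicate; these shapes, the function name `count_metrics` and the example property names are illustrative assumptions only.

```python
def count_metrics(entities, metrics):
    """Count, for each metric, the entities for which its expression is true.

    entities: iterable of property dicts forming the delta or delta set
    metrics: dict mapping metric name -> predicate taking one entity
    """
    counts = {name: 0 for name in metrics}
    for entity in entities:               # one pass serves every metric
        for name, predicate in metrics.items():
            if predicate(entity):
                counts[name] += 1
    return counts


# Hypothetical delta of file entities and two example metric expressions.
delta = [
    {"name": "a.exe", "size": 10, "change": "add"},
    {"name": "b.txt", "size": 0, "change": "edit"},
    {"name": "c.exe", "size": 5, "change": "delete"},
]
example_metrics = {
    "executables": lambda e: e["name"].endswith(".exe"),
    "empty_files": lambda e: e["size"] == 0,
}
example_counts = count_metrics(delta, example_metrics)
```

- Each count in the result may then be used independently as a scalar value.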
- the informational and expression fields associated with the metric's expression may be listed, along with supplementary informational fields if desired, for each occurrence of the metric being true. This provides the opportunity to drill into the data behind the single number that is a metric. Similarly, the details of many metrics may be reported at once if columns indicating true or false for each metric are shown.
- the matching and expansion processes associated with the historical report that underlies metrics adapt according to the expression properties needed for the metrics expression. Any supplementary informational properties required by the user are also taken into consideration by the matching and expansion processes, which always select the minimum number of entities and columns of data needed to satisfy the requirements of the report. As metrics expressions often utilise the same properties, metrics reports containing multiple metrics are often extremely efficient.
- the invention may employ disparate mapping which is a technique that may benefit decompositions that are not based on the same root, such as a golden image decomposition, but may wish to be considered as similar. For example, where two golden images are prepared separately and one proves to be stable, the other unstable, it may be necessary to analyse the difference in original configuration to correct the instability. As the two are separate installations, the position of files within the MFT table cannot be guaranteed and so keys that are normally reliable in non-disparate situations may not be reliable.
- Disparate mapping uses information about "human readable" properties expressed within the structural metadata 121, 122, 123 to find commonality between the otherwise disparate keys.
- the mapping usually maps the keys of the root-most plural entity in the domain structure, for example MFT, whose associated key is MFT index, in the case of NTFS.
- the mapping first establishes commonality amongst the human readable properties before writing the keys associated with both decompositions for that entity to a storage structure such as a table (for sake of efficiency and consistency).
- This mapping table may accompany the disparate decomposition in packages, and other Node uses, as it correctly establishes common meaning between the keys in this decomposition and its dependent decompositions.
- Via reference to the mapping table, ordinary processes such as union and comparison can be performed as though the decomposition was non-disparate. Disparate mapping may be particularly useful in, but not limited to, allowing decomposition within the same branch (dependent decomposition) where the provided data store is uncommon but expected to be similar. This allows comparison reports to be created between unrelated branches, whereby the mapping table would be of a temporary nature.
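- Construction of such a mapping table might be sketched as follows. The choice of a file path as the "human readable" property, the rule that only unambiguous matches are mapped, and the name `build_mapping_table` are assumptions of the sketch.

```python
def build_mapping_table(left, right, readable=("path",)):
    """Pair keys of two disparate decompositions via readable properties.

    left, right: dicts mapping key (e.g. MFT index) -> property dict
    readable: names of the human readable properties to compare (assumed)
    Returns a list of (left_key, right_key) pairs.
    """
    index = {}
    for key, props in right.items():
        signature = tuple(props.get(name) for name in readable)
        index.setdefault(signature, []).append(key)

    table = []
    for key, props in left.items():
        signature = tuple(props.get(name) for name in readable)
        matches = index.get(signature, [])
        if len(matches) == 1:             # skip ambiguous or missing matches
            table.append((key, matches[0]))
    return table
```

- With such a table to hand, a key from one decomposition can be translated into the corresponding key of the other before a union or comparison is performed.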
- Changes to an instance of an entity type may be deemed to be consistent (not conflicting) when one or more decompositions: add an entirely new decomposed instance of an entity type (new domain key); add exactly the same version (as another add or edit); perform an edit producing a same version of the decomposed instance of an entity type in all decompositions in which it occurs; or delete rendering the instance of an entity type completely absent from all decompositions being compared.
- conflict is generally indeterminate without recourse to preceding (parental) nodes.
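- The consistency rules above might be applied to two sibling deltas as sketched below. Each delta is assumed to map a domain key to a (change type, version checksum) pair. Only keys changed in both siblings are examined; keys changed in just one sibling are left out of the sketch because, as noted, they cannot generally be judged without the parental nodes. The name `find_conflicts` is hypothetical.

```python
def find_conflicts(sibling_a, sibling_b):
    """Return the domain keys on which two sibling deltas conflict.

    Each delta maps key -> (change_type, version), where change_type is
    "add", "edit" or "delete" and version is a checksum (None for delete).
    """
    conflicts = set()
    for key in sibling_a.keys() & sibling_b.keys():
        change_a, version_a = sibling_a[key]
        change_b, version_b = sibling_b[key]
        # Consistent: the instance is deleted in both decompositions.
        both_deleted = change_a == change_b == "delete"
        # Consistent: adds or edits that produce exactly the same version.
        same_version = ("delete" not in (change_a, change_b)
                        and version_a == version_b)
        if not (both_deleted or same_version):
            conflicts.add(key)
    return conflicts
```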
- Figure 6 illustrates a computer implemented method 600 of composing a data store according to an embodiment of the invention.
- the method 600 is described, by way of example, as composing the virtual disk 200 based on an initial virtual disk as a starting point.
- Embodiments of the invention may compose a virtual disk entirely from scratch i.e. without the initial virtual disk.
- However, since a virtual disk, even of a golden installation, comprises a large number of files and metadata, it may be considered expedient to use an existing base virtual disk, such as for the golden image.
- the base virtual disk is not necessarily a golden image.
- the base image may be a virtual disk having one or more changes from the golden image, such as installed programs, user profiles, etc.
- In step 605, one or more packages are selected.
- the packages may be selected by the user, such as using an appropriate user interface.
- the packages may be selected by the user indicating, via the user interface, one or more characteristics of a desired virtual session which is to be established using the virtual disk being created by the method 600.
- the characteristics may indicate desired software of the virtual session.
- Step 605 may comprise obtaining selected one or more packages that may be obtained from one or more storage devices or computer systems, such as a server.
- step 610 a master NodelD list is determined or compiled.
- Step 610 may comprise, for each package, adding to the master NodelD list any NodelDs of decompositions utilised by the package to form the distinct master NodelD list.
- step 610 may be formed as a loop which sequentially examines each package and adds the relevant NodelDs to the master NodelD list.
- NodelDs of any user session changes associated with any previous user sessions i.e. previous executions of the composed disk with these characteristics) are included in the master NodelD list.
- the user session changes are any modifications as a result of the user session, such as a virtual session executing in association with the virtual disk. It is envisaged that changes made by a user in a session, being an execution of a composed disk with particular characteristics, would also be captured using decomposition and thus be represented by one Node ID per session (unless the sessions are combined together for efficiency reasons).
- the NodelDs associated with the user session changes are included in the master NodelD list.
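The compilation of the master NodeID list can be sketched as follows. The function name `build_master_node_ids`, the representation of a package as a name mapped to a list of NodeIDs, and the example package names are assumptions made for illustration only.

```python
from typing import Dict, Iterable, List


def build_master_node_ids(packages: Dict[str, List[int]],
                          session_node_ids: Iterable[int] = ()) -> List[int]:
    """Return a distinct list of NodeIDs, preserving the order of first appearance."""
    master: List[int] = []
    seen = set()

    def add(node_id: int) -> None:
        if node_id not in seen:          # keep the list distinct
            seen.add(node_id)
            master.append(node_id)

    for node_ids in packages.values():   # examine each selected package in turn
        for node_id in node_ids:
            add(node_id)
    for node_id in session_node_ids:     # NodeIDs of previous user session changes
        add(node_id)
    return master


# Hypothetical selection: two packages sharing Nodes 1 and 2, plus one session Node.
selected = {"office": [1, 2, 3], "ide": [1, 2, 4]}
master_list = build_master_node_ids(selected, session_node_ids=[9])
```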
- In step 620, one or more lists of NodeIDs are determined, typically in a tree structure. All possible branches are determined for each of the required nodes resulting from steps 605 to 615.
- Figure 7 illustrates the complete nodal tree associated with two exemplary package selections by the user, concerned with Nodes 3 and 4. As Nodes 3 and 4 are siblings and both dependent on Nodes 1 and 2, where Node 2 is a child of Node 1, four possible branches are determined, being: 1; 1,2; 1,2,3; and 1,2,4.
- In step 625, it is determined that the base image is associated with Nodes 1 and 2, which may be removed from the possible branches, leaving two branches: one consisting of Node 3, the other of Node 4.
- Steps 620 and 625 establish full and proper context for Nodes that are dynamically selected in steps 605 to 615, exclusive of any Nodes that may be pre-existent in a base image as determined in step 625.
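The branch determination of steps 620 and 625 can be illustrated with the four-node tree of Figure 7. The parent map and the function names below are assumptions chosen for this sketch; the source does not prescribe a data structure.

```python
from typing import Dict, Iterable, List, Optional, Set


def path_from_root(node: int, parent: Dict[int, Optional[int]]) -> List[int]:
    """Walk up the parent links and return the branch from the root down to node."""
    path = []
    current: Optional[int] = node
    while current is not None:
        path.append(current)
        current = parent[current]
    return path[::-1]


def possible_branches(selected: Iterable[int],
                      parent: Dict[int, Optional[int]]) -> List[List[int]]:
    """Step 620: every branch leading to a selected node or to one of its ancestors."""
    nodes: Set[int] = set()
    for node in selected:
        nodes.update(path_from_root(node, parent))
    return sorted(path_from_root(node, parent) for node in nodes)


def remove_base_nodes(branches: List[List[int]], base: Set[int]) -> List[List[int]]:
    """Step 625: drop nodes already present in the base image, then empty branches."""
    remaining: List[List[int]] = []
    for branch in branches:
        rest = [node for node in branch if node not in base]
        if rest and rest not in remaining:
            remaining.append(rest)
    return remaining


# Figure 7: Node 2 is a child of Node 1; Nodes 3 and 4 are children of Node 2.
parent_of = {1: None, 2: 1, 3: 2, 4: 2}
branches = possible_branches([3, 4], parent_of)
to_apply = remove_base_nodes(branches, base={1, 2})
```

Here `branches` is `[[1], [1, 2], [1, 2, 3], [1, 2, 4]]` and `to_apply` is `[[3], [4]]`, matching the worked example in the text.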
- In step 628, prior to forming a master union, also referred to as the composite delta, the tracking information of the current (new) Node is conflict checked against the composite delta. As noted in the conflict checks section above, a conflict exists where the current (new) Node would introduce inconsistent changes into the composite delta.
- In step 630, a composite delta is formed. The delta may be formed by forming a union based on the NodeIDs, as explained previously.
- In step 635, the NodeIDs included in the composite delta are removed from the lists or branches of NodeIDs produced in step 625.
- In step 640, it is determined whether there are remaining nodes in any of the branches to be included. If there are further nodes remaining, the method returns to step 628.
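The loop over steps 628 to 640 might be sketched as below. This is an illustration under stated assumptions: tracking information is modelled as a dictionary per Node mapping a domain key to a `(change type, version)` pair, a later Node in the same branch is allowed to supersede an earlier one, and a conflict is raised only when different branches disagree on the same key. None of these representations is taken from the source.

```python
from typing import Dict, List, Tuple

Entry = Tuple[str, str]                 # (change type, version), both hypothetical


def form_composite_delta(branches: List[List[int]],
                         tracking: Dict[int, Dict[str, Entry]]) -> Dict[str, Entry]:
    pending = [list(branch) for branch in branches]
    composite: Dict[str, Tuple[Entry, int]] = {}   # key -> (entry, supplying branch)
    while any(pending):                            # step 640: nodes still remaining?
        for index, branch in enumerate(pending):
            if not branch:
                continue
            node_id = branch.pop(0)                # step 635: remove the node once used
            for key, entry in tracking[node_id].items():
                if key in composite:
                    existing, owner = composite[key]
                    if owner != index and existing != entry:   # step 628: conflict check
                        raise ValueError(f"Node {node_id} conflicts on {key!r}")
                composite[key] = (entry, index)    # step 630: union into the delta
    return {key: entry for key, (entry, _) in composite.items()}


# Hypothetical tracking information for the two remaining branches of Figure 7.
tracking_info = {3: {"file/a": ("add", "v1")}, 4: {"file/b": ("add", "v2")}}
delta = form_composite_delta([[3], [4]], tracking_info)
```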
- In step 650, the virtual disk is composed based on the tracking information (the composite delta) derived from at least steps 628 to 640.
- Each row of the tracking information (composite delta) is sequentially selected and processed, as shown in Figure 8. Once a row of the tracking information has been selected, the corresponding entity is obtained from the domain data and added to the virtual disk being composed based on the tracking information. Once all entities have been added to the virtual disk, the method moves to step 655, where any required post-processing is performed.
- The post-processing may include updating any data, such as directory indexes, required for the domain, e.g. NTFS.
- Figure 8 illustrates a computer implemented method 800 of forming a composite image, which is one embodiment of performing the step 650 of the method 600.
- A decomposed instance of an entity type is selected from the tracking information for the composite image.
- The decomposed instance may be selected by selecting a row of the tracking information.
- Domain data corresponding to the selected decomposed instance of the entity type is obtained.
- The domain data may be obtained based on an AutoID obtained from the tracking information.
- In step 830, a location for the instance of the entity type in the composite image is determined from the tracking entries. That is, the location for the instance of the entity type when added to the composite image, such as the virtual disk being created, is determined.
- Step 830 may comprise calculating a byte position within the composite image for the instance of the entity type. The calculation may be based upon the byte positions of entities already present in the composite image. For example, in the case of an MFT entity, the byte position for a further MFT entity may be calculated with reference to the domain specification 121, which provides data regarding an offset of MFTs in the domain, and to the location of the last MFT in the composite image.
- Step 830 essentially stores instances of the entity types in the data store 150 at locations determined by the tracking information.
- In step 840, key values associated with the instance of an entity type are updated if necessary.
- The key values are updated by referring to the structural metadata of the domain specification 121, along with key values in the domain data and those observed in the existing composite image, which may suggest a new offset for the data with a potential corresponding shift in key value.
- The key values are calculated based upon, for example, the location of the instance of an entity type in the composite image according to the structural metadata 121.
- In step 850, any further data associated with the instance of an entity type is updated based on the domain specification and attributes of the entity within the composite image.
- Step 850 may include calculating offsets, filling in constants, etc., as necessary to create a complete NTFS entity.
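The row-by-row composition and the byte position calculation of step 830 can be sketched as follows. The function `compose_image`, the row fields `domain_key` and `auto_id`, and the tiny offset and record size used in the example are assumptions for illustration; key value updates (step 840) and NTFS-specific fix-ups (step 850) are deliberately left out.

```python
from typing import Dict, List, Tuple


def compose_image(tracking_rows: List[dict],
                  domain_data: Dict[int, bytes],
                  mft_offset: int,
                  record_size: int) -> Tuple[bytes, Dict[str, int]]:
    """Place each tracked record after the previous one, starting at mft_offset."""
    image = bytearray(mft_offset)            # zero-filled up to the first record
    positions: Dict[str, int] = {}
    last_position = None
    for row in tracking_rows:                # select one row of tracking information
        record = domain_data[row["auto_id"]] # obtain domain data via the AutoID
        if len(record) > record_size:
            raise ValueError("record larger than the slot size")
        # Byte position derived from the offset and the last record already placed.
        position = mft_offset if last_position is None else last_position + record_size
        image.extend(bytes(record_size))     # grow the image by one zeroed slot
        image[position:position + len(record)] = record
        positions[row["domain_key"]] = position
        last_position = position
    return bytes(image), positions


# Hypothetical values: two records, a 16-byte offset and 8-byte slots.
rows = [{"domain_key": "mft:0", "auto_id": 10}, {"domain_key": "mft:1", "auto_id": 11}]
store = {10: b"FILE-A", 11: b"FILE-B"}
image, where = compose_image(rows, store, mft_offset=16, record_size=8)
```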
- Pseudo composition may offer a preferred method for updating a data store. Typically this is true where there is technological difficulty, or advantage, in obtaining or updating the data store.
- For example, the patch data from a synthesizer keyboard must first be obtained using a system exclusive dump, as defined in the Musical Instrument Digital Interface (MIDI) protocol. Updating the synthesizer, also via system exclusive, may offer single patch updates and so represents a natively atomic method of update, albeit technology and domain specific.
- A delta prepared in one branch may be extracted from any other branch, where other branches may be identified by their having differing root nodes. For example, the delta from the foreign branch may provide a potentially desirable corrective action to a composition. Also, it may be possible to remove the effects and source of a virus in potentially infected branches; this represents an alternative way of applying virus templates that mitigates both the source and the effects of viruses. This is unlike current anti-virus techniques, which attempt to quarantine virus sources in order to prevent effects, as the effects cannot currently be eradicated by anti-virus tools.
- Disparate mapping may first be used to provide a compatible set of keys for the delta being removed.
- An AutoID is then sought for each of the entities in the delta, potentially utilising the mapped keys. Any references to the AutoIDs are then removed from any tracking information 170.
- The first subsequent edit following a removal is upgraded to the change type that belonged to the removed entity, whilst a decomposed instance of an entity type following the removal of a delete is upgraded to the change type of edit.
- The decomposed instance of an entity type may also be removed from the domain store 160, as its data may hold a virus.
- The extract delta process can be customized so that deletion from the domain store is forced for certain decomposed instances of an entity type within the delta.
- Accompanying removal of all references in tracking information is required in this case and is tantamount to removal from all branches, following the processes already described.
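One possible reading of the change type upgrade described above is sketched below for the tracking history of a single domain key. The row layout, the function name `extract_delta` and the assumption that every tracking row (including a delete) carries an AutoID are all hypothetical; the source does not give this level of detail.

```python
from typing import Iterable, List


def extract_delta(history: List[dict], removed_auto_ids: Iterable[int]) -> List[dict]:
    """Remove tracking rows by AutoID and upgrade the row that follows a removal."""
    removed = set(removed_auto_ids)
    kept: List[dict] = []
    inherited = None                       # change type of the removed row(s)
    for row in history:                    # rows are in order of occurrence
        if row["auto_id"] in removed:
            if inherited is None:
                inherited = row["change"]
            continue
        row = dict(row)                    # do not mutate the caller's rows
        if inherited == "delete":
            if row["change"] != "delete":
                row["change"] = "edit"     # the instance now follows a live version
        elif inherited is not None and row["change"] == "edit":
            row["change"] = inherited      # e.g. an edit becomes the add it replaced
        inherited = None
        kept.append(row)
    return kept


history_a = [{"node": 1, "change": "add", "auto_id": 10},
             {"node": 2, "change": "edit", "auto_id": 11},
             {"node": 3, "change": "edit", "auto_id": 12}]
after_a = extract_delta(history_a, [10])   # the first edit is upgraded to an add
```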
- Much of the data that is parsed out of the presented binary file and persisted into the domain store 160 is converted to numbers or strings of some sort. However, some data is simply binary in nature. For example, the actual binary data blocks associated with a file are to be found via the data attribute of each MFT record.
- The domain store may also accommodate binary data.
- Embodiments of the invention employ features for the binary data type that ensure management of the binary data remains efficient. Actual retrieval of binary data can occur in a subsequent pass or iteration, depending on settings within the metadata associated with the binary data type property. For example, the second pass, if utilised, only iterates over changed items that are associated with binary data types. The second pass is useful where rules governing de-duplication and persistence of the binary data rely on data that might be subsequent in the parse. For example, a second pass is important if directory paths are to be used in the binary data rules (see below), because part of the path may occur in an MFT that is subsequent.
- One or more rules govern the de-duplication and persistence of the binary data, as explained below.
- A pre-selection rule defines whether the binary data is to be de-duplicated and persisted.
- A direct persist rule defines whether the binary data can be persisted without the need to compare data blocks.
- A Boolean result may be determined, indicative of whether the binary data can be sent direct to the domain store without any further (performance intensive) data block comparisons.
- A binary de-duplication rule defines whether any binary block checks are needed. If the result of this rule is true, a running byte for byte check is performed between the data in the binary file and the versions of the binary data already in the domain store that are associated with the current (domain) key. Optionally, equivalent comparisons using hashes or checksums may be performed. If a match is found, no persistence takes place and the AutoID for the matching row is returned; otherwise the binary data is persisted according to the usual domain storage and tracking entry methods.
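The three rules can be sketched as flags on a single persist call. The class `BinaryDomainStore`, its in-memory layout and the SHA-256 pre-check are assumptions for illustration; the rules themselves would in practice be evaluated from metadata rather than passed in as Booleans.

```python
import hashlib
from typing import Dict, List, Optional, Tuple


class BinaryDomainStore:
    """In-memory stand-in for the binary part of a domain store."""

    def __init__(self) -> None:
        # domain key -> list of (AutoID, digest, data) versions already stored
        self._versions: Dict[str, List[Tuple[int, str, bytes]]] = {}
        self._next_auto_id = 1

    def persist(self, key: str, data: bytes, *, preselect: bool = True,
                direct: bool = False, dedupe: bool = True) -> Optional[int]:
        if not preselect:                       # pre-selection rule: skip this data
            return None
        digest = hashlib.sha256(data).hexdigest()
        versions = self._versions.setdefault(key, [])
        if dedupe and not direct:               # block checks only when required
            for auto_id, stored_digest, stored in versions:
                # Cheap hash comparison first, then a byte for byte confirmation.
                if stored_digest == digest and stored == data:
                    return auto_id              # match found: nothing is persisted
        auto_id = self._next_auto_id
        self._next_auto_id += 1
        versions.append((auto_id, digest, data))
        return auto_id
```

A second presentation of identical data under the same key returns the first AutoID, whereas setting `direct=True` stores the data again without any comparison.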
- Dynamic domain creation
- An alternative to decomposition using a static domain specification, as per the decomposition of the NTFS data store according to domain specification 121 previously described, is dynamic domain creation.
- In dynamic domain creation, the domain specification is "discovered" from the self-describing data store 150.
- The technique applies especially to text based stores, such as a previously unknown XML file.
- Dynamic domain creation does rely on a default domain specification 124 which, in this exemplary embodiment, gives only default values that define how XML might generally be decomposed. These default values relate to metadata, as described earlier, that allows processing to remain as flexible as possible.
- The metadata may include: the expected file name of the data store; the type of data store; and how to detect the opening of a start tag.
- Example 1 below illustrates an exemplary xml file in the data store 150 whose domain is previously unknown.
- Example 1: an exemplary xml file.
- Referring to Figure 9, there is illustrated a computer implemented method 900 of storing data, in accordance with another embodiment of the present invention.
- The method 900 is suitable for, and provides for, decomposing a file in a data store in which the domain is previously unknown.
- The method 900 will therefore be explained, by way of example only, with reference to decomposition of the exemplary xml file of example 1.
- The previously unknown xml file of example 1, residing in the data store 150, which can be the virtual disk 200, is provided.
- The structure of the unknown xml file 150 is not known, nor is it initially known that the file is an xml file.
- Context information 155 is also provided.
- In step 902, the unknown xml file in the data store 150 is assessed to determine whether it is a binary or a text based file; if it is text based, the encoding used for the text may also be determined.
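An assessment of the kind made in step 902 might look like the sketch below. The heuristics (byte order marks, a check for NUL bytes, then a trial UTF-8 decode) and the function name `assess_data_store` are assumptions, not the method of the source; UTF-32 and legacy code pages are not handled.

```python
import codecs
from typing import Optional, Tuple


def assess_data_store(data: bytes) -> Tuple[str, Optional[str]]:
    """Return ("text", encoding) or ("binary", None) for the presented bytes."""
    boms = ((codecs.BOM_UTF8, "utf-8-sig"),
            (codecs.BOM_UTF16_LE, "utf-16"),
            (codecs.BOM_UTF16_BE, "utf-16"))
    for bom, encoding in boms:
        if data.startswith(bom):
            return "text", encoding        # a byte order mark identifies the encoding
    if b"\x00" in data:
        return "binary", None              # NUL bytes are unlikely in plain text
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "binary", None
    return "text", "utf-8"
```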
- In step 904, a domain of the unknown xml file, in the virtual disk 200, is determined as being the default domain 124, indicating that the file should be treated in a default manner.
- Domain 124 indicates default xml treatment in this exemplary embodiment.
- This default treatment includes creation, in step 906, of a new domain initially containing no entity types. Typically, the identifier of this new domain specification would be in a different numerical range to known domain specifications, as this differentiation is advantageous when converting a domain from its dynamically created status to a status according to a known domain specification.
- In step 906, all aspects of the default domain specification are copied through to the newly created domain specification, identified as 126 in this example, except that domain specification 126 is not a default domain.
- At this point, the method 900 has accessed a domain specification for the data store 150.
- The domain specification comprises structural metadata indicative of a structure of the data store 150 and property metadata indicative of properties of at least one entity type in the data store 150. More specifically, only when a close tag is discovered in the xml file (assuming that a file of this domain type was not encountered before) is an entity type and the associated property metadata created in the domain specification.
- Step 908 operates as for all decompositions and creates an identifier, also referred to as a NodeID, to identify the decomposition.
- Any content up to the opening tag is collected and stored in an internal variable within the active entity. However, there is initially no active entity, as tags have yet to be encountered, as will be seen.
- The process advisory metadata now present in the created domain specification 126 provides instruction for selection of an initial start tag from the text file and collection of its tag name, "MyRoot".
- Step 914 is performed after the tag name has been collected and, when it is determined that the "MyRoot" tag just discovered is an open tag, the method 900 continues to step 916.
- In step 916, a new empty entity ("MyRoot") is created without any (dynamic) properties and is placed on a stack as the currently active entity.
- Step 918 determines whether "MyRoot" has attributes and, as there are no attributes, the method 900 proceeds to a test step 944 to determine if a search for further tags is required. When test step 944 determines that a search for further tags is required, the method returns to steps 910 to 914, where the "MyEntity" tag is discovered. If the tag is determined at step 914 to be an opening tag and the element name is collected, then another new entity is created and added to the stack at step 916. The stack now contains "MyRoot" and "MyEntity", with the latter being the active entity.
- When step 918 determines that "MyEntity" has attributes, they are collected at step 920, resulting in an "@attr1" property being sought within the active entity "MyEntity" (attribute names may be prepended to form property names, as is known to a person skilled in the art). If no corresponding property is found, a new dynamic property is created within the active entity and is set to the value just collected. A further iteration of steps 910 to 914 (via step 944) leads to discovery of "Prop1". A "Prop1" entity is thus created and placed on the stack at step 916 as the currently active entity.
- Upon closure of the "Prop1" opening tag, a search for further tags continues as determined by test step 944, and steps 910 to 914 are again repeated until the next opening. Consequently, the "Prop1" property value "My Prop" is collected and assigned to an internal variable within the "Prop1" element that is currently represented as the active entity.
- Step 912 then discovers that the "Prop1" tag is not an open tag (it is a closing tag), and thus after step 914 the method 900 proceeds to a test step 922, where the tag is checked to determine if it is actually a closing tag. After step 922 determines that the tag is a closing tag, a test step 924 determines that the "Prop1" entity on the stack has no properties within it.
- The method therefore proceeds to step 926, where the "Prop1" name, value and any other related data, such as any hash or checksum, are retrieved (copied) from the "Prop1" entity.
- The entity is then popped from the stack and a new property is created in the now active "MyEntity" entity, into which the retrieved name and value are set along with any associated data.
- Steps 922 and 924 determine that the "ChildProp1" entity lacks properties.
- "ChildProp1" must therefore be a property itself, and so the step 926 converts the "ChildProp1" entity into a property within the now active "MyChildEntity" as usual.
- The method discovers the "MyChildEntity" close tag via steps 912, 914 and 922. However, at step 924 the active entity (being closed) is discovered to contain the property "ChildProp1".
- A step 928 is executed, as it is assumed that "MyChildEntity" must be an entity and, as the entity is now closed, it can be committed to the domain store 160.
- In step 928, the current entity is persisted to the domain store 160, which consists of ensuring that the entity type exists and that the entity type has all properties defined for it. If the entity type or any of its properties do not exist, they are created. It is then possible for the entity to be persisted to the storage just allocated for the entity type within the domain store 160, and persistence proceeds as per standard decomposition and as given in steps 340 through 370.
- Step 928 is invoked to persist "MyEntity" as has been described previously. Only "MyRoot" is present on the stack following the persistence and pop of "MyEntity", and "MyRoot" understands that it has a child of "MyEntity" at this point.
- At step 928, the method 900 has selected and accessed an instance of an entity type from the data store 150 using structural metadata.
- Further iteration via step 944 leads to discovery of the second occurrence of "MyEntity" and, in the same manner as for the first occurrence of "MyEntity", "attr1" and "Prop1" with their corresponding values "My2ndEntityAttrib" and "My 2nd Entity Prop" are retrieved from the xml data store and ultimately become properties in the active "MyEntity" entity.
- When step 928 is executed in response to the close tag of "MyEntity", values from the temporary "MyEntity" entity on the stack are copied to an instantiation of the true "MyEntity" in the domain store 160. These values are copied after first applying any rules that might be indicated regarding how entities and properties present in the data store 150, but not present in the domain store 160, might be treated. Where a domain (domain specification) for the data store is known prior to decomposition, it is possible to instruct that entities and/or properties retrieved from the data store that are not present in the domain specification are rejected. Options over how the data store as a whole is processed may also be supplied in order that data stores containing invalid xml, for example, might be rejected completely. Actual persistence otherwise takes place as per steps 930 to 940.
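The stack-based walk described above can be approximated with Python's standard `xml.parsers.expat` callbacks for opening tags, text and closing tags. Everything here is a sketch under assumptions: the listing of example 1 is not reproduced in this text, so `EXAMPLE_XML` is a hypothetical stand-in built from the names quoted in the walkthrough, and the values "MyEntityAttrib" and "My Child Prop" are invented. An element that closes with no properties and no child entities is treated as a property of its parent; anything else is persisted as an entity.

```python
import xml.parsers.expat
from typing import Dict, List, Set, Tuple

# Hypothetical stand-in for example 1 (two attribute/property values are invented).
EXAMPLE_XML = """<MyRoot>
  <MyEntity attr1="MyEntityAttrib">
    <Prop1>My Prop</Prop1>
    <MyChildEntity><ChildProp1>My Child Prop</ChildProp1></MyChildEntity>
  </MyEntity>
  <MyEntity attr1="My2ndEntityAttrib">
    <Prop1>My 2nd Entity Prop</Prop1>
  </MyEntity>
</MyRoot>"""


def decompose_xml(text: str) -> Tuple[Dict[str, Set[str]], List[Tuple[str, dict]]]:
    stack: List[dict] = []                     # last item is the active entity
    entity_types: Dict[str, Set[str]] = {}     # discovered domain specification
    instances: List[Tuple[str, dict]] = []     # persisted in order of closing

    def on_open(name, attrs):                  # opening tag: push a new empty entity
        props = {"@" + key: value for key, value in attrs.items()}
        stack.append({"name": name, "props": props, "text": [], "children": 0})

    def on_text(data):                         # content is held in the active entity
        if stack:
            stack[-1]["text"].append(data)

    def on_close(name):                        # closing tag: property or entity?
        closed = stack.pop()
        if not closed["props"] and not closed["children"]:
            if stack:                          # no properties: becomes a property
                stack[-1]["props"][name] = "".join(closed["text"]).strip()
            return
        entity_types.setdefault(name, set()).update(closed["props"])
        instances.append((name, dict(closed["props"])))
        if stack:
            stack[-1]["children"] += 1         # the parent now knows it has a child

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_open
    parser.CharacterDataHandler = on_text
    parser.EndElementHandler = on_close
    parser.Parse(text, True)
    return entity_types, instances


types, persisted = decompose_xml(EXAMPLE_XML)
```

With this input, "MyChildEntity" is persisted first, then both "MyEntity" instances, and finally "MyRoot", which has no properties of its own but two child entities.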
- The steps 930, 932, 934, 938 and 940 perform a process of determining whether a decomposed instance of an entity type exists in the domain store 160. This determining is based on at least one of the values of the properties of the decomposed instance of the entity type and property metadata.
- A data store "investigation" step determines all information pertinent to processing of delimited and fixed width text, in accordance with any processing advisory metadata, including: numbers of properties; property names; property data types; and delimiters and text qualifiers used.
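For delimited text, an "investigation" of this kind might be sketched as below. The heuristic (pick the candidate delimiter that occurs the same non-zero number of times on every line), the assumption of a header row, and the names `investigate` and `infer_type` are illustrative only; fixed width text and qualified fields containing the delimiter are not handled.

```python
import csv
from typing import List


def infer_type(values: List[str]) -> str:
    """Return the narrowest of int, float or str that fits every value."""
    for caster, name in ((int, "int"), (float, "float")):
        try:
            for value in values:
                caster(value)
        except ValueError:
            continue
        return name
    return "str"


def investigate(sample: str, candidates: str = ",;\t|", qualifier: str = '"') -> dict:
    lines = [line for line in sample.splitlines() if line.strip()]
    best, best_count = None, 0
    for candidate in candidates:
        counts = {line.count(candidate) for line in lines}
        if len(counts) == 1:                   # same count on every line
            count = counts.pop()
            if count > best_count:
                best, best_count = candidate, count
    if best is None:
        raise ValueError("no consistent delimiter found")
    rows = list(csv.reader(lines, delimiter=best, quotechar=qualifier))
    names, data = rows[0], rows[1:]            # assumption: first row holds the names
    types = {name: infer_type([row[i] for row in data]) for i, name in enumerate(names)}
    return {"delimiter": best, "qualifier": qualifier, "property_count": len(names),
            "property_names": names, "property_types": types}


report = investigate("id;name;price\n1;Widget;2.50\n2;Gadget;10\n")
```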
- The domain store can be prepared after the "investigation" step, and decomposition can then proceed substantially as described in association with Figure 3.
- The decomposed instance of each entity type is selectively stored in the domain store 160 at step 936, for later use, according to the determination of whether the decomposed instance of the entity type already exists in the domain store 160.
- The decomposed instance of the entity type is selectively stored in the domain store 160 such that the domain store 160 is a de-duplicated version of the data store 150.
- A tracking entry is added to the tracking information store 170, at step 942, for a respective decomposed instance of an entity type.
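Selective storage with tracking can be sketched as follows. The `DomainStore` class, the digest over the entity type and property values, and the tuple form of a tracking entry are assumptions; the source bases the existence check on property values and property metadata without prescribing a hashing scheme.

```python
import hashlib
import json
from typing import Dict, List, Tuple


class DomainStore:
    """In-memory stand-in for a de-duplicated domain store plus tracking information."""

    def __init__(self) -> None:
        self.rows: Dict[int, Tuple[str, dict]] = {}   # AutoID -> stored instance
        self.tracking: List[Tuple[int, int]] = []     # (NodeID, AutoID) entries
        self._by_digest: Dict[str, int] = {}

    def store(self, node_id: int, entity_type: str, props: dict) -> int:
        payload = json.dumps([entity_type, props], sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()
        auto_id = self._by_digest.get(digest)
        if auto_id is None:                           # not yet present: store it once
            auto_id = len(self.rows) + 1
            self.rows[auto_id] = (entity_type, dict(props))
            self._by_digest[digest] = auto_id
        self.tracking.append((node_id, auto_id))      # always record a tracking entry
        return auto_id
```

Two decompositions that present an identical instance therefore share one stored row but each receive their own tracking entry.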
- The decomposed instance of each entity type can be processed or used in any procedure or function as described above, such as forming composed data store images, reporting, identifying deltas and providing packages.
- Instances of entity types may be presented to the domain store 160 in a form that is already structured, and without direct recourse to any particular data store 150.
- Where structured access to a data store 150 is already available, there is no need to decompose such data store 150.
- For example, direct decomposition of a Structured Query Language (SQL) Server Database file would be a complex and unnecessary undertaking, given that the SQL Server already provides structured access to its database file through facilities such as views, stored procedures and associated data readers. Consequently, an alternative form of a domain specification may express the desired structure of the domain store 160 only, without necessarily providing information on how to extract instances of entity types from a data store 150.
- Entity types would be created in the domain specification corresponding to tables, views or stored procedures of interest, and properties created in each entity type accordingly.
- Instances of entity types would likely be presented to the domain store 160 via Application Programming Interface (API) or service calls provided by embodiments of the invention, subsequently referred to as API calls.
- Committing external structured data to the domain store 160 in this way, an instance at a time, may be beneficial, for example, in providing: a historical perspective that was lacking in the original storage format; unique types of report; an alternative, more efficient (de-duplicated) form of storage; an efficient means for gaining an audit trail of data; and more natural sub-typing and graph relationships.
- Presentation of instances of entity types to the domain store 160 operates much as already described, and as seen in Figure 3, with the following possible exceptions.
- The step 305 of determining a file type does not apply, and at step 310 the presenter of instances would likely determine the domain by enumerating available domains using API calls, choosing the domain that best fits the overall meaning of the instances about to be presented.
- The presenter of instances may also create or enhance domains to suit the structures of the instances being presented. Determining of the actual domain at step 310, provision of any context information that the presenter might wish to supply, and creation of the domain store of step 315 would all likely occur at the same time as step 320 in a single API call. In such an API call a Node ID would be established for use in association with one or more subsequent presentations of instances.
- An instance of an entity type is provided, and so selection is not necessary other than to ensure that the presented instance is associated with the correct entity type in the domain store 160.
- The API calls that enumerate entity types (and properties), typically for a given domain, facilitate correct association of an instance with an entity type, typically via likeness between entity type name and properties as compared to table name and columns, to continue the SQL Server example.
- Property values are already provided in the presented instance, and so selection is not necessary other than to ensure that provided properties correspond with those defined for the entity type.
- Where they do not correspond, dynamic structure techniques apply, either to reject the rogue presentation or to dynamically adapt to the presentation by allowing properties to be created as needed. Dynamic structure techniques may be similarly applied to rogue entity types.
- Step 375 would also differ in that continuation of the presentation of instances of the current, or different, entity types would be at the discretion of the presenter of these instances.
- One or more parental contexts may be provided with each presentation, and each presentation may return a context to the presenter as provision for the next parental context.
- The parental context may include entity type and order of occurrence in addition to usual identifiers such as an Auto ID.
- This parental context, especially where it is based on the very specific Auto ID, implies that presentation of a parent must be followed by presentation of all children (progressing from shallowest to deepest) before another instance at the initial (parent) level can be presented.
- Where the domain store 160 is to be populated from presentation of instances, this "one parent at a time, with all depth" order of presentation may be limiting, and so a context based on Key Value, or Values, for one or more parents may also be utilised, with API calls being provided to return the key value, or values, for a given entity in order to form a context.
- This method of presenting context allows greater flexibility of presentation order, but it incurs some performance overhead, as the exact context must be recovered from the combination of key value and Node ID in a fashion similar to that described for traversal.
- A consequence of a more flexible order of presentation is that the current index, i.e. the position of a given presentation amongst its sibling presentations, may also be required in the context.
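A presentation interface along these lines might be sketched as below. The class `PresentationSession`, its method names, the dictionary form of a context and the "Orders"/"OrderLines" entity types are hypothetical; the sketch uses an Auto ID based parental context and records the index amongst siblings, and it shows the two dynamic structure options (reject or adapt) as a single flag.

```python
from typing import Dict, List, Optional, Set


class PresentationSession:
    """One session of presenting already-structured instances under a Node ID."""

    def __init__(self, node_id: int, entity_types: Dict[str, Set[str]],
                 adapt: bool = False) -> None:
        self.node_id = node_id
        self.entity_types = {name: set(props) for name, props in entity_types.items()}
        self.adapt = adapt                       # adapt to, or reject, rogue input
        self.rows: List[dict] = []
        self._sibling_counts: Dict[Optional[int], int] = {}

    def present(self, entity_type: str, props: dict,
                parent: Optional[dict] = None) -> dict:
        if entity_type not in self.entity_types:
            if not self.adapt:
                raise KeyError(f"rogue entity type {entity_type!r}")
            self.entity_types[entity_type] = set()
        unknown = set(props) - self.entity_types[entity_type]
        if unknown:
            if not self.adapt:
                raise ValueError(f"rogue properties {sorted(unknown)}")
            self.entity_types[entity_type] |= unknown    # create properties as needed
        parent_id = parent["auto_id"] if parent else None
        index = self._sibling_counts.get(parent_id, 0)   # position amongst siblings
        self._sibling_counts[parent_id] = index + 1
        auto_id = len(self.rows) + 1
        self.rows.append({"auto_id": auto_id, "entity_type": entity_type,
                          "props": dict(props), "parent": parent_id, "index": index})
        # The returned context can be passed as the parent of later presentations.
        return {"auto_id": auto_id, "entity_type": entity_type, "index": index}


session = PresentationSession(7, {"Orders": {"OrderId"}, "OrderLines": {"Sku"}})
order = session.present("Orders", {"OrderId": 1})
line_a = session.present("OrderLines", {"Sku": "A"}, parent=order)
line_b = session.present("OrderLines", {"Sku": "B"}, parent=order)
```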
- The binary or text field has the option of storing its data in compressed form.
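As a small illustration of optional compression, the sketch below stores a field compressed only when that actually saves space. The use of `zlib`, the flag returned alongside the payload and the function names are assumptions; the source does not name a compression scheme.

```python
import zlib
from typing import Tuple


def encode_field(data: bytes) -> Tuple[bool, bytes]:
    """Return (compressed?, payload), keeping the raw bytes if compression does not help."""
    packed = zlib.compress(data)
    if len(packed) < len(data):
        return True, packed
    return False, data


def decode_field(compressed: bool, payload: bytes) -> bytes:
    return zlib.decompress(payload) if compressed else payload
```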
- Any such software may be stored in the form of volatile or non-volatile storage such as, for example, a storage device like a ROM, whether erasable or rewritable or not, or in the form of memory such as, for example, RAM, memory chips, device or integrated circuits or on an optically or magnetically readable medium such as, for example, a CD, DVD, magnetic disk or magnetic tape.
- the storage devices and storage media are embodiments of machine-readable storage that are suitable for storing a program or programs that, when executed, implement embodiments of the present invention. Accordingly, embodiments provide a program comprising code for implementing a system or method as claimed in any preceding claim and a machine readable storage storing such a program. Still further, embodiments of the present invention may be conveyed electronically via any medium such as a communication signal carried over a wired or wireless connection and embodiments suitably encompass the same.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB1412737.7A GB2512782B (en) | 2013-01-31 | 2014-01-31 | Method and system for data storage |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GBGB1301692.8A GB201301692D0 (en) | 2013-01-31 | 2013-01-31 | Method and apparatus for data store managment |
GB1301692.8 | 2013-01-31 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2014118560A1 true WO2014118560A1 (fr) | 2014-08-07 |
Family
ID=47988454
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/GB2014/050269 WO2014118560A1 (fr) | 2013-01-31 | 2014-01-31 | Procédé et système de stockage de données |
Country Status (2)
Country | Link |
---|---|
GB (1) | GB201301692D0 (fr) |
WO (1) | WO2014118560A1 (fr) |
-
2013
- 2013-01-31 GB GBGB1301692.8A patent/GB201301692D0/en not_active Ceased
-
2014
- 2014-01-31 WO PCT/GB2014/050269 patent/WO2014118560A1/fr active Application Filing
Also Published As
Publication number | Publication date |
---|---|
GB201301692D0 (en) | 2013-03-20 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
ENP | Entry into the national phase |
Ref document number: 1412737 Country of ref document: GB Kind code of ref document: A Free format text: PCT FILING DATE = 20140131 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 1412737.7 Country of ref document: GB |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 14702933 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 14702933 Country of ref document: EP Kind code of ref document: A1 |