GB2517688A - Storage system and method for data object storage managing in a storage system - Google Patents
Storage system and method for data object storage managing in a storage system
- Publication number
- GB2517688A GB1315180.8A GB201315180A
- Authority
- GB
- United Kingdom
- Prior art keywords
- data object
- storage
- data
- grouping
- ranking
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/0671—In-line storage system
- G06F3/0683—Plurality of storage devices
- G06F3/0689—Disk arrays, e.g. RAID, JBOD
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/174—Redundancy elimination performed by the file system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/185—Hierarchical storage management [HSM] systems, e.g. file migration or policies thereof
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/0608—Saving storage space on storage systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/061—Improving I/O performance
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0638—Organizing or formatting or addressing of data
- G06F3/0643—Management of files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0646—Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0646—Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
- G06F3/0647—Migration mechanisms
- G06F3/0649—Lifecycle management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0646—Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
- G06F3/0652—Erasing, e.g. deleting, data cleaning, moving of data to a wastebasket
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/067—Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A storage system includes a storage pool (200) with at least one storage media (210, 220, 230) and storage management, wherein the storage management stores a received new data object in the at least one storage media (210, 220, 230) of the storage pool (200); wherein the storage management comprises an analytic engine (300) analyzing the new data object based on content of the new data object; wherein the analytic engine (300) comprises a classification component (310) classifying the new data object into predefined data object type classes (312, 314, 316, 318, e.g. picture, text, audio, video); a grouping component (320) creating a data object specific grouping vector for the new data object, which comprises at least one content related scalar (e.g. time of picture, number of people in picture, persons in picture), and grouping data objects of a corresponding type class (312, 314, 316, 318) in different data object groups (322, 324, 326) based on corresponding grouping vectors of the data objects (e.g. part match); and a ranking component (330) ranking the data objects of a corresponding data object group (322, 324, 326) based on a data object specific ranking vector comprising at least one quality scalar (e.g. sharpness of picture etc) for each data object group (322, 324, 326). The ranking results may be used to execute migration policies.
Description
STORAGE SYSTEM AND METHOD FOR DATA OBJECT STORAGE MANAGING IN A STORAGE SYSTEM
BACKGROUND OF THE INVENTION
Field of the Invention
The present invention relates in general to the field of data storage management, and in particular to a storage system and a method for data object storage managing in a storage system.
Still more particularly, the present invention relates to a data processing program and a computer program product for data object storage managing in a storage system.
Description of the Related Art
The amount of digitally stored information is dramatically increasing. Recent studies estimate that this growth will continue by a factor of approximately 40 within the upcoming 10 years. In parallel, the storage capacity of storage systems offered by the industry is increasing; however, significant problems arise, such as the limited lifetime of storage systems (7 to 10 years), significant time for data migration (e.g. approximately 3 years for a present 14 PB storage system), and potential upcoming technology changes, when considering that the stored information should remain available to users over a longer time frame, e.g. decades.
Present technology like magnetic storage, optical storage, and SSD storage has physical limitations for storing information over the long term, due to the inherent possibility of bit flips.
Prior art discloses methods to reduce the amount of stored data by data compression and data deduplication. During the deduplication process data objects are split into chunks of fixed or variable size, and redundant information is identified within a storage pool and removed, so every chunk is stored a single time only. This is performed on block level or on a higher level. If chunking is performed file based, one chunk represents one file. If chunking is performed block based, a data object is chunked into blocks. In any case an identical match will be identified by a deduplication algorithm. Duplicate information will be removed by using pointers to the identical pattern.
Referring to FIG. 1, which shows the principle of data deduplication of a prior art storage system 1, a data object 3 or data stream is subject to deduplication. The data object 3 is split into chunks A, B, C, D, E, F of fixed or variable size by a chunking engine 10 and for each chunk an identity character is determined. A deduplication engine 20 determines duplicate chunks A, F, D by referencing identical chunks using a reference pointer, for example. The deduplication engine 20 stores non-identical chunks A, B, C, D, E, F or single instances in a storage component 30. Additional data compression may be performed.
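The chunk-and-pointer principle of FIG. 1 can be illustrated with a minimal sketch. It assumes fixed-size chunks and uses a SHA-256 digest as the "identity character" of a chunk; the function names deduplicate and restore, the chunk size, and the choice of hash are hypothetical and are not taken from the prior art systems described here.

```python
import hashlib


def deduplicate(data: bytes, chunk_size: int = 4):
    """Split data into fixed-size chunks and keep one instance per distinct chunk.

    Returns (store, pointers):
      store    maps a chunk digest to the single stored instance of that chunk
      pointers lists one digest per original chunk, in original order
    """
    store = {}
    pointers = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        # Assumption: a SHA-256 digest stands in for the chunk's identity character.
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)   # store the chunk only the first time it is seen
        pointers.append(digest)           # every occurrence becomes a reference pointer
    return store, pointers


def restore(store, pointers) -> bytes:
    """Rebuild the original data by following the reference pointers."""
    return b"".join(store[digest] for digest in pointers)
```

In this sketch only exact, 100% identical chunks are collapsed, which is the property the remainder of the description contrasts with content-based grouping.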
FIG. 2 shows the prior art principle of deduplication. Referring to FIG. 2, a first data object 3A comprises twelve identical chunks 12A, a second data object 3B comprises nine identical data chunks 12B, and a third data object 3C comprises ten identical chunks 12C. Redundant chunks 12A, 12B, 12C are identified and replaced by appropriate pointers, wherein every chunk 12A, 12B, 12C is stored a single time only. Duplicate information will be removed by using pointers to the identical pattern 12A, 12B, 12C.
FIG. 3 shows prior art technology for data compression.
Referring to FIG. 3, a data stream 5A applied to a storage system 1 from at least one user 7 via a network 2 comprises four bit streams, wherein two bit streams are identical. A compression engine 40 eliminates identical information in the data stream 5A and outputs a compressed data stream 5B via cache storage 32 to a storage component 30. The principle of compression is identical to deduplication. However, here the footprint is limited to the data stream 5A and no references to the entire storage are used. Further, the storage component 30 outputs the compressed data stream 5B via the cache storage 32 to the compression engine 40, which decompresses the compressed data stream 5B and outputs a decompressed data stream 5C via the network 2 to the requesting user 7.
However, the above two technologies alone are not capable of managing the outlined dramatic growth of digitally stored information on a long term scale. In addition, prior art technology focuses on data reduction only. It does not provide the possibility to identify high value information or low value information.
Summary of the Invention
The technical problem underlying the present invention is to provide a storage system and a method for data object storage managing in a storage system, which are able to improve data object storage management by conserving high value information and eliminating redundant information and low value information and to solve the above mentioned shortcomings and pain points of prior art data object storage managing in a storage system.
According to the present invention this problem is solved by providing a storage system having the features of claim 1, a method for data object storage managing in a storage system having the features of claim 6, a data processing program for data object storage managing in a storage system having the features of claim 14, and a computer program product for data object storage managing in a storage system having the features of claim 15. Advantageous embodiments of the present invention are mentioned in the subclaims.
Accordingly, in an embodiment of the present invention a storage system comprises a storage pool with at least one storage media and a storage management. The storage management stores a received new data object in the at least one storage media of the storage pool. The storage management comprises an analytic engine analyzing the new data object based on content of the new data object. The analytic engine comprises a classification component classifying the new data object into predefined data object type classes; a grouping component creating a data object specific grouping vector for the new data object, which comprises at least one content related scalar, and grouping data objects of a corresponding data object type class in different data object groups based on corresponding grouping vectors of the data objects; and a ranking component ranking the data objects of a corresponding data object group based on a data object specific ranking vector comprising at least one quality scalar for each data object group.
In further embodiments of the present invention, the storage manager uses the ranking result to execute different migration policies to each data object of the corresponding data object group.
In further embodiments of the present invention, the storage manager assigns each storage media to at least one rank of a corresponding data object group based on performance quality of the at least one storage media.
In further embodiments of the present invention, the storage manager migrates data objects with a ranking higher than a certain first threshold to a storage media with a highest performance quality and data objects with lower ranking to storage media with lower performance quality.
In further embodiments of the present invention, the storage manager marks data objects with ranking lower than a certain second threshold for deletion.
In another embodiment of the present invention, a method for data object storage managing in a storage system comprising a storage pool with at least one storage media and a storage manager, wherein a received new data object is stored in the at least one storage media of the storage pool, comprises the steps of: notifying an analytic engine of the storage manager about a new data object to be stored in the storage system and starting an analyzing process of the new data object based on content of the new data object; classifying the new data object into predefined data object type classes; creating a data object specific grouping vector for the new data object, which comprises at least one content related scalar; grouping data objects of a corresponding data object type class in different data object groups based on corresponding grouping vectors of the data objects; and ranking the data objects of a corresponding data object group based on a data object specific ranking vector comprising at least one quality scalar for each data object group.
In further embodiments of the present invention, a data object type is determined by analyzing a data object extension during the classifying process of the new data object.
In further embodiments of the present invention, a part match principle is applied to corresponding data object specific grouping vectors of the data objects during the grouping of the data objects of a corresponding data object type class, defining a threshold value for matching the at least one content related scalar of each data object belonging to the same data object group.
In further embodiments of the present invention, variable and user defined matching parameters are used in the part match principle to determine if two data objects belong to the same data object group.
In further embodiments of the present invention, the at least one quality scalar of the data object specific ranking vectors provides a quality measure in a certain value range, wherein ranking of the data objects of a corresponding data object group is executed on a quantitative analysis of the data object specific ranking vectors of the data objects.
In further embodiments of the present invention, each data object of the corresponding data object group is migrated to a storage media of the storage pool based on the ranking result and a corresponding migration policy.
In further embodiments of the present invention, data objects with ranking higher than a first threshold are migrated to a storage media with a highest performance quality and data objects with lower ranking are migrated to storage media with lower performance quality.
In further embodiments of the present invention, data objects with ranking lower than a certain second threshold are marked for deletion.
In another embodiment of the present invention, a data processing program for execution in a data processing system comprises software code portions for performing a method for data object storage managing in a storage system when the program is run on the data processing system.
In yet another embodiment of the present invention, a computer program product stored on a computer-usable medium, comprises computer-readable program means for causing a computer to perform a method for data object storage managing in a storage system when the program is run on the computer.
All in all, embodiments of the present invention disclose an analytical approach to eliminate both redundant and low value information out of huge file storage data pools with the intention to conserve high value information with appropriate mechanisms, and to eliminate redundant information and low value information.
Embodiments of the present invention focus on reduction and/or more intelligent management of stored data objects, especially in network attached storage system environments.
The main idea of the present invention is to classify data objects, to identify similar data objects, and to create data object groups of similar data objects, wherein the data objects are ranked within the single data object groups. Advantageously, migration policies may be applied to the data objects based on the ranking, e.g. delete all data objects with rank > 3. These migration policies might apply automatically or semi-automatically with human intervention.
Embodiments of the present invention propose a new approach for managing data object storage using an analytic engine. The analytic engine performs classifying, grouping and ranking of data objects based on the actual data object content. This in turn offers multiple possibilities for storage management, like long term archiving of identified high value data objects, deletion of low value data objects, and pooling of the data objects based on their rank.
The above, as well as additional purposes, features, and advantages of the present invention will become apparent in the following detailed written description.
Brief Description of the Drawings
A preferred embodiment of the present invention, as described in detail below, is shown in the drawings, in which
FIG. 1 is a schematic block diagram of a prior art principle of data deduplication of a prior art storage system;
FIG. 2 is a conceptual representation of prior art principle of deduplication;
FIG. 3 is a conceptual representation of prior art technology for data compression;
FIG. 4 is a schematic block diagram of a storage system, in accordance with an embodiment of the present invention;
FIG. 5 is a schematic block diagram of an analytic engine for the storage system of FIG. 4 in greater detail, in accordance with an embodiment of the present invention;
FIG. 6 is a schematic flow diagram of a method for data object storage managing in a storage system, in accordance with an embodiment of the present invention; and
FIG. 7 is a schematic representation of the functionality of a grouping process as performed by the analytic engine of FIG. 5, in accordance with an embodiment of the present invention.
Detailed Description of the Preferred Embodiments
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit," "module" or "system." Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
Aspects of the present invention are described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
FIG. 4 shows a storage system, in accordance with an embodiment of the present invention; FIG. 5 shows an analytic engine for the storage system of FIG. 4 in greater detail, in accordance with an embodiment of the present invention; FIG. 6 shows a method for data object storage managing in a storage system, in accordance with an embodiment of the present invention; and FIG. 7 shows the functionality of a grouping process as performed by the analytic engine of FIG. 5, in accordance with an embodiment of the present invention.
Referring to FIGS. 4 and 5, the shown embodiment of the present invention employs a storage system 100 comprising a storage pool 200 with at least one storage media 210, 220, 230 and a storage management 250, which stores a received new data object 130 in the at least one storage media 210, 220, 230 of the storage pool 200. The storage management 250 comprises an analytic engine 300 analyzing the new data object 130 based on content of the new data object 130. The analytic engine 300 comprises a classification component 310, a grouping component 320, and a ranking component 330.
As in prior art technology, data objects 130 are stored within the storage pool 200. The data objects 130 include pictures 132, text 134, audio or music files 136, video files 138, etc., for example. The analytic engine 300 performs an analysis of the storage pool and of new data objects 130, and classifies, groups and subsequently ranks the data objects 130. The analytic engine 300 performs the grouping of the data objects 130 based on content similarities. This is completely different to prior art technologies, where 100% matching redundant chunks are identified by an algorithm.
The classification component 310 of the analytic engine 300 classifies the new data object 130 into predefined data object type classes 312, 314, 316, 318. The grouping component 320 of the analytic engine 300 creates a data object specific grouping vector for the new data object 130, which comprises at least one content related scalar, and groups the data objects of a corresponding data object type class 312, 314, 316, 318 in different data object groups 322, 324, 326 based on corresponding grouping vectors of the data objects. The ranking component 330 of the analytic engine 300 ranks the data objects of a corresponding data object group 322, 324, 326 based on a data object specific ranking vector comprising at least one quality scalar for each data object group 322, 324, 326.
In the shown embodiment the storage manager 250 uses the ranking result to execute different migration policies to each data object of the corresponding data object group 322, 324, 326.
Therefore the storage manager 250 assigns each storage media 210, 220, 230 to at least one rank Rank 1, Rank 2, Rank N of a corresponding data object group 322, 324, 326 based on performance quality of the at least one storage media 210, 220, 230. The storage manager 250 migrates data objects with a ranking higher than a first threshold to a storage media 210 with a highest performance quality and data objects with lower ranking to storage media 220, 230 with lower performance quality. Additionally the storage manager 250 marks data objects with ranking lower than a certain second threshold for deletion.
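The two-threshold policy described above can be sketched as follows. This is a minimal illustration, assuming a numeric ranking score in which a higher value means higher quality and assuming only two storage tiers; the names RankedObject, plan_migration, and the tier labels are hypothetical and are not defined by the embodiment.

```python
from dataclasses import dataclass


@dataclass
class RankedObject:
    name: str
    score: float  # hypothetical ranking value; assumption: higher means better quality


def plan_migration(objects, first_threshold: float, second_threshold: float):
    """Map each data object name to an action based on its ranking score.

    score > first_threshold   -> storage media with the highest performance quality
    score < second_threshold  -> marked for deletion
    otherwise                 -> storage media with lower performance quality
    """
    if second_threshold > first_threshold:
        raise ValueError("second_threshold must not exceed first_threshold")
    plan = {}
    for obj in objects:
        if obj.score > first_threshold:
            plan[obj.name] = "highest_performance_media"
        elif obj.score < second_threshold:
            plan[obj.name] = "mark_for_deletion"
        else:
            plan[obj.name] = "lower_performance_media"
    return plan
```

The sketch only produces a plan and leaves the actual migration or deletion to a separate step, which fits the semi-automatic mode with human intervention mentioned in the summary.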
Referring to FIG. 6, in step S400 a new data object 130 is received and stored in step S410 in the storage pool 200. In case a new data object 130 is sent or updated to the file storage pool 200, the analytic engine 300 receives an automated notification in step S420. In turn the analytic engine 300 starts an analysis process of the new data object 130 in step S500 based on the content of the new data object 130.
In step S510 the analytic engine 300 performs the classification of the new data object 130 using the classification component 310, which determines a type of the new data object 130, e.g. picture 132, text 134, audio file 136, or video file 138.
Therefore the classification component 310 analyzes a data object extension being a suffix to the name of the new data object 130, e.g. separated from the data object name by a dot.
Examples of data object name extensions for pictures are "png", "jpg", "gif", "bmp", "tiff", etc. Examples of data object name extensions for text files are "txt", "doc", "docx", "odt", etc. Examples of data object name extensions for audio files are "mp3", "ots", "wav", "wma", etc. Based on this initial analysis the analytic engine 300 organizes the data objects 132, 134, 136, 138 in different data object classes 312, 314, 316, 318.
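The extension-based classification step can be sketched as a simple lookup. The table below uses the common spellings of a subset of the extensions named above; the names EXTENSION_CLASSES and classify, the case-insensitive comparison, and the "unknown" fallback class are assumptions made for illustration. Video extensions are left out because the description does not list any.

```python
import os

# Assumption: an illustrative subset of extensions per data object type class.
EXTENSION_CLASSES = {
    "picture": {"png", "jpg", "gif", "bmp", "tiff"},
    "text": {"txt", "doc", "docx", "odt"},
    "audio": {"mp3", "wav", "wma"},
}


def classify(object_name: str) -> str:
    """Return the type class for a data object name based on its suffix after the last dot."""
    extension = os.path.splitext(object_name)[1].lstrip(".").lower()
    for type_class, extensions in EXTENSION_CLASSES.items():
        if extension in extensions:
            return type_class
    return "unknown"  # hypothetical fallback; the description does not define one
```

For instance, classify("holiday.JPG") yields "picture", while a name without any extension falls into the fallback class.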
Tn step 5520 the analytic engine 300 performs the grouping of the data objects 132, 134, 136, 138 of each data object class 312, 314, 316, 318 using the grouping component 320. In general the grouping component 320 generates an n-dimensional grouping vector G for each data object 132, 134, 136, 138, representing the data object 132, 134, 136, 138. Based on the grouping vectors G the analytic engine 300 generates groups 322, 324, 326 on a part match principle. A variable and user defined matching parameter are used to determine, if two data objects 132, 134, 136, 138 belong to the same group 322, 324, 326. This could be a matching by 98% for example. The mechanism of the matching principle will be outlined in the following examples.
Tn general the generation of the grouping vector G is different for each data object class 312, 314, 316, 318, e.g. pictures 312, text 314, music 316, or video 318. In the following the grouping process is outlined for pictures 312 in a first example and text files 314 in a second example.
Example 1:
The first example relates to the analytic engine 300 grouping process based on content analysis for pictures 312. To execute the grouping the analytic engine 300 generates the grouping vector G. The grouping vector G is built by the following scalars, for example:
g1: time stamp
g2: number of persons
g3: person, identified by face recognition technology, for example. Each person is represented by a number generated by a set of face dimensions, e.g. eye distance, ear distance, head diameter, etc.
g4: number of objects, e.g. buildings, vehicles, etc.
g5: objects. Each type of object will be represented by a number.
In alternative embodiments of the present invention, more or fewer scalars may be used to define the grouping vector G.
In the first example g2, g3, g4, g5 are content related scalars in the above exemplary list. In the following the group generation by the grouping component 320 of the analytic engine 300 is shown for two pictures represented by a limited number of four scalars as an example.
For a first picture the grouping vector G_picture1 is defined by the following scalars g1 to g3.2:
g1 = 1214617821 (Unix time)
g2 = 2 (2 people)
g3.1 = 56789243 (number representing a first person; the number is generated by a set of face dimensions)
g3.2 = 23978744 (number representing a second person)
G_picture1 = (1214617821, 2, 56789243, 23978744)
For a second picture the grouping vector G_picture2 is defined by the following scalars g1 to g3.2:
g1 = 1214617831 (Unix time)
g2 = 2 (2 people)
g3.1 = 56789245 (number representing a first person; the number is generated by a set of face dimensions)
g3.2 = 23978745 (number representing a second person)
G_picture2 = (1214617831, 2, 56789245, 23978745)
Now the grouping component 320 of the analytic engine 300 performs the part match process:
G_picture2 - G_picture1 = (10, 0, 2, 1)
This means that, relative to the original values, the first picture and the second picture match better than 99%. In turn the grouping component 320 of the analytic engine 300 adds the first picture and the second picture to the same group.
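The part match for the two pictures can be reproduced with a short sketch. The description does not define how the match percentage is computed; the formula below (one minus the largest per-scalar deviation relative to the larger original value) is an assumption chosen to be consistent with the "better than 99%" result, and the names `part_match` and `MATCHING_PARAMETER` are hypothetical.

```python
MATCHING_PARAMETER = 0.98  # user-defined matching parameter, e.g. 98%


def part_match(a, b):
    """Return a match score in [0, 1] for two numeric grouping vectors.

    Assumed metric: 1 minus the largest per-scalar relative deviation.
    """
    if len(a) != len(b):
        raise ValueError("grouping vectors must have the same dimension")
    scores = []
    for x, y in zip(a, b):
        scale = max(abs(x), abs(y))
        scores.append(1.0 if scale == 0 else 1.0 - abs(x - y) / scale)
    return min(scores)


g_picture1 = (1214617821, 2, 56789243, 23978744)
g_picture2 = (1214617831, 2, 56789245, 23978745)

difference = tuple(y - x for x, y in zip(g_picture1, g_picture2))
same_group = part_match(g_picture1, g_picture2) >= MATCHING_PARAMETER
```

Here `difference` is `(10, 0, 2, 1)` and `same_group` is `True`, matching the outcome of the example. Other metrics, such as a weighted mean over the scalars, would be equally compatible with the text.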
Example 2:
The second example relates to the analytic engine 300 grouping process based on content analysis for text files 314. To execute the grouping the grouping component 320 of the analytic engine 300 generates the grouping vector G. The grouping vector G is built by the following scalars, for example:
g1: time stamp
g2: filename
g3: number of key phrases
g4: key phrases
To determine key phrases in a text automatically, well known prior art key phrase extraction processes may be used.
In alternative embodiments of the present invention, more or fewer scalars may be used to define the grouping vector G. In the second example g3 and g4 are content related scalars in the above exemplary list. In the following the group generation by the grouping component 320 of the analytic engine 300 is shown for two text files represented by a limited number of four scalars as an example.
Text 1: Mail Online, April 30, 2013
"The Queen of the Netherlands announced last night that she was abdicating in favor of her son and heir after 33 years on the throne. In a broadcast on Dutch state television three days before her 75th birthday, Queen Beatrix said she was stepping down because she believed the responsibility should now lie in the hands of a 'new generation'."
For the first text the grouping vector G_text1 is defined by the following scalars g1 to g4:
g1 = 2378923456 (Unix time)
g2 = file1.nsf
g3 = 6
g4 = Queen, Beatrix, Dutch, Orange-Nassau, 77 birthday, generation
G_text1 = (2378923456, 6, Queen, Beatrix, Netherlands, Orange-Nassau, 77 birthday, generation)
Text 2: New York Times, April 30, 2013
"To the cheers of tens of thousands of people crammed shoulder to shoulder outside the royal palace here, Wilhelm-Alexander of the House of Orange-Nassau became the Netherlands' first king in 123 years on Tuesday as his mother, Queen Beatrix, ended a 33-year reign with the stroke of a pen, signing the act of abdication in a chandeliered chamber at the royal palace."
For the second text the grouping vector G_text2 is defined by the following scalars g1 to g4:
g1 = 2378923456 (Unix time)
g2 = file2.nsf
g3 = 7
g4 = palace, Queen, Beatrix, Orange-Nassau, Wilhelm-Alexander, Netherlands, king
G_text2 = (2378923456, 7, palace, Queen, Beatrix, Orange-Nassau, Wilhelm-Alexander, Netherlands, king)
In the second example four out of six key phrases of the first text match key phrases of the second text. By using thesaurus databases the grouping component 320 of the analytic engine 300 could be more precise in determining matching key phrases. If the user rates this matching as acceptable, the grouping component 320 of the analytic engine 300 adds the first text and the second text to the same group.
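The key phrase comparison can be sketched as a set overlap. The sketch uses the key phrases as they appear in the two grouping vectors; the function name `matching_phrases` and the tiny synonym table standing in for a thesaurus database are assumptions for illustration.

```python
def matching_phrases(first, second, synonyms=None):
    """Return the key phrases of `first` that also occur in `second`.

    `synonyms` is a hypothetical stand-in for a thesaurus lookup: it maps a
    lower-cased phrase to a canonical form before comparison.
    """
    synonyms = synonyms or {}

    def normalize(phrase):
        phrase = phrase.strip().lower()
        return synonyms.get(phrase, phrase)

    second_normalized = {normalize(p) for p in second}
    return [p for p in first if normalize(p) in second_normalized]


text1 = ["Queen", "Beatrix", "Netherlands", "Orange-Nassau", "77 birthday", "generation"]
text2 = ["palace", "Queen", "Beatrix", "Orange-Nassau", "Wilhelm-Alexander",
         "Netherlands", "king"]

matched = matching_phrases(text1, text2)
match_ratio = len(matched) / len(text1)
```

Four of the six key phrases of the first text are found in the second text, so `match_ratio` is 4/6. A phrase such as "Dutch" would only be counted when a synonym table maps it to "netherlands", which illustrates why a thesaurus could make the matching more precise.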
In step S530 the analytic engine 300 performs the ranking of the data objects 132, 134, 136, 138 of each data object group 322, 324, 326 using the ranking component 330. The ranking component 330 of the analytic engine 300 ranks the data objects 132, 134, 136, 138 within each group 322, 324, 326 based on the quality of the data object 132, 134, 136, 138. The quality is determined by the content of the data objects 132, 134, 136, 138.
Each file within a group is associated with a ranking vector R = {r1, r2, r3, ..., rn}. The ranking is based on a quantitative analysis of the ranking vector R. Each scalar provides a measure of the quality in a range from 0 (low) to 10 (high).
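One possible quantitative analysis of the ranking vector R is sketched below. The description leaves the exact analysis open; averaging the scalars and sorting in descending order is an assumption, as are the function name `rank_group` and the sample values.

```python
from statistics import mean


def rank_group(vectors):
    """Rank the data objects of one group by the mean of their ranking vector.

    `vectors` maps a data object name to its ranking vector R, whose scalars
    must lie in the range 0 (low) to 10 (high). Returns (rank, name, score)
    tuples, with rank 1 for the highest score.
    """
    for name, r in vectors.items():
        if not r or any(not 0 <= scalar <= 10 for scalar in r):
            raise ValueError(f"invalid ranking vector for {name!r}")
    ordered = sorted(vectors, key=lambda name: mean(vectors[name]), reverse=True)
    return [(rank, name, mean(vectors[name]))
            for rank, name in enumerate(ordered, start=1)]


# Hypothetical ranking vectors for three objects of one group.
group = {
    "object_b": [5.0, 6.0, 4.0, 5.0],
    "object_a": [9.0, 10.0, 8.0, 7.0],
    "object_c": [1.0, 2.0, 0.0, 1.0],
}
ranking = rank_group(group)
```

With these sample values `object_a` receives Rank 1, `object_b` Rank 2 and `object_c` Rank 3. A weighted sum would be a natural alternative when some attributes matter more than others.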
The ranking vector R for pictures is built by the following attributes, for example:
r1: sharpness
r2: red-eye identification
r3: open/closed eyes
r4: whether the people are centered
The ranking vector R for text is built by the following attributes, for example:
r1: number of identified keywords (the more the better)
r2: number of typos
r3: quality of keywords (compared to reference)
r4: quality of sentences
In step S540 the storage manager 250 applies different migration policies to each data object 132, 134, 136, 138 within a group 322, 324, 326. Data objects 132, 134, 136, 138 with the highest rank, e.g. Rank 1, can be migrated to a gold storage pool. This could be, for example, a first storage media 210 with the highest performance, like SSD or high performance hard disk drives. Lower ranked data objects 132, 134, 136, 138, e.g. Rank 2, can be migrated to a silver storage pool. This could be, for example, a second storage media 220 with lower performance, like tape storage. Lowest ranked data objects 132, 134, 136, 138, e.g. Rank N, can be migrated to a bronze storage pool 230. These data objects 132, 134, 136, 138 could now be marked for deletion. The following list outlines some possible migration policies:
Rank 1 Pool: High value data objects: Migrate to gold storage pool, e.g. SSDs or fast disk storage.
Rank 2 Pool: Medium value files: Migrate to silver storage pool, e.g. tape storage.
Rank 3 Pool: Low value files: Migrate to bronze storage pool, e.g. keep for a defined number of days before deletion.
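The listed migration policies can be expressed as a small lookup table. This is a sketch under stated assumptions: the `Policy` type, the function `policy_for` and the retention period of 30 days are hypothetical, since the description only speaks of "a defined number of days".

```python
from typing import NamedTuple, Optional


class Policy(NamedTuple):
    pool: str
    example_media: str
    delete_after_days: Optional[int]  # None means no scheduled deletion


POLICIES = {
    1: Policy("gold", "SSD or fast disk storage", None),
    2: Policy("silver", "tape storage", None),
    3: Policy("bronze", "kept for a defined number of days", 30),  # 30 is assumed
}


def policy_for(rank: int) -> Policy:
    """Return the migration policy for a rank; ranks beyond 3 fall into the bronze pool."""
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    return POLICIES[min(rank, max(POLICIES))]
```

For instance, `policy_for(1).pool` is `"gold"`, while any rank of 3 or lower in value terms (3, 4, ...) resolves to the bronze policy with its retention period.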
FIG. 7 shows an example of a result of the grouping process performed by the grouping component 320 of the analytic engine 300. Referring to FIG. 7, the shown example of a data object group comprises six similar but not identical data objects 130A, 130B, 130C, 130D, 130E, 130F as identified by the grouping component 320. To outline the different data content, a first data object 130A comprises twelve first data chunks 140 represented by a first hatching and a first shape, nine second data chunks 142 represented by a second hatching and a second shape, and ten third data chunks 144 represented by a third hatching and a third shape. A second data object 130B comprises eleven first data chunks 140 represented by the first hatching and the first shape, eight second data chunks 142 represented by the second hatching and the second shape, and nine third data chunks 144 represented by the third hatching and the third shape. A third data object 130C comprises eleven first data chunks 140 represented by the first hatching and the first shape, eight second data chunks 142 represented by the second hatching and the second shape, nine third data chunks 144 represented by the third hatching and the third shape, one fourth data chunk 140A represented by the first shape without hatching, one fifth data chunk 142A represented by the second shape without hatching, and one sixth data chunk 144A represented by the third shape without hatching. A fourth data object 130D comprises ten first data chunks 140 represented by the first hatching and the first shape, seven second data chunks 142 represented by the second hatching and the second shape, and eight third data chunks 144 represented by the third hatching and the third shape.
A fifth data object 130E comprises twelve seventh data chunks 140B represented by a fourth hatching and the first shape, nine eighth data chunks 142B represented by a fifth hatching and the second shape, and ten ninth data chunks 144B represented by a sixth hatching and the third shape. A sixth data object 130F comprises ten first data chunks 140 represented by the first hatching and the first shape, seven second data chunks 142 represented by the second hatching and the second shape, six third data chunks 144 represented by the third hatching and the third shape, one fourth data chunk 140A represented by the first shape without hatching, one fifth data chunk 142A represented by the second shape without hatching, one sixth data chunk 144A represented by the third shape without hatching, one tenth data chunk 140C represented by the fifth hatching and the first shape, one fifth data chunk 142A represented by the second shape without hatching, one eleventh data chunk 142C represented by the fourth hatching and the second shape, one sixth data chunk 144A represented by the third shape without hatching, one twelfth data chunk 144B represented by the sixth hatching and the third shape, one thirteenth data chunk 144C represented by the fourth hatching and the third shape, and one fourteenth data chunk 144D represented by the fifth hatching and the third shape.
The above described analytic engine can be used in network attached storage (NAS) products. The invention can be used to identify very high value data objects that are intended to be kept for a long time. Once identified, these files can be migrated to long lasting storage media.
On the other hand, data objects identified as low value can be marked for potential deletion. Combined with prior art policies, the invention can be used to dramatically reduce the number of stored data objects, with the intention of keeping high value data objects only.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB1315180.8A GB2517688A (en) | 2013-08-26 | 2013-08-26 | Storage system and method for data object storage managing in a storage system |
DE201410111571 DE102014111571A1 (en) | 2013-08-26 | 2014-08-13 | A storage system and method for managing a data object store in a storage system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB1315180.8A GB2517688A (en) | 2013-08-26 | 2013-08-26 | Storage system and method for data object storage managing in a storage system |
Publications (2)
Publication Number | Publication Date |
---|---|
GB201315180D0 GB201315180D0 (en) | 2013-10-09 |
GB2517688A true GB2517688A (en) | 2015-03-04 |
Family
ID=49355900
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB1315180.8A Withdrawn GB2517688A (en) | 2013-08-26 | 2013-08-26 | Storage system and method for data object storage managing in a storage system |
Country Status (2)
Country | Link |
---|---|
DE (1) | DE102014111571A1 (en) |
GB (1) | GB2517688A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106604111A (en) * | 2016-12-16 | 2017-04-26 | 深圳市九洲电器有限公司 | Set-top box Flash data storage method and set-top box Flash data storage system |
EP3185136A1 (en) * | 2015-12-22 | 2017-06-28 | Incubaid Business Center NV | A mass data storage system and method |
US20200134198A1 (en) * | 2018-10-31 | 2020-04-30 | EMC IP Holding Company LLC | Intelligent data protection platform with multi-tenancy |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060004820A1 (en) * | 2004-07-01 | 2006-01-05 | Claudatos Christopher H | Storage pools for information management |
US7693877B1 (en) * | 2007-03-23 | 2010-04-06 | Network Appliance, Inc. | Automated information lifecycle management system for network data storage |
CN103313090A (en) * | 2012-03-16 | 2013-09-18 | 腾讯科技(深圳)有限公司 | Method and system for off-line downloading video files |
-
2013
- 2013-08-26 GB GB1315180.8A patent/GB2517688A/en not_active Withdrawn
-
2014
- 2014-08-13 DE DE201410111571 patent/DE102014111571A1/en not_active Ceased
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3185136A1 (en) * | 2015-12-22 | 2017-06-28 | Incubaid Business Center NV | A mass data storage system and method |
WO2017108482A1 (en) | 2015-12-22 | 2017-06-29 | Incubaid Business Center Nv | A mass data storage system and method |
CN106604111A (en) * | 2016-12-16 | 2017-04-26 | 深圳市九洲电器有限公司 | Set-top box Flash data storage method and set-top box Flash data storage system |
US20200134198A1 (en) * | 2018-10-31 | 2020-04-30 | EMC IP Holding Company LLC | Intelligent data protection platform with multi-tenancy |
EP3647931A1 (en) * | 2018-10-31 | 2020-05-06 | EMC IP Holding Company LLC | Intelligent data protection platform with multi-tenancy |
CN111125746A (en) * | 2018-10-31 | 2020-05-08 | Emc知识产权控股有限公司 | Multi-tenant intelligent data protection platform |
US10943016B2 (en) | 2018-10-31 | 2021-03-09 | EMC IP Holding Company LLC | System and method for managing data including identifying a data protection pool based on a data classification analysis |
CN111125746B (en) * | 2018-10-31 | 2023-10-31 | Emc知识产权控股有限公司 | Multi-tenant intelligent data protection platform |
Also Published As
Publication number | Publication date |
---|---|
GB201315180D0 (en) | 2013-10-09 |
DE102014111571A1 (en) | 2015-02-26 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WAP | Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1) |