WO2020080931A1 - Management of data for content based data locality search - Google Patents

Management of data for content based data locality search

Info

Publication number
WO2020080931A1
Authority
WO
WIPO (PCT)
Prior art keywords
partition
data
search
minimum
maximum
Prior art date
Application number
PCT/MY2019/050076
Other languages
French (fr)
Inventor
Meng Wei CHUA
Weiying KOK
Chuan Hai NGO
Yasaman EFTEKHARYPOUR
Original Assignee
Mimos Berhad
Priority date
Filing date
Publication date
Application filed by Mimos Berhad
Publication of WO2020080931A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries

Definitions

  • the present disclosure relates to the field of management and storage of data in a storage medium, in particular to a system and method for managing configuration of a storage medium.
  • United States Patent Number US 8,572,091 B1 provides a system and method for implementing a scalable data storage service that maintains tables in a non-relational data store.
  • the cited reference discloses partitioning and indexing of items stored in the tables according to a respective primary key that consists of a hash key component and a range key component. The system determines whether a range key attribute of a query is completely, partially or not within range of each partitioned item.
  • however, in the cited reference, partitioning and indexing are based on comparing the range key attribute of the query with the range key component of each partitioned item; categorization of each matching partition based on the relevancy of the partition with respect to the query is not taught or even suggested.
  • there is therefore a need for a partitioning and indexing management system that significantly improves locating data among file partitions and reduces the amount of data to be searched with respect to a search query.
  • the present disclosure proposes a system and method for managing a storage medium for locating data locations among a plurality of partitions of the storage medium.
  • the storage medium can be managed using a data manager operatively coupled with a computing device.
  • An aspect of the present disclosure relates to a system comprising a data manager that is operatively coupled with a computing device, characterized in that the data manager enables management of a storage medium for locating data locations among a plurality of partitions of the storage medium by means of: a data indexer engine for indexing each partition of the plurality of partitions by examining contents of each partition and extracting any or a combination of a minimum value and a maximum value of the contents as an index; a data locality engine for locating, in response to a search query, at least one target partition of the plurality of partitions that stores a target data, by setting a data locality for each partition based on comparing a search attribute with any or a combination of the minimum value and the maximum value of the contents stored as the index; and a local search engine for searching the target data in the at least one target partition based on the data locality of each partition.
  • the indexing further comprises: selecting a partition of the plurality of partitions to determine data type of the contents of said partition; examining, when the determined data type of the contents is numeric, numeric data in the partition to find a maximum numeric value and a minimum numeric value of the numeric data and storing the maximum numeric value and the minimum numeric value as index; and extracting, when the determined data type of the contents is string, length of the string and a string prefix for plurality of records of the partition to determine a maximum string prefix and a minimum string prefix in lexical order and a maximum string length and a minimum string length, and storing any or a combination of the determined maximum string prefix, the determined minimum string prefix, the determined maximum string length and the determined minimum string length as index.
  • data locality for each partition is set as any of ‘all’, ‘partial’ or ‘none’ based on said comparing.
  • the searching further comprises: examining data locality of each partition; retrieving, when data locality for the partition is set as ‘all’, all data from said partition; and searching, when data locality for the partition is set as ‘partial’, the target data against the search query to retrieve match data from said partition.
  • the target partition comprises the partition associated with data locality as ‘all’ or ‘partial’. According to an embodiment, the searching is not performed in the partition associated with data locality as ‘none’.
  • the locating further comprises: selecting a partition of the plurality of partitions to perform data locality check on said partition; determining data type of the contents of said partition; and performing the data locality check based on the determined data type.
  • the data locality check comprises: extracting the minimum numeric value and the maximum numeric value of the partition from the index; examining a search value range pertaining to the target data based on the search query; setting data locality of the partition as ‘all’ when the minimum numeric value and the maximum numeric value are within the search value range; setting data locality of the partition as ‘partial’ when the minimum numeric value and the maximum numeric value are not within the search value range and the search value range is within the minimum numeric value and the maximum numeric value; and setting data locality of the partition as ‘partial’ when any of the minimum numeric value or the maximum numeric value is within the search value range, else setting data locality of the partition as ‘none’.
  • said data locality check comprises: extracting any or a combination of the minimum string length and the maximum string length and the minimum string prefix and the maximum string prefix of the partition from the index; extracting a search string length range and a search string prefix pertaining to the target data based on the search query; setting data locality of the partition as ‘none’ when the search string length range does not overlap with the minimum string length and the maximum string length; determining whether the search string prefix matches with the minimum and maximum string prefix when the search string length range overlaps with the minimum string length and the maximum string length; in response to said determining being negative, setting data locality of the partition as ‘partial’ when the search string prefix is within the minimum and the maximum string prefix, else setting data locality of the partition as ‘none’; and in response to said determining being affirmative, setting data locality of the partition as ‘partial’ when the minimum string length and the maximum string length are not within the search string length range, else: setting data locality of the
  • Another aspect of the present disclosure relates to a method for managing configuration of a storage medium for locating data locations among a plurality of partitions of the storage medium, characterized in that the method comprises the steps of: configuring a data manager that is operatively coupled with a computing device, wherein the data manager performs the steps of: indexing each partition of the plurality of partitions by examining contents of each partition and extracting any or a combination of a minimum value and a maximum value of the contents as an index; locating, in response to a search query, at least one target partition of the plurality of partitions that stores a target data, by setting a data locality for each partition based on comparing a search attribute with any or a combination of the minimum value and the maximum value of the contents stored as the index; and searching the target data in the at least one target partition based on the data locality of each partition.
  • FIG. 1 illustrates an exemplary network architecture in which or with which proposed system can be implemented in accordance with an embodiment of the present disclosure.
  • FIGs. 2A-B illustrate exemplary implementations of the proposed system in accordance with an embodiment of the present disclosure.
  • FIG. 3 is a flow diagram representing working of the proposed system in accordance with an embodiment of the present disclosure.
  • FIG. 4 is a flow diagram representing working of the data indexer engine in accordance with an embodiment of the present disclosure.
  • FIG. 5 is a flow diagram representing working of the data locality engine in accordance with an embodiment of the present disclosure.
  • FIGs. 6A-B illustrate flow diagram and exemplary representations of data locality check when data type of the partition is numeric in accordance with an embodiment of the present disclosure.
  • FIGs. 7A-C illustrate flow diagram and exemplary representations of data locality check when data type of the partition is string in accordance with an embodiment of the present disclosure.
  • FIG. 8 is a flow diagram representing working of the locality search engine in accordance with an embodiment of the present disclosure.
  • FIG. 9 illustrates an exemplary computer system in which or with which embodiments of the present invention can be utilized in accordance with embodiments of the present invention.
  • Storage Medium: Any computing hardware that is used for storing, porting and extracting data files and objects. It can hold and store information both temporarily and permanently, and can be internal or external to a computer, server or any similar computing device.
  • Partition: A separate region created by partitioning the storage medium into one or more regions, so that an operating system can manage information in each region separately.
  • the present disclosure relates to a system that comprises a data manager.
  • the data manager enables management of a storage medium for locating data locations among a plurality of partitions of the storage medium.
  • the data manager can include a data indexer engine for indexing each partition of the plurality of partitions by examining contents of each partition and extracting any or a combination of a minimum value and a maximum value of the contents as an index.
  • the data manager can include a data locality engine for locating, in response to a search query, at least one target partition of the plurality of partitions that stores a target data, by setting a data locality for each partition based on comparing a search attribute with any or a combination of the minimum value and the maximum value of the contents stored as the index and a local search engine for searching the target data in the at least one target partition based on the data locality of each partition.
  • FIG. 1 illustrates an exemplary architecture (100) in which or with which proposed system can be implemented in accordance with an embodiment of the present disclosure.
  • proposed system can include a data manager (102) that can be implemented in a computing device.
  • the data manager (102) can enable management of a storage medium such as data repository (118) for locating data locations among a plurality of partitions of the data repository (118).
  • the computing device can be any device using any or a combination of hardware components and software components such as a computing device, a security device, a network device, a mobile phone, a desktop computer, a personal computer, a laptop, a tablet PC, a portable computer, a personal digital assistant and the like, such that a user can interact with the data manager (102) using the computing device.
  • partitioning addresses a key problem of supporting very large tables and indexes by allowing the contents of the tables to be decomposed into smaller and more manageable units called partitions.
  • an object is used to define how rows (or indexes) of a partitioned table are mapped to a set of partitions based on the values of a certain column, called a partitioned column. Further, the number of partitions that the table will have and the boundaries of the partitions are also defined.
  • the data manager (102) aims to enable fast searching by locating data location among partitions and perform searching on fewer partitions, which reduces total disk Input/ Output time and overall searching time while maintaining minimal size of the index.
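The mapping of rows to partitions by boundary values can be illustrated with a minimal sketch. It assumes range partitioning on a single column with sorted boundary values, where each boundary is the inclusive lower bound of the next partition. The function names `partition_for` and `partition_rows` and the column name `age` are hypothetical and do not come from the disclosure.

```python
from bisect import bisect_right
from typing import Any, Dict, List, Sequence


def partition_for(value: float, boundaries: Sequence[float]) -> int:
    """Return the partition number for a value of the partitioned column.

    Assumption: `boundaries` is sorted ascending; N boundaries define N + 1
    partitions, and a value equal to a boundary goes to the higher partition.
    """
    return bisect_right(boundaries, value)


def partition_rows(
    rows: Sequence[Dict[str, Any]], column: str, boundaries: Sequence[float]
) -> List[List[Dict[str, Any]]]:
    """Distribute rows into partitions based on the partitioned column."""
    partitions: List[List[Dict[str, Any]]] = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        partitions[partition_for(row[column], boundaries)].append(row)
    return partitions


rows = [{"age": 5}, {"age": 20}, {"age": 39}, {"age": 40}]
parts = partition_rows(rows, "age", boundaries=[20, 40])
# parts[0] holds age 5, parts[1] holds ages 20 and 39, parts[2] holds age 40
```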
  • a system which may comprise a data manager (102), can include one or more processor(s) (104).
  • the processor(s) (104) can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, logic circuitries, and/or any devices that manipulate data based on operational instructions.
  • the processor(s) (104) are configured to fetch and execute computer-readable instructions stored in a memory (106) of the system.
  • the memory (106) can store one or more computer-readable instructions or routines, which may be fetched and executed to create or share the data units over a network service.
  • the memory (106) can include any non-transitory storage device including, for example, volatile memory such as RAM, or non-volatile memory such as EPROM, flash memory, and the like.
  • the memory (106) may be a local memory or may be located remotely, such as a server, a file server, a data server, and the cloud.
  • the system can also include one or more interface(s) (108).
  • the interface(s) (108) may include a variety of interfaces, for example, interfaces for data input and output devices, referred to as I/O devices, storage devices, and the like.
  • the interface(s) (108) may facilitate communication of the system with various devices coupled to the system.
  • the interface(s) (108) may also provide a communication pathway for one or more components of the system. Examples of such components include, but are not limited to, processing engine(s) (110) and data (122).
  • the processing engine(s) (110) can be implemented as a combination of hardware and software or firmware programming (for example, programmable instructions) to implement one or more functionalities of the engine(s) (110).
  • the programming for the engine(s) (110) may be processor executable instructions stored on a non-transitory machine-readable storage medium and the hardware for the engine(s) (110) may include a processing resource (for example, one or more processors), to execute such instructions.
  • the machine-readable storage medium may store instructions that, when executed by the processing resource, implement the engine(s) (110).
  • the system can include the machine-readable storage medium storing the instructions and the processing resource to execute the instructions, or the machine-readable storage medium may be separate but accessible to the system and the processing resource.
  • the processing engine(s) (110) may be implemented by electronic circuitry.
  • the data (122) can include data that is either stored or generated as a result of functionalities implemented by any of the components of the processing engine(s) (110).
  • the processing engine(s) (110) can include a data indexer engine (112), a data locality engine (114), a local search engine (116) and supplementary engine(s) (120).
  • Supplementary engine(s) (120) can implement functionalities that supplement applications or functions performed by the system or processing engine(s) (110).
  • the data indexer engine (112) can index each partition of the plurality of partitions by examining contents of each partition and extracting any or a combination of a minimum value and a maximum value of the contents as an index value.
  • the data indexer engine (112), when the data type of a partition is numeric, stores the minimum numeric value and the maximum numeric value as index.
  • the data indexer engine (112), when the data type of the contents stored in the partition is string, stores the minimum string prefix value, maximum string prefix value, minimum string length and maximum string length as index.
  • the data locality engine (114), in response to a search query, can locate at least one target partition of the plurality of partitions that stores a target data, by setting a data locality for each partition based on comparing a search attribute with any or a combination of the minimum value and the maximum value of the contents stored as the index.
  • the search attribute can be a search value range in case of numeric data type or a search string length range and a search string prefix in case of string data type. Exemplary working of the data locality engine (114) is explained with reference to FIGs. 5, 6A-B and 7A-C.
  • the local search engine (116) can search the target data in the target partition based on the data locality of each partition. Exemplary working of the local search engine (116) is explained with reference to FIG. 8.
  • FIGs. 2A-B illustrate exemplary implementations of the proposed system in accordance with an embodiment of the present disclosure.
  • the data indexer engine (112) indexes the partitions of the data repository (118) by examining contents of each partition and extracting minimum and maximum values of the contents of the partition as index. For example, if the data type of the contents is numeric, a minimum numeric value and a maximum numeric value among the contents are extracted as index. Further, if the data type of the contents is string, a minimum string prefix value, a maximum string prefix value, a minimum string length and a maximum string length are extracted as index.
  • the data locality engine (114) receives a search request pertaining to a search query (220).
  • the data locality engine (114) examines the partitions of the data repository (118) to find target partitions that store the target data.
  • the data locality engine (114) compares a search attribute with any or a combination of the minimum value and the maximum value of the contents and then sets the data locality for each partition as ‘all’, ‘partial’ or ‘none’ based on the comparison.
  • the local search engine (116) performs local search on the target partitions.
  • the local search engine (116) examines the data locality of each partition and retrieves all data from the partition when the data locality for the partition is set as ‘all’. In case the data locality for the partition is set as ‘partial’, the local search engine (116) searches the target data against the search query to retrieve match data from the partition. Further, the local search engine (116) does not search the partition associated with data locality as ‘none’. Finally, at step (210), the local search engine (116) returns the search results.
  • the data locality engine (114) receives a search request pertaining to the search query (220). As illustrated in block (254), based on the search query, the data locality engine (114) determines a search value range and compares the search value range with the minimum and maximum values of the contents of each partition, which are stored as index, to determine the data locality of each partition.
  • the data locality engine (114) locates relevant partitions by performing a data locality check on each partition to set data locality as ‘partial’, ‘all’ or ‘none’.
  • the partitions with data locality as ‘all’ or ‘partial’ can be considered as target/relevant partitions, wherein the local search engine (116) retrieves all data from the partition when the data locality for the partition is set as ‘all’. If the data locality for the partition is set as ‘partial’, the local search engine (116) searches the target data against the search query to retrieve match data from the partition. The partitions with data locality set as ‘none’ are omitted from searching.
  • the local search engine (116) performs searching in only relevant partitions and at step (262), the local search engine (116) returns the search result.
  • FIG. 3 is a flow diagram (300) representing working of the proposed system in accordance with an embodiment of the present disclosure.
  • each partition of the plurality of partitions can be indexed by examining contents of each partition and extracting a minimum value and a maximum value of the contents.
  • the extracted minimum value and the extracted maximum value can be stored as index of the partition.
  • the data locality of the partition can be set based on comparing a search attribute with the minimum value and the maximum value of the contents stored as the index.
  • the target data can be searched in the at least one target partition based on the data locality of each partition.
  • FIG. 4 is a flow diagram (400) representing working of the data indexer engine in accordance with an embodiment of the present disclosure.
  • a partitioned table on which indexing is to be performed is selected.
  • the partitioned table contains a plurality of partitions.
  • a partition of the partitioned table is selected to determine data type of the contents of the partition.
  • numeric data in the partition is examined so that at block (410), a maximum numeric value and a minimum numeric value from the numeric data are found.
  • the maximum numeric value and the minimum numeric value are stored as index of the partition in an index file.
  • when the data type of the contents of the partition is string, a string prefix of specified length and the length of the string are extracted for a plurality of records of the partition.
  • a maximum string prefix and a minimum string prefix from the plurality of records are determined in lexical order along with a maximum string length and a minimum string length.
  • any or a combination of the maximum string prefix, the minimum string prefix, the maximum string length and the minimum string length are stored as index of the partition.
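The indexing flow of FIG. 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: a partition is modelled as an in-memory list of values of one column, the class names `NumericIndex` and `StringIndex`, the function `index_partition` and the default prefix length of 3 are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass(frozen=True)
class NumericIndex:
    min_value: float
    max_value: float


@dataclass(frozen=True)
class StringIndex:
    min_prefix: str
    max_prefix: str
    min_length: int
    max_length: int


def index_partition(
    records: Sequence[Union[int, float, str]], prefix_len: int = 3
) -> Union[NumericIndex, StringIndex]:
    """Build a min/max index for one partition based on its data type."""
    if not records:
        raise ValueError("cannot index an empty partition")
    if all(isinstance(r, (int, float)) for r in records):
        # Numeric contents: keep only the smallest and largest value.
        return NumericIndex(min(records), max(records))
    if all(isinstance(r, str) for r in records):
        # String contents: keep min/max prefix (lexical order) and min/max length.
        prefixes = [r[:prefix_len] for r in records]
        lengths = [len(r) for r in records]
        return StringIndex(min(prefixes), max(prefixes), min(lengths), max(lengths))
    raise TypeError("partition contents must be all numeric or all string")


numeric_idx = index_partition([31, 25, 38])
string_idx = index_partition(["delta", "delhi", "delaware"])
```

Whatever the number of records in the partition, the index holds at most four values, which is consistent with the stated aim of keeping the index small.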
  • FIG. 5 is a flow diagram (500) representing working of the data locality engine in accordance with an embodiment of the present disclosure.
  • a partition is selected to perform data locality check for setting data locality of the partition such that target data pertaining to a search query can be located based on the data locality.
  • the results of data locality check from blocks (508) and (506) are collected so that, at block (512), the results can be provided to the local search engine.
  • FIG. 6A is a flow diagram (600) representing working of the data locality engine when data type of the partition is numeric in accordance with an embodiment of the present disclosure.
  • index of the partition is examined to extract, at block (604), the minimum numeric value and the maximum numeric value of the partition from the index.
  • a search value range is examined pertaining to the target data based on the search query.
  • data locality of the partition is set as ‘all’ when, at block (608), the minimum numeric value and the maximum numeric value are determined to be within the search value range.
  • otherwise, it is determined whether the search value range is within the minimum numeric value and the maximum numeric value.
  • data locality of the partition is set as ‘partial’ when the minimum numeric value and the maximum numeric value are not within the search value range and the search value range is within the minimum numeric value and the maximum numeric value. Further, when neither the minimum numeric value and the maximum numeric value are within the search value range nor the search value range is within the minimum numeric value and the maximum numeric value, at block (616), it is determined whether any of the minimum numeric value or the maximum numeric value is within the search value range so that, at block (614), data locality of the partition is set as ‘partial’ when any of the minimum numeric value or the maximum numeric value is within the search value range.
  • otherwise, data locality of the partition is set as ‘none’. Further, at block (620), it is determined whether another partition is pending for data locality check so that, when another partition is pending, the process continues from block (602), otherwise the process ends.
  • FIG. 6B illustrates various examples of setting data locality for a partition containing numeric data in accordance with an embodiment of the present disclosure.
  • a search value range of 20 to 40 is considered as pertaining to a search query.
  • data locality of a partition is set as ‘all’, as the minimum and maximum values of the contents of the partition are within the search value range.
  • data locality of a partition is set as ‘partial’, as the search value range is within the minimum and maximum value range.
  • data locality of a partition is set as ‘none’, as the minimum and maximum value range does not overlap with the search value range.
  • data locality of a partition is set as ‘partial’, as the maximum value is within the search value range.
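The numeric data locality check can be sketched as a single function that follows the order of the decisions described for FIG. 6A. The search value range of 20 to 40 is taken from the example above; the partition minimum and maximum values below are hypothetical, since the figure's values are not reproduced in the text, and treating all bounds as inclusive is an assumption. The function name `numeric_locality` is also hypothetical.

```python
def numeric_locality(
    part_min: float, part_max: float, search_lo: float, search_hi: float
) -> str:
    """Classify a partition as 'all', 'partial' or 'none' for a numeric range query.

    Assumption: both the partition range and the search range are inclusive.
    """

    def in_search_range(value: float) -> bool:
        return search_lo <= value <= search_hi

    # Every value in the partition lies inside the search range.
    if in_search_range(part_min) and in_search_range(part_max):
        return "all"
    # The search range lies inside the partition range.
    if part_min <= search_lo and search_hi <= part_max:
        return "partial"
    # Only one end of the partition range falls inside the search range.
    if in_search_range(part_min) or in_search_range(part_max):
        return "partial"
    # The two ranges do not overlap.
    return "none"


# Search value range 20..40 against four hypothetical partitions.
results = [
    numeric_locality(25, 35, 20, 40),  # partition inside the search range
    numeric_locality(10, 50, 20, 40),  # search range inside the partition
    numeric_locality(50, 60, 20, 40),  # no overlap
    numeric_locality(10, 30, 20, 40),  # only the maximum is in range
]
# results == ["all", "partial", "none", "partial"]
```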
  • FIG. 7A illustrates an exemplary flow diagram (700) representing working of the data locality engine when data type of the partition is string in accordance with an embodiment of the present disclosure.
  • index of the partition is examined to extract, at block (704), the minimum string prefix, the maximum string prefix along with the minimum string length and the maximum string length of the partition from the index.
  • a search string length range and a search string prefix pertaining to the target data are extracted based on the search query.
  • data locality of the partition is set as ‘none’ when, at block (708), the search string length range does not overlap with the minimum string length and the maximum string length.
  • otherwise, it is determined whether the search string prefix matches with the minimum and maximum prefix.
  • when the search string prefix does not match with the minimum and maximum prefix, it is determined whether the search string prefix is within the lexical range of the minimum string prefix and the maximum string prefix.
  • data locality of the partition is set as ‘partial’ when the search string prefix is within the lexical range of the minimum and the maximum string prefix; otherwise, at block (710), data locality of the partition is set as ‘none’.
  • when the search string prefix matches with the minimum and maximum string prefix, it is determined whether the minimum string length and the maximum string length are within the search string length range.
  • data locality of the partition is set as ‘partial’ when the maximum string length and the minimum string length are not within the search string length range. Further, when the maximum string length and the minimum string length are within the search string length range, at block (718), it is determined whether the search string prefix contains a wildcard suffix such that at block (722) data locality of the partition is set as ‘all’ when the search string prefix contains the wildcard suffix; otherwise the data locality of the partition is set as ‘partial’. At block (724), it is determined whether another partition is available for data locality check such that the process is repeated from block (702), otherwise the process ends.
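The string data locality check of FIG. 7A can be sketched in the same way. Several details here are assumptions rather than statements of the disclosure: the search prefix is assumed to be no longer than the indexed prefix length (longer prefixes are rejected), "matches" is interpreted as equality with both index prefixes truncated to the search prefix length, and a wildcard query such as `'del%'` is modelled with a length range from the prefix length to infinity. The names `StringIndex` and `string_locality` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class StringIndex:
    min_prefix: str
    max_prefix: str
    min_length: int
    max_length: int
    prefix_len: int = 3  # assumed prefix length used when the index was built


def string_locality(
    idx: StringIndex,
    search_prefix: str,
    len_lo: float,
    len_hi: float,
    has_wildcard: bool,
) -> str:
    """Classify a partition as 'all', 'partial' or 'none' for a string query."""
    if len(search_prefix) > idx.prefix_len:
        raise ValueError("search prefix longer than the indexed prefix length")
    # No overlap between the search length range and the partition length range.
    if len_hi < idx.min_length or len_lo > idx.max_length:
        return "none"
    n = len(search_prefix)
    lo, hi = idx.min_prefix[:n], idx.max_prefix[:n]
    if lo == search_prefix == hi:
        # Prefix matches both ends, so every record shares the search prefix.
        lengths_covered = len_lo <= idx.min_length and idx.max_length <= len_hi
        if not lengths_covered:
            return "partial"
        return "all" if has_wildcard else "partial"
    # No exact match: the prefix may still fall inside the lexical range.
    if lo <= search_prefix <= hi:
        return "partial"
    return "none"


# Query: name like 'del%' (prefix "del", any length of 3 or more, wildcard suffix).
same_prefix = string_locality(StringIndex("del", "del", 5, 8), "del", 3, math.inf, True)
in_range = string_locality(StringIndex("dag", "dep", 3, 9), "del", 3, math.inf, True)
out_of_range = string_locality(StringIndex("eel", "fig", 3, 9), "del", 3, math.inf, True)
```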
  • FIGs. 7B-C illustrate various examples of setting data locality for a partition containing string data in accordance with an embodiment of the present disclosure.
  • a search query “name like ‘del%’” is considered. Therefore, the search prefix is considered as “del”.
  • when the search prefix matches the minimum prefix and the maximum prefix of the partition, the data locality of the partition is set as ‘all’.
  • the data locality of the partition is set as ‘partial’.
  • when the search prefix is not within the lexical range of the minimum prefix and the maximum prefix of the partition, the data locality of the partition is set as ‘none’.
  • the search prefix is considered as “delt”.
  • when the search prefix matches the minimum prefix and the maximum prefix of the partition but the minimum and maximum string lengths are not within the search string length range, the data locality of the partition is set as ‘partial’.
  • when the search prefix is within the lexical range of the minimum prefix and the maximum prefix of the partition, the data locality of the partition is set as ‘partial’.
  • when the search prefix is not within the lexical range of the minimum prefix and the maximum prefix of the partition, the data locality of the partition is set as ‘none’.
  • FIG. 8 is a flow diagram (800) representing working of the locality search engine in accordance with an embodiment of the present disclosure.
  • data locality check for each partition is examined.
  • all data from the partition is retrieved when, at block (804), the data locality of the partition is determined to be ‘all’.
  • searching is performed in the partition when, at block (808), it is determined that the data locality of said partition is ‘partial’. Further, at block (814), matched data from the partition is retrieved.
  • the partition with data locality as ‘none’ is not considered for retrieving data; thus, at block (810), when data locality of the partition is neither ‘all’ nor ‘partial’, the partition is skipped.
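The dispatch performed by the local search engine in FIG. 8 can be sketched as below. This is an illustration under assumptions: partitions are modelled as in-memory lists keyed by a hypothetical partition name, the localities are assumed to have been set beforehand by the data locality check, and the function name `local_search` and the sample data are hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def local_search(
    partitions: Dict[str, Sequence[float]],
    localities: Dict[str, str],
    predicate: Callable[[float], bool],
) -> List[float]:
    """Retrieve matching data using the data locality set for each partition."""
    results: List[float] = []
    for name, records in partitions.items():
        locality = localities[name]
        if locality == "all":
            # Every record matches, so no per-record comparison is needed.
            results.extend(records)
        elif locality == "partial":
            # Only some records match: test each record against the query.
            results.extend(r for r in records if predicate(r))
        # Locality 'none': the partition is skipped and never read.
    return results


partitions = {"p1": [25, 30, 35], "p2": [10, 30, 50], "p3": [50, 60]}
localities = {"p1": "all", "p2": "partial", "p3": "none"}
matches = local_search(partitions, localities, lambda v: 20 <= v <= 40)
# matches == [25, 30, 35, 30]
```

In this sketch only partition `p2` is scanned record by record, which is where the reduction in search work comes from.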
  • FIG. 9 illustrates an exemplary computer system (900) in which or with which embodiments of the present invention can be utilized in accordance with embodiments of the present invention.
  • computer system (900) which may represent the proposed system or data manager (102) can include an external storage device (910), a bus (920), a main memory (930), a read only memory (940), a mass storage device (950), communication port (960), and a processor (970).
  • Examples of processor (970) include, but are not limited to, an Intel® Itanium® or Itanium 2 processor(s), or AMD® Opteron® or Athlon MP® processor(s), Motorola® lines of processors, FortiSOC™ system on a chip processors or other future processors.
  • Processor (970) may include various modules associated with embodiments of the present invention.
  • Communication port (960) can be any of an RS-232 port for use with a modem-based dialup connection, a 10/100 Ethernet port, a Gigabit or 10 Gigabit port using copper or fiber, a serial port, a parallel port, or other existing or future ports.
  • Communication port (960) may be chosen depending on a network, such as a Local Area Network (LAN), Wide Area Network (WAN), or any network to which computer system connects.
  • Memory (930) can be Random Access Memory (RAM), or any other dynamic storage device commonly known in the art.
  • Read only memory (940) can be any static storage device(s), e.g., but not limited to, Programmable Read Only Memory (PROM) chips for storing static information, e.g., start-up or BIOS instructions for processor (970).
  • Mass storage (950) may be any current or future mass storage solution, which can be used to store information and/or instructions. Exemplary mass storage solutions include, but are not limited to, Parallel Advanced Technology Attachment (PATA) or Serial Advanced Technology Attachment (SATA) hard disk drives or solid-state drives (internal or external, e.g., having Universal Serial Bus (USB) and/or Firewire interfaces), e.g. those available from Seagate (e.g., the Seagate Barracuda 7200 family) or Hitachi (e.g., the Hitachi Deskstar 7K1000), one or more optical discs, Redundant Array of Independent Disks (RAID) storage, e.g. an array of disks (e.g., SATA arrays), available from various vendors including Dot Hill Systems Corp., LaCie, Nexsan Technologies, Inc. and Enhance Technology, Inc.
  • Bus (920) communicatively couples processor(s) (970) with the other memory, storage and communication blocks.
  • Bus (920) can be, e.g. a Peripheral Component Interconnect (PCI) / PCI Extended (PCI-X) bus, Small Computer System Interface (SCSI), USB or the like, for connecting expansion cards, drives and other subsystems as well as other buses, such as a front side bus (FSB), which connects processor (970) to software system.
  • operator and administrative interfaces, e.g. a display, keyboard, and a cursor control device, may also be coupled to bus (920) to support direct operator interaction with computer system.
  • Other operator and administrative interfaces can be provided through network connections connected through communication port (960).
  • External storage device (910) can be any kind of external hard-drives, floppy drives, IOMEGA® Zip Drives, Compact Disc - Read Only Memory (CD-ROM), Compact Disc - Re-Writable (CD-RW), Digital Video Disk - Read Only Memory (DVD-ROM).
  • Embodiments of the present invention may be provided as a computer program product, which may include a machine-readable storage medium tangibly embodying thereon instructions, which may be used to program a computer (or other electronic devices) to perform a process.
  • the machine-readable medium may include, but is not limited to, fixed (hard) drives, magnetic tape, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), and magneto-optical disks, semiconductor memories, such as ROMs, PROMs, random access memories (RAMs), programmable read-only memories (PROMs), erasable PROMs (EPROMs), electrically erasable PROMs (EEPROMs), flash memory, magnetic or optical cards, or other type of media/machine-readable medium suitable for storing electronic instructions (e.g., computer programming code, such as software or firmware).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A system for managing configuration of a storage medium is disclosed. The system (100) comprises a data manager (102) that is operatively coupled with a computing device, the data manager (102) comprising: a data indexer engine (110) for indexing each partition of a plurality of partitions by examining contents of each partition and extracting any or a combination of a minimum value and a maximum value of the contents as an index; a data locality engine (112) for locating, in response to a search query, at least one target partition; and a local search engine (114) for searching the target data in the at least one target partition.

Description

MANAGEMENT OF DATA FOR CONTENT BASED DATA LOCALITY SEARCH
FIELD OF THE DISCLOSURE
The present disclosure relates to the field of management and storage of data in a storage medium, in particular to a system and method for managing configuration of a storage medium.
BACKGROUND
As various enterprises rely on storing large amounts of data, data locating services have become a critical element in the event of a catastrophic occurrence or for processing extremely large amounts of data. Conventional techniques process the contents of each data file available on a storage location or a database and search the content in response to a search query. Later variations allowed searching for variants of a search term using address-based data locating techniques. These techniques were somewhat effective at finding a small number of search terms in a small group of data, but lacked the performance required to search a large volume of data or to search a large number of search terms in a reasonable amount of time. Overcoming these bottlenecks is a complex task because current approaches, such as hashing, B-tree indexing, etc., use an address-based search technique for data locality positioning. As the data stored on the database grows, the addresses are liable to grow even larger, which demands higher computational power during processing of a search query for content stored on the database and reduces processing speed in searching for the content, as a large chunk of data needs to be sorted and/or indexed. Conventional techniques employ partitioning of the data to be searched and rely on creating local indices to improve data searching. However, this is still an address-based technique, and addresses grow larger and harder to maintain with an increasing number of files and data to be searched.
Efforts have been made in the past to overcome the aforesaid limitations associated with the pertinent art. For instance, United States Patent Number US 8,572,091 B1 provides a system and method for implementing a scalable data storage service that maintains tables in a non-relational data. The cited reference discloses partitioning and indexing of items stored in the tables according to a respective primary key that consists of a hash key component and a range key component. The system determines whether a range key attribute of a query is completely, partially or not within range of each partitioned item. However, partitioning and indexing is based on comparison of the range key attribute of the query with the range key component of each partitioned item, and categorization of each matching partition based on relevancy of the corresponding partition with respect to the query is not taught or even suggested. There is therefore a need in the art for an efficient data partitioning and indexing management system that significantly improves locating data among file partitions and reduces the amount of data to be searched with respect to a search query.
SUMMARY
The present disclosure proposes a system and method for managing a storage medium for locating data locations among a plurality of partitions of the storage medium. In accordance with this disclosure, the storage medium can be managed using a data manager operatively coupled with a computing device. An aspect of the present disclosure relates to a system comprising a data manager that is operatively coupled with a computing device, characterized in that the data manager enables management of a storage medium for locating data locations among a plurality of partitions of the storage medium by means of: a data indexer engine for indexing each partition of the plurality of partitions by examining contents of each partition and extracting any or a combination of a minimum value and a maximum value of the contents as an index; a data locality engine for locating, in response to a search query, at least one target partition of the plurality of partitions that stores a target data, by setting a data locality for each partition based on comparing a search attribute with any or a combination of the minimum value and the maximum value of the contents stored as the index; and a local search engine for searching the target data in the at least one target partition based on the data locality of each partition.
According to an embodiment, the indexing further comprises: selecting a partition of the plurality of partitions to determine data type of the contents of said partition; examining, when the determined data type of the contents is numeric, numeric data in the partition to find a maximum numeric value and a minimum numeric value of the numeric data and storing the maximum numeric value and the minimum numeric value as index; and extracting, when the determined data type of the contents is string, length of the string and a string prefix for plurality of records of the partition to determine a maximum string prefix and a minimum string prefix in lexical order and a maximum string length and a minimum string length, and storing any or a combination of the determined maximum string prefix, the determined minimum string prefix, the determined maximum string length and the determined minimum string length as index.
According to an embodiment, data locality for each partition is set as any of ‘all’, ‘partial’ or ‘none’ based on said comparing.
According to an embodiment, the searching further comprises: examining data locality of each partition; retrieving, when data locality for the partition is set as ‘all’, all data from said partition; and searching, when data locality for the partition is set as ‘partial’, the target data against the search query to retrieve match data from said partition.
According to an embodiment, the target partition comprises the partition associated with data locality as ‘all’ or ‘partial’. According to an embodiment, the searching is not performed in the partition associated with data locality as ‘none’.
According to an embodiment, the locating further comprises: selecting a partition of the plurality of partitions to perform data locality check on said partition; determining data type of the contents of said partition; and performing the data locality check based on the determined data type.
According to an embodiment, when data type of the partition is numeric, the data locality check comprises: extracting the minimum numeric value and the maximum numeric value of the partition from the index; examining a search value range pertaining to the target data based on the search query; setting data locality of the partition as ‘all’ when the minimum numeric value and the maximum numeric value are within the search value range; setting data locality of the partition as ‘partial’ when the minimum numeric value and the maximum numeric value are not within the search value range and the search value range is within the minimum numeric value and the maximum numeric value; and setting data locality of the partition as ‘partial’ when any of the minimum numeric value or the maximum numeric value is within the search value range, else setting data locality of the partition as ‘none’.
According to an embodiment, when the data type of the partition is string, said data locality check comprises: extracting any or a combination of the minimum string length and the maximum string length and the minimum string prefix and the maximum string prefix of the partition from the index; extracting a search string length range and a search string prefix pertaining to the target data based on the search query; setting data locality of the partition as ‘none’ when the search string length range does not overlap with the minimum string length and the maximum string length; determining whether the search string prefix matches with the minimum and maximum string prefix when the search string length range overlaps with the minimum string length and the maximum string length; in response to said determining being negative, setting data locality of the partition as ‘partial’ when the search string prefix is within the minimum and the maximum string prefix, else setting data locality of the partition as ‘none’; and in response to said determining being affirmative, setting data locality of the partition as ‘partial’ when the minimum string length and the maximum string length are not within the search string length range, else: setting data locality of the partition as ‘partial’ when the search string prefix contains a wildcard suffix, else setting the data locality of the partition as ‘all’.
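The string data locality check described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function name `string_locality`, its parameters and the boolean `has_wildcard_suffix` flag are hypothetical, "matches with the minimum and maximum string prefix" is read as the search prefix being equal to both indexed prefixes, and lexical order is taken to be Python's default string ordering.

```python
def string_locality(min_prefix, max_prefix, min_len, max_len,
                    search_prefix, search_min_len, search_max_len,
                    has_wildcard_suffix=False):
    """Return 'all', 'partial' or 'none' for one string partition.

    All names here are illustrative assumptions; the index is assumed to
    hold the partition's min/max prefix and min/max string length.
    """
    # Length ranges do not overlap: no record can match.
    if search_max_len < min_len or search_min_len > max_len:
        return "none"

    # Assumed reading of "matches": the search prefix equals both the
    # minimum and the maximum indexed prefix.
    if search_prefix == min_prefix == max_prefix:
        lengths_covered = (search_min_len <= min_len
                           and max_len <= search_max_len)
        if not lengths_covered:
            return "partial"
        return "partial" if has_wildcard_suffix else "all"

    # Prefix does not match: keep the partition only if the search prefix
    # falls inside the partition's lexical prefix range.
    if min_prefix <= search_prefix <= max_prefix:
        return "partial"
    return "none"
```

For instance, under these assumptions a partition indexed with prefixes "app" to "fig" and lengths 3 to 6 would be labelled ‘partial’ for the search prefix "ban" and ‘none’ for "zeb".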
Another aspect of the present disclosure relates to a method for managing configuration of a storage medium for locating data locations among a plurality of partitions of the storage medium, characterized in that the method comprises the steps of: configuring a data manager that is operatively coupled with a computing device, wherein the data manager performs the steps of: indexing each partition of the plurality of partitions by examining contents of each partition and extracting any or a combination of a minimum value and a maximum value of the contents as an index; locating, in response to a search query, at least one target partition of the plurality of partitions that stores a target data, by setting a data locality for each partition based on comparing a search attribute with any or a combination of the minimum value and the maximum value of the contents stored as the index; and searching the target data in the at least one target partition based on the data locality of each partition.
Various objects, features, aspects and advantages of the inventive subject matter will become more apparent from the following detailed description of preferred embodiments, along with the accompanying drawing figures in which like numerals represent like components.
BRIEF DESCRIPTION OF THE ACCOMPANYING DRAWINGS
In the figures, similar components and/or features may have the same reference label. Further, various components of the same type may be distinguished by following the reference label with a second label that distinguishes among the similar components. If only the first reference label is used in the specification, the description is applicable to any one of the similar components having the same first reference label irrespective of the second reference label.
FIG. 1 illustrates an exemplary network architecture in which or with which proposed system can be implemented in accordance with an embodiment of the present disclosure.
FIGs. 2A-B illustrate exemplary implementations of the proposed system in accordance with an embodiment of the present disclosure.
FIG. 3 is a flow diagram representing working of the proposed system in accordance with an embodiment of the present disclosure.
FIG. 4 is a flow diagram representing working of the data indexer engine in accordance with an embodiment of the present disclosure.
FIG. 5 is a flow diagram representing working of the data locality engine in accordance with an embodiment of the present disclosure.
FIGs. 6A-B illustrate flow diagram and exemplary representations of data locality check when data type of the partition is numeric in accordance with an embodiment of the present disclosure.
FIGs. 7A-C illustrate flow diagram and exemplary representations of data locality check when data type of the partition is string in accordance with an embodiment of the present disclosure.
FIG. 8 is a flow diagram representing working of the locality search engine in accordance with an embodiment of the present disclosure.
FIG. 9 illustrates an exemplary computer system in which or with which embodiments of the present invention can be utilized in accordance with embodiments of the present invention.
DETAILED DESCRIPTION
In accordance with the present disclosure, there is provided a system and a method for managing configuration of a storage medium for locating data locations among a plurality of partitions of the storage medium, which will now be described with reference to the embodiment shown in the accompanying drawings. The embodiment does not limit the scope and ambit of the disclosure. The description relates purely to the exemplary embodiment and its suggested applications.
The embodiment herein and the various features and advantageous details thereof are explained with reference to the non-limiting embodiment in the following description. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments herein. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiment herein may be practiced and to further enable those of skill in the art to practice the embodiment herein. Accordingly, the description should not be construed as limiting the scope of the embodiment herein.
The description hereinafter of the specific embodiment will so fully reveal the general nature of the embodiments herein that others can, by applying current knowledge, readily modify and/or adapt such specific embodiment for various applications without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation.
Various terms as used herein are defined below. To the extent a term used in a claim is not defined below, it should be given the broadest definition persons in the pertinent art have given that term as reflected in printed publications and issued patents at the time of filing.
Definitions:
Storage Medium: Any computing hardware that is used for storing, porting and extracting data files and objects. It can hold and store information both temporarily and permanently, and can be internal or external to a computer, server or any similar computing device.
Partition: A separate region created by partitioning of the storage media into one or more regions, so that an operating system can manage information in each region separately.
The present disclosure relates to a system that comprises a data manager. The data manager enables management of a storage medium for locating data locations among a plurality of partitions of the storage medium. The data manager can include a data indexer engine for indexing each partition of the plurality of partitions by examining contents of each partition and extracting any or a combination of a minimum value and a maximum value of the contents as an index. Further, the data manager can include a data locality engine for locating, in response to a search query, at least one target partition of the plurality of partitions that stores a target data, by setting a data locality for each partition based on comparing a search attribute with any or a combination of the minimum value and the maximum value of the contents stored as the index, and a local search engine for searching the target data in the at least one target partition based on the data locality of each partition. Those skilled in the art would appreciate that various embodiments of the present disclosure enable content-based data locality search that provides fast data search by using content information as a heuristic measure to locate data locality for highly distinctive content. Embodiments herein improve searching time by locating the data among target partitions such that fewer partitions are accessed and a minimal index file size is maintained.
Referring to the accompanying drawings, FIG. 1 illustrates an exemplary architecture (100) in which or with which proposed system can be implemented in accordance with an embodiment of the present disclosure.
As illustrated, in the architecture (100), the proposed system can include a data manager (102) that can be implemented in a computing device. The data manager (102) can enable management of a storage medium such as a data repository (118) for locating data locations among a plurality of partitions of the data repository (118). The computing device can be any device using any or a combination of hardware components and software components, such as a security device, a network device, a mobile phone, a desktop computer, a personal computer, a laptop, a tablet PC, a portable computer, a personal digital assistant and the like, such that a user can interact with the data manager (102) using the computing device.
Those skilled in the art would appreciate that partitioning addresses a key problem of supporting very large tables and indexes by allowing the contents of the tables to be decomposed into smaller and more manageable units called partitions. Generally, during partitioning, an object is used to define how rows (or index) of a partitioned table are mapped to a set of partitions based on values of a certain column, called a partitioned column. Further, the number of partitions that the table will have and the boundaries of the partitions are also defined. According to various embodiments of the present disclosure, the data manager (102) aims to enable fast searching by locating data location among partitions and performing searching on fewer partitions, which reduces total disk Input/Output time and overall searching time while maintaining minimal size of the index.
As illustrated, a system, which may comprise a data manager (102), can include one or more processor(s) (104). The processor(s) (104) can be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, logic circuitries, and/or any devices that manipulate data based on operational instructions. Among other capabilities, the processor(s) (104) are configured to fetch and execute computer-readable instructions stored in a memory (106) of the system. The memory (106) can store one or more computer-readable instructions or routines, which may be fetched and executed to create or share the data units over a network service. The memory (106) can include any non-transitory storage device including, for example, volatile memory such as RAM, or non-volatile memory such as EPROM, flash memory, and the like. In an example embodiment, the memory (106) may be a local memory or may be located remotely, such as a server, a file server, a data server, and the cloud.
The system can also include one or more interface(s) (108). The interface(s) (108) may include a variety of interfaces, for example, interfaces for data input and output devices, referred to as I/O devices, storage devices, and the like. The interface(s) (108) may facilitate communication of the system with various devices coupled to the system. The interface(s) (108) may also provide a communication pathway for one or more components of the system. Examples of such components include, but are not limited to, processing engine(s) (110) and data (122).
The processing engine(s) (110) can be implemented as a combination of hardware and software or firmware programming (for example, programmable instructions) to implement one or more functionalities of the engine(s) (110). In the examples described herein, such combinations of hardware and software or firmware programming may be implemented in several different ways. For example, the programming for the engine(s) (110) may be processor executable instructions stored on a non-transitory machine-readable storage medium and the hardware for the engine(s) (110) may include a processing resource (for example, one or more processors), to execute such instructions. In the examples, the machine-readable storage medium may store instructions that, when executed by the processing resource, implement the engine(s) (110). In such examples, the system can include the machine-readable storage medium storing the instructions and the processing resource to execute the instructions, or the machine-readable storage medium may be separate but accessible to the system and the processing resource. In other examples, the processing engine(s) (110) may be implemented by electronic circuitry. The data (122) can include data that is either stored or generated as a result of functionalities implemented by any of the components of the processing engine(s) (110).
In an example, the processing engine(s) (110) can include a data indexer engine (112), a data locality engine (114), a local search engine (116) and supplementary engine(s) (120). Supplementary engine(s) (120) can implement functionalities that supplement applications or functions performed by system or processing engine(s) (110).
According to an embodiment, the data indexer engine (112) can index each partition of the plurality of partitions by examining contents of each partition and extracting any or a combination of a minimum value and a maximum value of the contents as an index value. In an example, when data type of a partition is numeric, the data indexer engine (112) stores minimum numeric value and maximum numeric value as index. In another example, when data type of the contents stored in the partition is string, the data indexer engine (112) stores minimum string prefix value, maximum string prefix value, minimum string length and maximum string length as index. Working of the data indexer engine (112) is explained with reference to FIG. 4.
According to an embodiment, in response to a search query, the data locality engine (114) can locate at least one target partition of the plurality of partitions that stores a target data, by setting a data locality for each partition based on comparing a search attribute with any or a combination of the minimum value and the maximum value of the contents stored as the index. The search attribute can be a search value range in case of numeric data type or a search string length range and a search string prefix in case of string data type. Exemplary working of the data locality engine (114) is explained with reference to FIGs. 5, 6A-B and 7A-C.
According to an embodiment, the local search engine (116) can search the target data in the target partition based on the data locality of each partition. Exemplary working of the local search engine (116) is explained with reference to FIG. 8.
FIGs. 2A-B illustrate exemplary implementations of the proposed system in accordance with an embodiment of the present disclosure.
Referring to FIG. 2A, at step (202), the data indexer engine (112) indexes the partitions of the data repository (118) by examining contents of each partition and extracting minimum and maximum values of the contents of the partition as index. For example, if a data type of the contents is numeric, a minimum numeric value and a maximum numeric value among the contents are extracted as index. Further, if a data type of the contents is string, a minimum string prefix value, a maximum string prefix value, a minimum string length and a maximum string length are extracted as index. At step (208), the data locality engine (114) receives a search request pertaining to a search query (220). At step (204), the data locality engine (114) examines the partitions of the data repository (118) to find target partitions that store a target data. The data locality engine (114) compares a search attribute with any or a combination of the minimum value and the maximum value of the contents and then sets the data locality for each partition as ‘all’, ‘partial’ or ‘none’ based on the comparison.
At step (206), the local search engine (116) performs local search on the target partitions. The local search engine (116) examines the data locality of each partition and retrieves all data from the partition when the data locality for the partition is set as ‘all’. In case the data locality for the partition is set as ‘partial’, the local search engine (116) searches the target data against the search query to retrieve match data from the partition. Further, the local search engine (116) does not search the partition associated with data locality as ‘none’. Finally, at step (210), the local search engine (116) returns the search results.
Referring to FIG. 2B, consider a search query (220) pertaining to income > 3500 that is to be performed on the data repository (118). Firstly, all partitions of the data repository (118) are indexed by the data indexer engine (112) based on the minimum value and the maximum value of the contents of each partition. At step (260), the data locality engine (114) receives a search request pertaining to the search query (220). As illustrated in block (254), based on the search query, the data locality engine (114) determines a search value range and compares the search value range with the minimum and maximum values of the contents of each partition, which are stored as index, to determine the data locality of each partition. Thus, at step (256), the data locality engine (114) locates relevant partitions by performing data locality check on each partition to set data locality as ‘partial’, ‘all’ or ‘none’. The partitions with data locality as ‘all’ or ‘partial’ can be considered as target/relevant partitions, wherein the local search engine (116) retrieves all data from the partition when the data locality for the partition is set as ‘all’. If the data locality for the partition is set as ‘partial’, the local search engine (116) searches the target data against the search query to retrieve match data from the partition. The partitions with data locality set as ‘none’ are omitted from searching. At step (258), the local search engine (116) performs searching in only relevant partitions and at step (262), the local search engine (116) returns the search result.
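As an illustration of the FIG. 2B walkthrough, the following Python sketch runs an income > 3500 query over three partitions. The partition contents, the dictionary layout and the function names are made-up assumptions for illustration only; they do not appear in the disclosure. For an open-ended "greater than" query the search value range is treated as everything above the threshold, so a partition is ‘all’ when its indexed minimum exceeds the threshold, ‘partial’ when only its indexed maximum does, and ‘none’ otherwise.

```python
# Hypothetical sample data: each partition carries its stored min/max index.
PARTITIONS = [
    {"min": 1000, "max": 3000, "rows": [1000, 2200, 3000]},
    {"min": 2800, "max": 5200, "rows": [2800, 3600, 5200]},
    {"min": 4000, "max": 9000, "rows": [4000, 7500, 9000]},
]


def locality_for_greater_than(part_min, part_max, threshold):
    """Data locality of one partition for the query `value > threshold`."""
    if part_min > threshold:
        return "all"       # every record satisfies the query
    if part_max > threshold:
        return "partial"   # only some records can satisfy the query
    return "none"          # no record can satisfy the query


def search_greater_than(partitions, threshold):
    """Return (matching rows, number of rows compared one by one)."""
    results, rows_compared = [], 0
    for part in partitions:
        locality = locality_for_greater_than(
            part["min"], part["max"], threshold)
        if locality == "all":
            results.extend(part["rows"])  # retrieved without comparison
        elif locality == "partial":
            rows_compared += len(part["rows"])
            results.extend(r for r in part["rows"] if r > threshold)
        # 'none': the partition is skipped entirely
    return results, rows_compared


results, rows_compared = search_greater_than(PARTITIONS, 3500)
```

With this sample data only the second partition is searched row by row; the first is skipped and the third is returned whole.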
FIG. 3 is a flow diagram (300) representing working of the proposed system in accordance with an embodiment of the present disclosure. According to an embodiment, at block (302), each partition of the plurality of partitions can be indexed by examining contents of each partition and extracting a minimum value and a maximum value of the contents. The extracted minimum value and the extracted maximum value can be stored as index of the partition.
According to an embodiment, at block (304), in response to a search query, at least one target partition of the plurality of partitions that stores a target data can be located, by setting a data locality for each partition. The data locality of the partition can be set based on comparing a search attribute with the minimum value and the maximum value of the contents stored as the index.
According to an embodiment, at block (306), the target data can be searched in the at least one target partition based on the data locality of each partition.
FIG. 4 is a flow diagram (400) representing working of the data indexer engine in accordance with an embodiment of the present disclosure.
According to an embodiment, at block (402), a partitioned table on which indexing is to be performed is selected. The partitioned table contains a plurality of partitions. At block (404), a partition of the partitioned table is selected to determine data type of the contents of the partition. At block (406), it is determined whether the data type is numeric. When the data type is numeric, at block (408), numeric data in the partition is examined so that, at block (410), a maximum numeric value and a minimum numeric value from the numeric data are found. At block (412), the maximum numeric value and the minimum numeric value are stored as index of the partition in an index file. Conversely, when the data type of the contents of the partition is string, at block (414), a string prefix of specified length and the length of the string for a plurality of records of the partition are extracted. Further, at block (416), a maximum string prefix and a minimum string prefix from the plurality of records are determined in lexical order along with a maximum string length and a minimum string length. At block (418), any or a combination of the maximum string prefix, the minimum string prefix, the maximum string length and the minimum string length are stored as index of the partition. At block (420), it is determined whether another partition is available for indexing. If another partition is available, the process continues at block (404); otherwise indexing is complete.
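The indexing flow of FIG. 4 can be sketched in Python as below. This is a hedged illustration rather than the disclosed implementation: the class names `NumericIndex` and `StringIndex`, the function names and the default prefix length of 3 are assumptions, each partition is assumed to be non-empty and to hold values of a single data type, and lexical order is taken to be Python's default string ordering.

```python
from dataclasses import dataclass
from typing import List, Sequence, Union


@dataclass(frozen=True)
class NumericIndex:
    minimum: float
    maximum: float


@dataclass(frozen=True)
class StringIndex:
    min_prefix: str
    max_prefix: str
    min_length: int
    max_length: int


IndexEntry = Union[NumericIndex, StringIndex]


def index_partition(records: Sequence[Union[int, float, str]],
                    prefix_length: int = 3) -> IndexEntry:
    """Build the index entry for one partition (blocks 404 to 418)."""
    # Block (406): decide the data type of the partition contents.
    if all(isinstance(r, (int, float)) for r in records):
        # Blocks (408)-(412): store only the numeric minimum and maximum.
        return NumericIndex(min(records), max(records))
    # Blocks (414)-(418): store prefix and length extremes for strings.
    prefixes = [r[:prefix_length] for r in records]
    lengths = [len(r) for r in records]
    return StringIndex(min(prefixes), max(prefixes),
                       min(lengths), max(lengths))


def index_table(partitions: Sequence[Sequence]) -> List[IndexEntry]:
    """Block (420): repeat until every partition has been indexed."""
    return [index_partition(p) for p in partitions]
```

Because each entry holds at most four small values regardless of how many records the partition contains, the index stays small as the data grows, which matches the stated aim of maintaining a minimal index size.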
FIG. 5 is a flow diagram (500) representing working of the data locality engine in accordance with an embodiment of the present disclosure.
In an example, as illustrated in FIG. 5, at block (502), a partition is selected to perform a data locality check for setting the data locality of the partition such that target data pertaining to a search query can be located based on the data locality. At block (504), it is determined whether the data type of the contents of the partition is numeric. In response to determining that the data type is numeric, at block (508), a data locality check for numeric data is performed, which is further explained with reference to FIGs. 6A-B. However, if the data type is string, at block (506), a data locality check for string data is performed, which is explained with reference to FIGs. 7A-C. At block (510), the results of the data locality check from blocks (508) and (506) are collected so that, at block (512), the results can be provided to the local search engine.
FIG. 6A is a flow diagram (600) representing working of the data locality engine when data type of the partition is numeric in accordance with an embodiment of the present disclosure.
According to an embodiment, at block (602), the index of the partition is examined to extract, at block (604), the minimum numeric value and the maximum numeric value of the partition from the index. At block (606), a search value range pertaining to the target data is examined based on the search query. At block (610), data locality of the partition is set as ‘all’ when, at block (608), the minimum numeric value and the maximum numeric value are determined to be within the search value range. When, at block (608), it is determined that the minimum numeric value and the maximum numeric value are not within the search value range, at block (612), it is determined whether the search value range is within the minimum numeric value and the maximum numeric value. At block (614), data locality of the partition is set as ‘partial’ when the minimum numeric value and the maximum numeric value are not within the search value range and the search value range is within the minimum numeric value and the maximum numeric value. Further, when neither the minimum numeric value and the maximum numeric value are within the search value range nor the search value range is within the minimum numeric value and the maximum numeric value, at block (616), it is determined whether any of the minimum numeric value or the maximum numeric value is within the search value range so that, at block (614), data locality of the partition is set as ‘partial’ when any of the minimum numeric value or the maximum numeric value is within the search value range. When none of the conditions of blocks (608), (612), or (616) is satisfied, at block (618), data locality of the partition is set as ‘none’. Further, at block (620), it is determined whether another partition is pending for data locality check so that, when another partition is pending, the process continues from block (602); otherwise, the process ends.
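By way of a non-limiting illustration, the decisions of blocks (608), (612) and (616) may be sketched as a single function. The function name `numeric_locality`, its parameter names, and the treatment of both range bounds as inclusive are assumptions made for this sketch; the disclosure does not state whether the bounds are inclusive.

```python
def numeric_locality(part_min, part_max, search_lo, search_hi):
    """Return 'all', 'partial' or 'none' for one numeric partition.

    part_min/part_max come from the partition index; search_lo/search_hi
    describe the search value range. Bounds are assumed to be inclusive.
    """
    # Block 608: the whole partition range lies inside the search range.
    if search_lo <= part_min and part_max <= search_hi:
        return "all"
    # Block 612: the search range lies inside the partition range.
    if part_min <= search_lo and search_hi <= part_max:
        return "partial"
    # Block 616: either the minimum or the maximum lies inside the search range.
    if search_lo <= part_min <= search_hi or search_lo <= part_max <= search_hi:
        return "partial"
    # Block 618: the two ranges do not overlap.
    return "none"
```

For a search value range of 20 to 40, hypothetical partitions with (minimum, maximum) of (25, 35), (10, 50), (50, 70) and (10, 30) would be classified as ‘all’, ‘partial’, ‘none’ and ‘partial’ respectively; these partition values are chosen for illustration and are not taken from the figures.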
FIG. 6B illustrates various examples of setting data locality for a partition containing numeric data in accordance with an embodiment of the present disclosure.
In the context of the present examples, a search value range of 20 to 40 is considered as pertaining to a search query. At (650), data locality of a partition is set as ‘all’, as the minimum and maximum values of the contents of the partition are within the search value range. At (660), data locality of a partition is set as ‘partial’, as the search value range is within the minimum and maximum value range. At (670), data locality of a partition is set as ‘none’, as the minimum and maximum value range does not overlap with the search value range. At (680), data locality of a partition is set as ‘partial’, as the maximum value is within the search value range.
FIG. 7A illustrates an exemplary flow diagram (700) representing working of the data locality engine when data type of the partition is string in accordance with an embodiment of the present disclosure. According to an embodiment, at block (702), the index of the partition is examined to extract, at block (704), the minimum string prefix and the maximum string prefix, along with the minimum string length and the maximum string length of the partition, from the index. At block (706), a search string length range and a search string prefix pertaining to the target data are extracted based on the search query. At block (710), data locality of the partition is set as ‘none’ when, at block (708), the search string length range does not overlap with the minimum string length and the maximum string length. On the contrary, when the search string length range overlaps with the minimum and maximum string length, at block (712), it is determined whether the search string prefix matches the minimum and maximum prefix. At block (714), when the search string prefix does not match the minimum and maximum prefix, it is determined whether the search string prefix is within the lexical range of the minimum string prefix and the maximum string prefix. At block (720), data locality of the partition is set as ‘partial’ when the search string prefix is within the lexical range of the minimum string prefix and the maximum string prefix; otherwise, at block (710), data locality of the partition is set as ‘none’. At block (716), when the search string prefix matches the minimum and maximum string prefix, it is determined whether the minimum string length and the maximum string length are within the search string length range. At block (720), data locality of the partition is set as ‘partial’ when the maximum string length and the minimum string length are not within the search string length range.
Further, when the maximum string length and the minimum string length are within the search string length range, at block (718), it is determined whether the search string prefix contains a wildcard suffix such that, at block (722), data locality of the partition is set as ‘all’ when the search string prefix contains the wildcard suffix; otherwise, the data locality of the partition is set as ‘partial’. At block (724), it is determined whether another partition is available for data locality check such that the process is repeated from block (702); otherwise, the process ends.
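By way of a non-limiting illustration, the decisions of blocks (708) to (722) may be sketched as follows. The names `StringIndex` and `string_locality` and all parameter names are assumptions made for this sketch. It is further assumed that the search attributes have already been extracted from the query, that the search prefix has the same length as the indexed prefixes so that "matches" can be tested by equality, that lexical order is Python's code-point order, and that an unbounded maximum search length is passed as `math.inf`.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class StringIndex:
    """Index of a string partition: min/max prefix and min/max record length."""
    min_prefix: str
    max_prefix: str
    min_length: int
    max_length: int


def string_locality(index, search_prefix, search_min_len, search_max_len,
                    has_wildcard_suffix):
    """Return 'all', 'partial' or 'none' for one string partition."""
    # Block 708: the search length range must overlap the partition's length range.
    if search_max_len < index.min_length or search_min_len > index.max_length:
        return "none"
    # Block 712: does the search prefix equal both the minimum and maximum prefix?
    matches = search_prefix == index.min_prefix == index.max_prefix
    if not matches:
        # Block 714: otherwise test the lexical range of the indexed prefixes.
        if index.min_prefix <= search_prefix <= index.max_prefix:
            return "partial"
        return "none"
    # Block 716: are all record lengths inside the search length range?
    lengths_within = (search_min_len <= index.min_length
                      and index.max_length <= search_max_len)
    if not lengths_within:
        return "partial"
    # Block 718: only a wildcard-suffixed prefix guarantees that every record matches.
    return "all" if has_wildcard_suffix else "partial"
```

For instance, under these assumptions a query such as `name like 'del%'` could be passed as `string_locality(index, "del", 3, math.inf, True)`, while an exact-match query of length 5 would use a search length range of 5 to 5 with no wildcard suffix.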
FIGs. 7B-C illustrate various examples of setting data locality for a partition containing string data in accordance with an embodiment of the present disclosure. Referring to FIG. 7B, a search query “name like ‘del%’” is considered. Therefore, the search prefix is considered as “del”. At (710), as the search prefix matches the minimum prefix and the maximum prefix of the partition, the data locality of the partition is set as ‘all’. At (720), as the search prefix is within the lexical range of the minimum prefix and the maximum prefix of the partition, the data locality of the partition is set as ‘partial’. At (730), as the search prefix is not within the lexical range of the minimum prefix and the maximum prefix of the partition, the data locality of the partition is set as ‘none’.
Referring to FIG. 7C, a search query “name like ‘delta’” is considered. Therefore, the search prefix is considered as “delt”. At (740), as the search prefix matches the minimum prefix and the maximum prefix of the partition but the minimum and maximum string length is not within the search string length range, the data locality of the partition is set as ‘partial’. At (750), as the search prefix is within the lexical range of the minimum prefix and the maximum prefix of the partition, the data locality of the partition is set as ‘partial’. At (760), as the search prefix is not within the lexical range of the minimum prefix and the maximum prefix of the partition, the data locality of the partition is set as ‘none’.
FIG. 8 is a flow diagram (800) representing working of the locality search engine in accordance with an embodiment of the present disclosure. According to an embodiment, at block (802), the data locality check for each partition is examined. At block (806), all data from the partition is retrieved when, at block (804), the data locality of the partition is determined to be ‘all’. At block (812), searching is performed in the partition when, at block (808), it is determined that the data locality of said partition is ‘partial’. Further, at block (814), matched data from the partition is retrieved. Those skilled in the art would appreciate that a partition with data locality ‘none’ is not considered for retrieving data; thus, at block (810), when data locality of the partition is neither ‘all’ nor ‘partial’, the partition is skipped. At block (816), it is determined whether another partition is pending for searching, such that if another partition is available, the process continues at block (802); otherwise, search results are returned at block (818), and the process ends.
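By way of a non-limiting illustration, the loop of blocks (802) to (818) may be sketched as follows. The function name `locality_search`, the representation of partitions as a mapping from partition name to a list of records, and the `predicate` callable standing in for the search query are assumptions made for this sketch.

```python
def locality_search(partitions, localities, predicate):
    """Collect matching records using the data locality set for each partition.

    partitions: {name: list of records}
    localities: {name: 'all' | 'partial' | 'none'}
    predicate:  callable returning True for records that satisfy the query
    """
    results = []
    for name, rows in partitions.items():
        # Block 802: examine the data locality set for this partition.
        locality = localities.get(name, "none")
        if locality == "all":
            # Blocks 804-806: every record matches, so no per-record test is needed.
            results.extend(rows)
        elif locality == "partial":
            # Blocks 808-814: search inside the partition and keep matched records.
            results.extend(row for row in rows if predicate(row))
        # Block 810: any other locality means the partition is skipped.
    # Block 818: return the accumulated search results.
    return results
```

In this sketch, the per-record test is evaluated only for partitions marked ‘partial’; partitions marked ‘none’ are never read, and partitions marked ‘all’ are copied without evaluation.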
FIG. 9 illustrates an exemplary computer system (900) in which or with which embodiments of the present invention can be utilized.
As shown in FIG. 9, computer system (900), which may represent the proposed system or data manager (102), can include an external storage device (910), a bus (920), a main memory (930), a read only memory (940), a mass storage device (950), a communication port (960), and a processor (970). A person skilled in the art will appreciate that the computer system may include more than one processor and more than one communication port. Examples of processor (970) include, but are not limited to, an Intel® Itanium® or Itanium 2 processor(s), or AMD® Opteron® or Athlon MP® processor(s), Motorola® lines of processors, FortiSOC™ system on a chip processors or other future processors. Processor (970) may include various modules associated with embodiments of the present invention. Communication port (960) can be any of an RS-232 port for use with a modem-based dialup connection, a 10/100 Ethernet port, a Gigabit or 10 Gigabit port using copper or fiber, a serial port, a parallel port, or other existing or future ports. Communication port (960) may be chosen depending on a network, such as a Local Area Network (LAN), a Wide Area Network (WAN), or any network to which the computer system connects.
Memory (930) can be Random Access Memory (RAM), or any other dynamic storage device commonly known in the art. Read only memory (940) can be any static storage device(s), e.g., but not limited to, Programmable Read Only Memory (PROM) chips for storing static information, e.g., start-up or BIOS instructions for processor (970). Mass storage (950) may be any current or future mass storage solution, which can be used to store information and/or instructions. Exemplary mass storage solutions include, but are not limited to, Parallel Advanced Technology Attachment (PATA) or Serial Advanced Technology Attachment (SATA) hard disk drives or solid-state drives (internal or external, e.g., having Universal Serial Bus (USB) and/or Firewire interfaces), e.g. those available from Seagate (e.g., the Seagate Barracuda 7200 family) or Hitachi (e.g., the Hitachi Deskstar 7K1000), one or more optical discs, and Redundant Array of Independent Disks (RAID) storage, e.g. an array of disks (e.g., SATA arrays), available from various vendors including Dot Hill Systems Corp., LaCie, Nexsan Technologies, Inc. and Enhance Technology, Inc.
Bus (920) communicatively couples processor(s) (970) with the other memory, storage and communication blocks. Bus (920) can be, e.g., a Peripheral Component Interconnect (PCI) / PCI Extended (PCI-X) bus, Small Computer System Interface (SCSI), USB or the like, for connecting expansion cards, drives and other subsystems, as well as other buses, such as a front side bus (FSB), which connects processor (970) to software system.
Optionally, operator and administrative interfaces, e.g. a display, keyboard, and a cursor control device, may also be coupled to bus (920) to support direct operator interaction with the computer system. Other operator and administrative interfaces can be provided through network connections connected through communication port (960). External storage device (910) can be any kind of external hard drive, floppy drive, IOMEGA® Zip Drive, Compact Disc - Read Only Memory (CD-ROM), Compact Disc - Re-Writable (CD-RW), or Digital Video Disk - Read Only Memory (DVD-ROM). Components described above are meant only to exemplify various possibilities. In no way should the aforementioned exemplary computer system limit the scope of the present disclosure.
The terminology used herein is for the purpose of describing particular example embodiments only and is not intended to be limiting. As used herein, the singular forms "a", "an" and "the" may be intended to include the plural forms as well, unless the context clearly indicates otherwise.
The terms “comprises,” “comprising,” “including,” and “having” are inclusive and therefore specify the presence of stated features, integers, steps, operations, elements, or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, or groups thereof. The method steps, processes, and operations described herein are not to be construed as necessarily requiring their performance in the particular order discussed or illustrated, unless specifically identified as an order of performance. It is also to be understood that additional or alternative steps may be employed.
The use of the expression “at least” or “at least one” suggests the use of one or more elements, as the use may be in one of the embodiments to achieve one or more of the desired objects or results.
Although the process steps, method steps, algorithms or the like may be described in a sequential order, such processes, methods and algorithms may be configured to work in alternate orders. In other words, any sequence or order of steps that may be described does not necessarily indicate a requirement that the steps be performed in that order. The steps of processes described herein may be performed in any order practical. Further, some steps may be performed simultaneously, in parallel, or concurrently.
Embodiments of the present invention may be provided as a computer program product, which may include a machine-readable storage medium tangibly embodying thereon instructions, which may be used to program a computer (or other electronic devices) to perform a process. The machine-readable medium may include, but is not limited to, fixed (hard) drives, magnetic tape, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), and magneto-optical disks, semiconductor memories, such as ROMs, PROMs, random access memories (RAMs), programmable read-only memories (PROMs), erasable PROMs (EPROMs), electrically erasable PROMs (EEPROMs), flash memory, magnetic or optical cards, or other type of media/machine-readable medium suitable for storing electronic instructions (e.g., computer programming code, such as software or firmware).
Various methods described herein may be practiced by combining one or more machine-readable storage media containing the code according to the present invention with appropriate standard computer hardware to execute the code contained therein. An apparatus for practicing various embodiments of the present invention may involve one or more computers (or one or more processors within a single computer) and storage systems containing or having network access to computer program(s) coded in accordance with various methods described herein, and the method steps of the invention could be accomplished by modules, routines, subroutines, or subparts of a computer program product. While the foregoing describes various embodiments of the invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof. The scope of the invention is determined by the claims that follow. The invention is not limited to the described embodiments, versions or examples, which are included to enable a person having ordinary skill in the art to make and use the invention when combined with information and knowledge available to the person having ordinary skill in the art.

Claims

1. A system for managing configuration of a storage medium for locating data locations among a plurality of partitions of the storage medium, comprising a data manager (102) that is operatively coupled with a computing device for management of a storage medium for locating data locations among a plurality of partitions of the storage medium, characterized in that the data manager (102) includes: i. a data indexer engine (112) for indexing each partition of the plurality of partitions by examining contents of each partition and extracting any or a combination of a minimum value and a maximum value of the contents as an index of the corresponding partition; ii. a data locality engine (114) for locating, in response to a search query, at least one target partition of the plurality of partitions that stores a target data, by setting a data locality for each partition based on comparing a search attribute with any or a combination of the minimum value and the maximum value of the contents extracted as the index; and iii. a local search engine (116) for searching the target data in the located target partition based on the data locality of each partition.
2. The system of claim 1, wherein said data indexer engine (112) further: selects a partition of the plurality of partitions to determine data type of the contents of said partition; examines, when the determined data type of the contents is numeric, numeric data in the partition to find a maximum numeric value and a minimum numeric value of the numeric data and storing the maximum numeric value and the minimum numeric value as index of the partition; and extracts, when the determined data type of the contents is string, length of the string and a string prefix for plurality of records of the partition to determine a maximum string prefix and a minimum string prefix in lexical order and a maximum string length and a minimum string length and storing any or a combination of the determined maximum string prefix, the determined minimum string prefix, the determined maximum string length and the determined minimum string length as index.
3. The system of claim 1, wherein the data locality engine (114) sets the data locality for each partition as any of ‘all’, ‘partial’ or ‘none’ based on said comparing.
4. The system of claim 1, wherein the local search engine (116) further: examines the data locality of each partition; retrieves all data from said partition, when the data locality for the partition is set as ‘all’; and searches, when the data locality for the partition is set as ‘partial’, the target data against the search query to retrieve match data from the partition.
5. The system of claim 3, wherein the target partition comprises the partition associated with data locality as ‘all’ or ‘partial’.
6. The system of claim 4, wherein the local search engine (116) does not search the partition associated with data locality as ‘none’.
7. The system of claim 1, wherein the data locality engine (114) further: selects a partition of the plurality of partition to perform data locality check on said partition; determines a data type of the contents of said partition; and performs the data locality check based on the determined data type.
8. The system of claim 7, wherein when data type of the partition is numeric, the data locality engine (114): extracts the minimum numeric value and the maximum numeric value of the partition from the index; examines a search value range pertaining to the target data based on the search query; sets data locality of the partition as ‘all’ when the minimum numeric value and the maximum numeric value are within the search value range; sets data locality of the partition as ‘partial’ when the minimum numeric value and the maximum numeric value are not within the search value range and the search value range is within the minimum numeric value and the maximum numeric value; and sets data locality of the partition as ‘partial’ when any of the minimum numeric value or the maximum numeric value is within the search value range, or else sets data locality of the partition as ‘none’.
9. The system of claim 7, wherein when the data type of the partition is string, the data locality engine (114): extracts any or a combination of the minimum string length, the maximum string length, the minimum string prefix and the maximum string prefix of the partition from the index; extracts a search string length range and a search string prefix pertaining to the target data based on the search query; sets data locality of the partition as ‘none’ when the search string length range does not overlap with the minimum string length and the maximum string length; determines whether the search string prefix matches with the minimum and maximum string prefix when the search string length range overlaps with the minimum string length and the maximum string length; in response to said determining being negative, sets data locality of the partition as ‘partial’ when the search string prefix is within the minimum and the maximum string prefix, or else sets data locality of the partition as ‘none’; and in response to said determining being affirmative, sets data locality of the partition as ‘partial’ when the minimum string length and the maximum string length is not within search string length range, or else: sets data locality of the partition as ‘all’ when the search string prefix contains a wildcard suffix, or else sets the data locality of the partition as ‘partial’.
10. A method for managing configuration of a storage medium for locating data locations among a plurality of partitions of the storage medium, characterized in that the method comprises the steps of: configuring a data manager that is operatively coupled with a computing device, wherein the data manager performs the steps of: i. indexing (302) each partition of the plurality of partitions by examining contents of each partition and extracting a minimum value and a maximum value of the contents as an index; ii. locating (304), in response to a search query, at least one target partition of the plurality of partitions that stores a target data, by setting a data locality for each partition based on comparing a search attribute with the minimum value and the maximum value of the contents stored as the index of the corresponding partition; and iii. searching (306) the target data in the located target partition based on the data locality of each partition.

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
MYPI2018001747 2018-10-15

Publications (1)

Publication Number Publication Date
WO2020080931A1 (en) 2020-04-23

Family

ID=70284245

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/MY2019/050076 WO2020080931A1 (en) 2018-10-15 2019-10-15 Management of data for content based data locality search

Country Status (1)

Country Link
WO (1) WO2020080931A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040172387A1 (en) * 2003-02-28 2004-09-02 Jeff Dexter Apparatus and method for matching a query to partitioned document path segments
KR20060049239A (en) * 2004-09-27 2006-05-18 마이크로소프트 코포레이션 System and method for scoping searches using index keys
US20110218972A1 (en) * 2010-03-08 2011-09-08 Quantum Corporation Data reduction indexing
US20140324880A1 (en) * 2010-03-10 2014-10-30 Emc Corporation Index searching using a bloom filter
EP2937794A1 (en) * 2014-04-22 2015-10-28 DataVard GmbH Method and system for archiving digital data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19872670

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19872670

Country of ref document: EP

Kind code of ref document: A1