WO2020146448A1 - Method and system for content agnostic file indexing - Google Patents

Method and system for content agnostic file indexing

Info

Publication number
WO2020146448A1
Authority
WO
WIPO (PCT)
Prior art keywords
chunks
binary data
chunk
data file
pregenerated
Prior art date
Application number
PCT/US2020/012661
Other languages
English (en)
French (fr)
Inventor
Christopher Mcelveen
Original Assignee
Lognovations Holdings, Llc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US16/244,332 (US11138152B2)
Application filed by Lognovations Holdings, Llc
Priority to KR1020217025238A (KR20210110875A)
Priority to EP20737931.4A (EP3908937A4)
Priority to JP2021540318A (JP2022518194A)
Priority to CA3126012A (CA3126012A1)
Priority to AU2020205970A (AU2020205970A1)
Publication of WO2020146448A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3084 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
    • H03M7/3088 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method employing the use of a dictionary, e.g. LZ78
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/60 General implementation details not specific to a particular type of compression
    • H03M7/6052 Synchronisation of encoder and decoder

Definitions

  • This disclosure relates to a method for content agnostic file referencing.
  • The method may further relate to a method for content agnostic data compression.
  • File referencing techniques generally require knowledge about the kind of data being stored in order to efficiently index the data in a file referencing system. Similarly, knowledge about the data at issue is also generally used in creating improved compression approaches to reduce data size for transmission, storage, and the like.
  • This disclosure provides a method for improving computing technology with an enhanced content-agnostic file referencing system. The method improves the operation of the computer itself.
  • The disclosed method has several important advantages. For example, the disclosed method permits file referencing of any content type.
  • The disclosed method additionally permits a significant reduction in the amount of information or data that must be persisted or transmitted, as data may be generated at access time rather than persisted.
  • Various embodiments of the present disclosure may have none, some, or all of these advantages. Other technical advantages of the present disclosure may also be readily apparent to one skilled in the art.
  • FIG. 1 is a flowchart outlining the steps of one embodiment of the present disclosure.
  • FIG. 2 is another flowchart outlining the steps of another embodiment of the present disclosure.
  • FIG. 3 is a flowchart outlining the steps of an alternate embodiment of the present disclosure.
  • The present disclosure relates to a method for content-agnostic indexing of data.
  • The method may be used for a variety of computer-specific needs, including, for example, as a file referencing system or a compression system.
  • One embodiment of the present invention comprises a method as described in the flow chart depicted in FIG. 1.
  • Binary data, for instance a data file, is input into the method, and the method determines the length of that data.
  • Using this information, at step 106, the method calculates all permutations of data of the identified length. For example, if the input data is “01”, the identified length is 2 and the permutations are 00, 01, 10, and 11.
  • The method then determines the index of the input binary data file in the generated permutations. Using the example above, the index returned would be “1”, counting from zero. Finally, rather than storing or transmitting the input binary data (i.e., “01”), the system instead stores the length (2) and the index (1).
  • To regenerate the original data, the method needs only a length and an index as input.
  • In the example above, the input provided would be the length (2) and the index (1).
  • The system calculates all permutations of the inputted length. As above, that would generate the following permutations: 00, 01, 10, and 11. The entry at index 1, “01”, is the original data.
  • n: length, in the appropriate n-ary units respective to the order of the system
  • steps generate) join.to_s step Input.kmp_search("#{steps}", "#{input}") p input
  • An input byte string is converted into a bit string corresponding to a representation of the input byte string. This bit string is then processed through the method described herein.
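The length-and-index scheme described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation from the disclosure: the function names (`all_permutations`, `squeeze`, `unsqueeze`, `bytes_to_bits`) are hypothetical, and the sketch assumes that permutations are enumerated in ascending binary order, which is consistent with the index of 1 reported for “01”. Materializing every permutation is practical only for very short inputs, because the list holds 2^length entries.

```python
from itertools import product


def all_permutations(length: int) -> list[str]:
    """Every bit string of the given length, in ascending binary order (assumed)."""
    return ["".join(bits) for bits in product("01", repeat=length)]


def squeeze(bits: str) -> tuple[int, int]:
    """Return (length, index) of `bits` among all bit strings of that length."""
    length = len(bits)
    return length, all_permutations(length).index(bits)


def unsqueeze(length: int, index: int) -> str:
    """Regenerate the bit string from its length and index."""
    return all_permutations(length)[index]


def bytes_to_bits(data: bytes) -> str:
    """Convert a byte string into the bit string that the method processes."""
    return "".join(f"{byte:08b}" for byte in data)
```

With the example input, `squeeze("01")` yields the pair `(2, 1)`, and `unsqueeze(2, 1)` regenerates `"01"`.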
  • A table may be pregenerated with all permutations of data of a particular length. This pregenerated table may be persisted in memory, either non-volatile or volatile. Using the above example, if the predetermined length is 2 bits, the pregenerated table will include all permutations of 2-bit data, namely 00, 01, 10, and 11.
  • This table may be stored in an array with corresponding indices as follows: index 0 holds 00, index 1 holds 01, index 2 holds 10, and index 3 holds 11.
  • This pregenerated table may be stored on disk, in RAM, or otherwise.
  • This pregenerated table is stored with both the computing system that reduces file size (or “squeezes” a file) and the computing system that expands a reduced file (or “unsqueezes” the data).
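A pregenerated table of this kind might be built once and shared by the squeezing and unsqueezing systems. The sketch below is illustrative only: `build_table`, `TABLE`, and `INDEX_OF` are hypothetical names, the ascending ordering of entries is an assumption, and the reverse dictionary is simply one convenient way to find a chunk's index without scanning the array.

```python
from itertools import product


def build_table(width: int) -> list[str]:
    """Pregenerate every bit string of `width` bits; the list position is the index."""
    return ["".join(bits) for bits in product("01", repeat=width)]


# Table for the 2-bit example; both sides must build it identically.
TABLE = build_table(2)

# Reverse lookup from chunk to index (an assumed convenience, not from the source).
INDEX_OF = {chunk: index for index, chunk in enumerate(TABLE)}
```

The squeezing side would use `INDEX_OF[chunk]` to obtain an index, and the unsqueezing side would use `TABLE[index]` to recover the chunk.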
  • The method “chunks” the data into smaller subsets of data.
  • To “chunk” means to take a data string and create smaller data strings comprising subsets of the larger data string. All chunks together would form the original data string. For example, if the input data is:
  • Each individual chunk will then be compared to the pregenerated table to see if there is a match.
  • In this example, no chunk will be found in the table, because the chunks are 4 bits long and the table holds only the permutations of 2-bit data.
  • Each chunk will be chunked again, resulting in the following:
  • The method will continue for each chunk until the particular chunk is located in the pregenerated table. At that point, the chunk will be associated with its respective index, and preferably a series of tuples will be generated indicating the chunk level and the corresponding index.
  • The system chunked twice, so the index association will be as follows:
  • Each chunk is represented with a chunk level (2) and a corresponding index into the pregenerated table.
  • The data may be chunked in any number of ways. For instance, the data may be chunked based on a predetermined size, as in the above example (where the predetermined size was 4 bits for purposes of example). Alternatively, the input data may be recursively chunked into 2 separate data chunks, until each data chunk may be found in the pregenerated table. Using the same input data as above, a method of chunking the data by splitting it would result in the following first-level chunk:
  • Some segments are chunked into data smaller than the pregenerated table size (i.e., segments “1”, “1”, “0”, and “1”). These segments may be padded in order to compare them to the pregenerated table.
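The recursive splitting and padding described above might look like the following sketch. Several details are assumptions made for illustration and are not stated in the source: the name `split_recursively`, counting the unsplit input as level 0, giving the first half the extra bit when the length is odd, padding short segments on the left with zeros, and recording each segment's unpadded length so the padding could later be removed.

```python
from itertools import product

TABLE_WIDTH = 2  # assumed table size, matching the 2-bit example
TABLE = ["".join(bits) for bits in product("01", repeat=TABLE_WIDTH)]
INDEX_OF = {chunk: index for index, chunk in enumerate(TABLE)}


def split_recursively(bits: str, level: int = 0) -> list[tuple[int, int, int]]:
    """Halve `bits` until every piece fits the pregenerated table.

    Returns (chunk level, table index, unpadded length) triples in input order.
    """
    if len(bits) <= TABLE_WIDTH:
        # Assumption: segments shorter than the table width are left-padded with zeros.
        padded = bits.rjust(TABLE_WIDTH, "0")
        return [(level, INDEX_OF[padded], len(bits))]
    # Assumption: for odd lengths, the first half takes the extra bit.
    middle = (len(bits) + 1) // 2
    return (split_recursively(bits[:middle], level + 1)
            + split_recursively(bits[middle:], level + 1))
```

For the hypothetical input `"110"`, the first split produces `"11"` and `"0"`; the second segment is padded to `"00"` before lookup, and its recorded length of 1 marks the padding.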
  • The numbers may be stored using either big-endian or little-endian byte order, so long as consistency is maintained. Using big-endian byte order, for example, the chunked data above would be represented as:
  • The data may be originally chunked as above, by breaking it into 4-bit sequences:
  • A Pregenerated Table comprising all permutations of data of a particular length is created at step 302. As indicated above, that table is preferably persisted in some fashion.
  • The system receives input data to be squeezed at step 304.
  • The process then chunks the data into smaller segments, at steps 306 and 308, until each segment is of a length that would be located in the Pregenerated Table.
  • The process maintains the chunk level so that the system knows how many times an input data set has been chunked. Each chunk is then located in the Pregenerated Table at step 310.
  • The chunk, its chunk level, and the respective index in the Pregenerated Table are associated at step 312, resulting in the squeezed data.
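The flow of steps 302 through 312 can be sketched end to end as below. This is a hedged illustration rather than the disclosed implementation: all function names are hypothetical, the first chunk size of 4 bits and the halving of the chunk size at each level are assumptions, the input `"01101100"` is an invented example, and the `unsqueeze` function assumes that concatenating the table entries in order restores the data.

```python
from itertools import product


def build_table(width: int) -> list[str]:
    """Step 302: pregenerate every bit string of the given width."""
    return ["".join(bits) for bits in product("01", repeat=width)]


def split(bits: str, size: int) -> list[str]:
    """Cut `bits` into consecutive pieces of at most `size` bits."""
    return [bits[i:i + size] for i in range(0, len(bits), size)]


def squeeze(bits: str, table: list[str], first_size: int = 4) -> list[tuple[int, int]]:
    """Steps 304-312: chunk the input and map each chunk to (level, index)."""
    width = len(table[0])
    index_of = {chunk: index for index, chunk in enumerate(table)}
    level, size = 1, first_size
    chunks = split(bits, size)
    while size > width:                  # steps 306 and 308: chunk again
        size = max(width, size // 2)     # assumption: halve the chunk size
        level += 1                       # the chunk level is maintained
        chunks = [piece for chunk in chunks for piece in split(chunk, size)]
    if any(chunk not in index_of for chunk in chunks):
        raise ValueError("a chunk does not match the table width")
    return [(level, index_of[chunk]) for chunk in chunks]  # steps 310 and 312


def unsqueeze(squeezed: list[tuple[int, int]], table: list[str]) -> str:
    """Rebuild the data by looking each index up in the same table."""
    return "".join(table[index] for _, index in squeezed)
```

A round trip on the invented input would be `table = build_table(2)`, then `squeezed = squeeze("01101100", table)`, then `unsqueeze(squeezed, table)`, which returns the original bit string.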

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
PCT/US2020/012661 2019-01-10 2020-01-08 Method and system for content agnostic file indexing WO2020146448A1 (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
KR1020217025238A KR20210110875A (ko) 2019-01-10 2020-01-08 콘텐츠 애그노스틱 파일 인덱싱을 위한 방법 및 시스템
EP20737931.4A EP3908937A4 (en) 2019-01-10 2020-01-08 METHOD AND SYSTEM FOR INDEXING CONTENT AGNOSTIC FILES
JP2021540318A JP2022518194A (ja) 2019-01-10 2020-01-08 コンテンツ不可知ファイルインデキシングの方法及びシステム
CA3126012A CA3126012A1 (en) 2019-01-10 2020-01-08 Method and system for content agnostic file indexing
AU2020205970A AU2020205970A1 (en) 2019-01-10 2020-01-08 Method and system for content agnostic file indexing

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US16/244,332 2019-01-10
US16/244,332 US11138152B2 (en) 2017-10-11 2019-01-10 Method and system for content agnostic file indexing

Publications (1)

Publication Number Publication Date
WO2020146448A1 (en) 2020-07-16

Family

ID=71520909

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2020/012661 WO2020146448A1 (en) 2019-01-10 2020-01-08 Method and system for content agnostic file indexing

Country Status (6)

Country Link
EP (1) EP3908937A4
JP (1) JP2022518194A
KR (1) KR20210110875A
AU (1) AU2020205970A1
CA (1) CA3126012A1
WO (1) WO2020146448A1

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11138152B2 (en) 2017-10-11 2021-10-05 Lognovations Holdings, Llc Method and system for content agnostic file indexing

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060244639A1 (en) * 2003-10-17 2006-11-02 Bruce Parker Data compression system and method
US20090319536A1 (en) * 2006-09-01 2009-12-24 Pacbyte Software Pty Limited Method and system for transmitting a data file over a data network
US20110125727A1 (en) * 2003-09-29 2011-05-26 Shenglong Zou Content oriented index and search method and system
US20120166448A1 (en) * 2010-12-28 2012-06-28 Microsoft Corporation Adaptive Index for Data Deduplication
US20150201043A1 (en) * 2010-08-20 2015-07-16 Abdulrahman Ahmed Sulieman Methods and systems for encoding/decoding files and transmissions thereof
US20190146950A1 (en) * 2017-10-11 2019-05-16 Lognovations Holdings, Llc Method and System for Content Agnostic File Indexing

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5594435A (en) * 1995-09-13 1997-01-14 Philosophers' Stone Llc Permutation-based data compression
US20050071151A1 (en) * 2003-09-30 2005-03-31 Ali-Reza Adl-Tabatabai Compression-decompression mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3908937A4 *

Also Published As

Publication number Publication date
KR20210110875A (ko) 2021-09-09
CA3126012A1 (en) 2020-07-16
EP3908937A1 (en) 2021-11-17
JP2022518194A (ja) 2022-03-14
EP3908937A4 (en) 2022-09-28
AU2020205970A1 (en) 2021-08-05

Similar Documents

Publication Publication Date Title
US11138152B2 (en) Method and system for content agnostic file indexing
US20220093210A1 (en) System and method for characterizing biological sequence data through a probabilistic data structure
US11899641B2 (en) Trie-based indices for databases
US8554561B2 (en) Efficient indexing of documents with similar content
US20170351737A1 (en) Methods and systems for autonomous memory searching
US10680645B2 (en) System and method for data storage, transfer, synchronization, and security using codeword probability estimation
US10509771B2 (en) System and method for data storage, transfer, synchronization, and security using recursive encoding
US11899624B2 (en) System and method for random-access manipulation of compacted data files
WO2020146448A1 (en) Method and system for content agnostic file indexing
US11544225B2 (en) Method and system for content agnostic file indexing
Lou et al. Data deduplication with random substitutions
US11995060B2 (en) Hashing a data set with multiple hash engines
US11397707B2 (en) System and method for computer data type identification
US20220245104A1 (en) Hashing for deduplication through skipping selected data

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 20737931; Country of ref document: EP; Kind code of ref document: A1
ENP Entry into the national phase. Ref document number: 3126012; Country of ref document: CA
ENP Entry into the national phase. Ref document number: 2021540318; Country of ref document: JP; Kind code of ref document: A
NENP Non-entry into the national phase. Ref country code: DE
ENP Entry into the national phase. Ref document number: 2020205970; Country of ref document: AU; Date of ref document: 20200108; Kind code of ref document: A
ENP Entry into the national phase. Ref document number: 20217025238; Country of ref document: KR; Kind code of ref document: A
ENP Entry into the national phase. Ref document number: 2020737931; Country of ref document: EP; Effective date: 20210810