WO2020146448A1 - Method and system for content agnostic file indexing
- Publication number
- WO2020146448A1 (application PCT/US2020/012661)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- chunks
- binary data
- chunk
- data file
- pregenerated
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/174—Redundancy elimination performed by the file system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/3084—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
- H03M7/3088—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method employing the use of a dictionary, e.g. LZ78
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/60—General implementation details not specific to a particular type of compression
- H03M7/6052—Synchronisation of encoder and decoder
Definitions
- This disclosure relates to a method for content agnostic file referencing.
- The method may further relate to a method for content agnostic data compression.
- File referencing techniques generally require knowledge about the kind of data being stored in order to efficiently index the data in a file referencing system. Similarly, knowledge about the data at issue is also generally used in creating improved compression approaches to reduce data size for transmission, storage, and the like.
- This disclosure provides a method for improving computing technology with an enhanced content-agnostic file referencing system. The method improves the operation of the computer itself.
- The disclosed method has several important advantages. For example, the disclosed method permits file referencing of any content type.
- The disclosed method additionally permits a significant reduction in the amount of information or data that must be persisted or transmitted, as data may be generated at access time rather than persisted.
- Various embodiments of the present disclosure may have none, some, or all of these advantages. Other technical advantages of the present disclosure may also be readily apparent to one skilled in the art.
- FIG. 1 is a flowchart outlining the steps of one embodiment of the present disclosure.
- FIG. 2 is another flowchart outlining the steps of another embodiment of the present disclosure.
- FIG. 3 is a flowchart outlining the steps of an alternate embodiment of the present disclosure.
- The present disclosure relates to a method for content-agnostic indexing of data.
- The method may be used for a variety of computer-specific needs, including, for example, as a file referencing system or a compression system.
- One embodiment of the present invention comprises a method as described in the flow chart depicted in FIG. 1.
- Binary data, for instance a data file, is the input to the method, and the length of that binary data is identified.
- Using this information, at step 106 the method calculates all permutations of data of the identified length. For example, if the input data is “01”, the identified length is 2 and the permutations of that length are “00”, “01”, “10”, and “11”.
- The method then determines the index of the input binary data file in the generated permutations. Using the example above, the index returned would be “1”. Finally, rather than storing or transmitting the input binary data (i.e. “01”), the system instead stores the length (2) and the index (1).
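The length-and-index step above can be sketched roughly as follows. This is a minimal illustration, not the patent's own listing: the function name `squeeze` is hypothetical, and the sketch assumes the permutations are enumerated in ascending binary order, which is consistent with “01” receiving index 1.

```python
from itertools import product


def squeeze(bits: str) -> tuple[int, int]:
    """Return (length, index) for a bit string such as "01".

    Assumption: permutations are enumerated in ascending binary order.
    """
    length = len(bits)
    # Enumerate every bit pattern of the identified length and stop at
    # the one that equals the input; its position is the index.
    for index, pattern in enumerate(product("01", repeat=length)):
        if "".join(pattern) == bits:
            return length, index
    raise ValueError("input must contain only '0' and '1'")


print(squeeze("01"))  # (2, 1)
```

Enumerating all patterns costs time exponential in the length, so a sketch like this is only practical for very short inputs.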
- The method needs only a length and an index as input.
- Using the example above, the input provided would be the length (2) and the index (1).
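Recovering data from a length and an index can be sketched in the same spirit. The name `unsqueeze` is hypothetical, and the ascending enumeration order is an assumption that must match whatever order was used when the index was produced.

```python
from itertools import islice, product


def unsqueeze(length: int, index: int) -> str:
    """Return the bit string found at `index` among all patterns of `length` bits.

    Assumption: patterns are enumerated in ascending binary order.
    """
    patterns = product("01", repeat=length)
    try:
        # Skip `index` patterns and take the next one.
        pattern = next(islice(patterns, index, None))
    except StopIteration:
        raise ValueError("index out of range for this length") from None
    return "".join(pattern)


print(unsqueeze(2, 1))  # 01
```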
- The system calculates all permutations of the inputted length. As above, that would generate the following permutations: “00”, “01”, “10”, and “11”.
- n length in appropriate n-ary units respective to the order of the system
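- [Code listing damaged in extraction. Recoverable fragments: the generated permutations are joined into a string (`steps`), the input is located with a `kmp_search` call on `steps` and `input`, and `input` is printed.]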
- An input byte string is converted into a bit string corresponding to a representation of the input byte string. This bit string is then processed through the method described herein.
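One way to perform the byte-string-to-bit-string conversion is sketched below. The helper names are hypothetical, and the choice of 8 bits per byte with the most significant bit first is an assumption; the text does not specify a bit order.

```python
def bytes_to_bits(data: bytes) -> str:
    """Render each byte as 8 characters of '0'/'1', most significant bit first."""
    return "".join(f"{byte:08b}" for byte in data)


def bits_to_bytes(bits: str) -> bytes:
    """Inverse of bytes_to_bits; the length must be a multiple of 8."""
    if len(bits) % 8:
        raise ValueError("bit string length must be a multiple of 8")
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


print(bytes_to_bits(b"A"))  # 01000001
```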
- A table may be pregenerated with all permutations of data of a particular length. This pregenerated table may be persisted in memory, either non-volatile or volatile. Using the above example, if the predetermined length is 2 bits, the pregenerated table will include all permutations of 2-bit data, such as “00”, “01”, “10”, and “11”.
- This table may be stored in an array with corresponding indices as follows: index 0 holds “00”, index 1 holds “01”, index 2 holds “10”, and index 3 holds “11”.
- This pregenerated table may be stored on disk, in RAM, or otherwise.
- This pregenerated table is stored with both the computing system that reduces file size (or squeezes a file) and the computing system that expands a reduced file (or unsqueezes the data).
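A pregenerated table of this kind might be built as in the sketch below. The names `build_table`, `TABLE`, and `INDEX_OF` are hypothetical, and the ascending ordering is an assumption; what matters is that the squeezing and unsqueezing systems build the table the same way.

```python
from itertools import product


def build_table(length: int) -> list[str]:
    """Return every bit pattern of `length` bits in ascending binary order."""
    return ["".join(p) for p in product("01", repeat=length)]


# Array form: position in the list is the index.
TABLE = build_table(2)

# Reverse lookup, so a chunk can be mapped to its index without scanning.
INDEX_OF = {pattern: i for i, pattern in enumerate(TABLE)}

print(TABLE)           # ['00', '01', '10', '11']
print(INDEX_OF["01"])  # 1
```

The table holds 2^n entries for n-bit patterns, which is why the text pairs it with chunking of longer inputs.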
- The method “chunks” the data into smaller subsets of data.
- “Chunk” means to take a data string and create smaller data strings comprising subsets of the larger data string. All chunks together would form the original data string. For example, if the input data is:
- Each individual chunk will then be compared to the pregenerated table to see if there is a match.
- In this example, none of the 4-bit chunks will be found in the table, because the table holds only the permutations of 2-bit data.
- Each chunk will therefore be chunked again, resulting in the following:
- The method will continue for each chunk until the particular chunk is located in the pregenerated table. At that point, the chunk will be associated with its respective index, and preferably a series of tuples will be generated indicating the chunk level and the corresponding index.
- In this example, the system chunked twice, so the index association will be as follows:
- Each chunk is represented with a chunk level (2) and the corresponding index into the pregenerated table.
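The chunk-until-found procedure can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the function name `squeeze_chunks`, the halving strategy for the second and later chunking passes, and the sample input `"11010010"` are all hypothetical, and padding of short chunks is not handled.

```python
def squeeze_chunks(bits: str, table_bits: int = 2,
                   first_chunk_bits: int = 4) -> list[tuple[int, int]]:
    """Chunk `bits` until every chunk fits the table; return (level, index) tuples."""
    if not bits or len(bits) % first_chunk_bits:
        raise ValueError("length must be a positive multiple of first_chunk_bits")
    # Pregenerated table lookup: every pattern of table_bits bits -> its index.
    index_of = {format(i, f"0{table_bits}b"): i for i in range(2 ** table_bits)}

    # Level 1: split the input into chunks of the predetermined size.
    size, level = first_chunk_bits, 1
    chunks = [bits[i:i + size] for i in range(0, len(bits), size)]

    # Chunk again (by halving) while the chunks are wider than the table.
    while size > table_bits:
        if size % 2:
            raise ValueError("chunk size cannot be halved down to table_bits")
        size //= 2
        level += 1
        chunks = [c[i:i + size] for c in chunks for i in range(0, len(c), size)]
    if size != table_bits:
        raise ValueError("chunk size cannot be halved down to table_bits")

    return [(level, index_of[c]) for c in chunks]


print(squeeze_chunks("11010010"))  # [(2, 3), (2, 1), (2, 0), (2, 2)]
```

With a 2-bit table and 4-bit first-level chunks, the data is chunked twice, so every tuple carries chunk level 2.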
- The data may be chunked in any number of ways. For instance, the data may be chunked based on a predetermined size as in the above example (where the predetermined size was 4 bits for purposes of example). Alternatively, the input data may be recursively chunked into 2 separate data chunks until each data chunk can be found in the pregenerated table. Using the same input data as above, a method of chunking the data by splitting it would result in the following first level chunk:
- The segments “1”, “1”, “0”, and “1” are smaller than the pregenerated table size. These segments may be padded in order to compare them to the pregenerated table.
- The numbers may be stored using either big endian or little endian byte order, so long as consistency is maintained. Using big endian byte order, for example, the chunked data above would be represented as:
- The data may be originally chunked as above, by breaking it into 4-bit sequences:
- A Pregenerated Table comprising all permutations of data of a particular length is created at step 302. As indicated above, that table is preferably persisted in some fashion.
- The system receives input data to be squeezed at step 304.
- The process then chunks the data into smaller segments, at steps 306 and 308, until the data is of a length that would be located in the Pregenerated Table.
- The process maintains the chunk level so that the system knows how many times an input data set has been chunked. Each chunk is then located in the Pregenerated Table at step 310.
- The chunk, its chunk level, and the respective index in the Pregenerated Table are associated, resulting in the squeezed data at step 312.
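For completeness, the reverse direction (expanding squeezed data on a system that holds the same Pregenerated Table) might look like the sketch below. The function name `unsqueeze_chunks` and the `(chunk level, index)` tuple layout are assumptions for illustration; in this simplified form every chunk has the table's width, so the chunk level is carried along but not needed to rebuild the bit string.

```python
def unsqueeze_chunks(pairs: list[tuple[int, int]], table_bits: int = 2) -> str:
    """Rebuild a bit string from (chunk level, index) tuples.

    Assumption: every index refers to a table of all `table_bits`-bit
    patterns in ascending binary order, identical on both systems.
    """
    table = [format(i, f"0{table_bits}b") for i in range(2 ** table_bits)]
    # Look up each index and concatenate the chunks in their stored order.
    return "".join(table[index] for _level, index in pairs)


print(unsqueeze_chunks([(2, 3), (2, 1), (2, 0), (2, 2)]))  # 11010010
```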
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020217025238A KR20210110875A (ko) | 2019-01-10 | 2020-01-08 | Method and system for content agnostic file indexing |
EP20737931.4A EP3908937A4 (en) | 2019-01-10 | 2020-01-08 | Method and system for indexing content agnostic files |
JP2021540318A JP2022518194A (ja) | 2019-01-10 | 2020-01-08 | Method and system for content agnostic file indexing |
CA3126012A CA3126012A1 (en) | 2019-01-10 | 2020-01-08 | Method and system for content agnostic file indexing |
AU2020205970A AU2020205970A1 (en) | 2019-01-10 | 2020-01-08 | Method and system for content agnostic file indexing |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/244,332 US11138152B2 (en) | 2017-10-11 | 2019-01-10 | Method and system for content agnostic file indexing |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020146448A1 (en) | 2020-07-16 |
Family
ID=71520909
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2020/012661 WO2020146448A1 (en) | 2019-01-10 | 2020-01-08 | Method and system for content agnostic file indexing |
Country Status (6)
Country | Link |
---|---|
EP (1) | EP3908937A4 (ko) |
JP (1) | JP2022518194A (ko) |
KR (1) | KR20210110875A (ko) |
AU (1) | AU2020205970A1 (ko) |
CA (1) | CA3126012A1 (ko) |
WO (1) | WO2020146448A1 (ko) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11138152B2 (en) | 2017-10-11 | 2021-10-05 | Lognovations Holdings, Llc | Method and system for content agnostic file indexing |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060244639A1 (en) * | 2003-10-17 | 2006-11-02 | Bruce Parker | Data compression system and method |
US20090319536A1 (en) * | 2006-09-01 | 2009-12-24 | Pacbyte Software Pty Limited | Method and system for transmitting a data file over a data network |
US20110125727A1 (en) * | 2003-09-29 | 2011-05-26 | Shenglong Zou | Content oriented index and search method and system |
US20120166448A1 (en) * | 2010-12-28 | 2012-06-28 | Microsoft Corporation | Adaptive Index for Data Deduplication |
US20150201043A1 (en) * | 2010-08-20 | 2015-07-16 | Abdulrahman Ahmed Sulieman | Methods and systems for encoding/decoding files and transmissions thereof |
US20190146950A1 (en) * | 2017-10-11 | 2019-05-16 | Lognovations Holdings, Llc | Method and System for Content Agnostic File Indexing |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5594435A (en) * | 1995-09-13 | 1997-01-14 | Philosophers' Stone Llc | Permutation-based data compression |
US20050071151A1 (en) * | 2003-09-30 | 2005-03-31 | Ali-Reza Adl-Tabatabai | Compression-decompression mechanism |
-
2020
- 2020-01-08 KR KR1020217025238A patent/KR20210110875A/ko unknown
- 2020-01-08 WO PCT/US2020/012661 patent/WO2020146448A1/en unknown
- 2020-01-08 CA CA3126012A patent/CA3126012A1/en active Pending
- 2020-01-08 EP EP20737931.4A patent/EP3908937A4/en active Pending
- 2020-01-08 JP JP2021540318A patent/JP2022518194A/ja active Pending
- 2020-01-08 AU AU2020205970A patent/AU2020205970A1/en active Pending
Non-Patent Citations (1)
Title |
---|
See also references of EP3908937A4 * |
Also Published As
Publication number | Publication date |
---|---|
KR20210110875A (ko) | 2021-09-09 |
CA3126012A1 (en) | 2020-07-16 |
EP3908937A1 (en) | 2021-11-17 |
JP2022518194A (ja) | 2022-03-14 |
EP3908937A4 (en) | 2022-09-28 |
AU2020205970A1 (en) | 2021-08-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11138152B2 (en) | Method and system for content agnostic file indexing | |
US20220093210A1 (en) | System and method for characterizing biological sequence data through a probabilistic data structure | |
US11899641B2 (en) | Trie-based indices for databases | |
US8554561B2 (en) | Efficient indexing of documents with similar content | |
US20170351737A1 (en) | Methods and systems for autonomous memory searching | |
US10680645B2 (en) | System and method for data storage, transfer, synchronization, and security using codeword probability estimation | |
US10509771B2 (en) | System and method for data storage, transfer, synchronization, and security using recursive encoding | |
US11899624B2 (en) | System and method for random-access manipulation of compacted data files | |
WO2020146448A1 (en) | Method and system for content agnostic file indexing | |
US11544225B2 (en) | Method and system for content agnostic file indexing | |
Lou et al. | Data deduplication with random substitutions | |
US11995060B2 (en) | Hashing a data set with multiple hash engines | |
US11397707B2 (en) | System and method for computer data type identification | |
US20220245104A1 (en) | Hashing for deduplication through skipping selected data |
Legal Events
Code | Title | Description |
---|---|---|
121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 20737931; Country of ref document: EP; Kind code of ref document: A1 |
ENP | Entry into the national phase | Ref document number: 3126012; Country of ref document: CA |
ENP | Entry into the national phase | Ref document number: 2021540318; Country of ref document: JP; Kind code of ref document: A |
NENP | Non-entry into the national phase | Ref country code: DE |
ENP | Entry into the national phase | Ref document number: 2020205970; Country of ref document: AU; Date of ref document: 20200108; Kind code of ref document: A |
ENP | Entry into the national phase | Ref document number: 20217025238; Country of ref document: KR; Kind code of ref document: A |
ENP | Entry into the national phase | Ref document number: 2020737931; Country of ref document: EP; Effective date: 20210810 |