WO2018121430A1 - File storage and indexing method, apparatus, media, device and method for reading files

Info

Publication number
WO2018121430A1
WO2018121430A1 PCT/CN2017/117967 CN2017117967W WO2018121430A1 WO 2018121430 A1 WO2018121430 A1 WO 2018121430A1 CN 2017117967 W CN2017117967 W CN 2017117967W WO 2018121430 A1 WO2018121430 A1 WO 2018121430A1
Authority
WO
WIPO (PCT)
Prior art keywords
file
key value
index
offset
bytes
Prior art date
Application number
PCT/CN2017/117967
Other languages
French (fr)
Chinese (zh)
Inventor
陈闯
张炎泼
Original Assignee
贵州白山云科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 贵州白山云科技有限公司
Publication of WO2018121430A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 - File systems; File servers
    • G06F16/13 - File access structures, e.g. distributed indices

Definitions

  • Embodiments of the present invention relate to, but are not limited to, the field of file storage and indexing, and in particular to a file storage and indexing method, apparatus, medium, and device, and a method for reading a file.
  • Some well-known Internet companies have proposed solutions for massive numbers of small files.
  • For example, the social networking site Facebook, which stores more than 60 billion images, launched the Haystack system to provide storage customized and optimized for massive numbers of small images.
  • Other small-file schemes include Taobao's TFS.
  • The core idea of these systems is to append small files to a data file and, at the same time, to generate an index file through which each small file is located.
  • Haystack's data file: each small file is encapsulated into a file needle that contains the file's key value, size, data, and other information. All small files are appended to the data file in the order in which they were written.
  • Haystack's index file: the index file stores the key value of each file needle, together with the needle's offset, size, and other information in the data file. The program loads the index into memory at startup and locates the offset and size in the data file by looking up the in-memory index.
  • Read requests use the index: the index file is loaded into memory, and the index is searched to find the offset and size of the file to be read, after which the data is read.
  • A characteristic of Facebook's Haystack is that the full key value of every file is loaded into memory for file location.
  • When the machine has enough memory, Facebook's full 8-byte key values can all be loaded into memory, but in practice there are two problems:
  • storage server memory is not very large, generally 32 GB to 64 GB;
  • the size of the key value of a small file is difficult to control.
  • Generally, the MD5 or SHA1 of the file content is chosen as the key value of the file.
  • Suppose a storage server has twelve 4 TB disks and about 32 GB of memory.
  • The server needs to store about 1 billion files, such as avatars and thumbnails, of about 4 KB each.
  • The key value of a file is its MD5; with the offset and size fields added, the index information for one small file occupies 28 bytes on average. In this case the index occupies nearly 30 GB of memory while the data occupies only 4 TB of disk. Memory consumption is nearly 100%, and disk consumption is only 8%.
  • The indexing scheme adopted by the Haystack system therefore consumes a large amount of memory, and memory limits the utilization of disk resources. Obtaining higher disk utilization requires a large additional investment in memory.
  • The embodiments of the present invention provide a file storage and indexing method, apparatus, medium, and device, and a method for reading a file, so as to at least solve the problem that the indexing scheme adopted by the Haystack system consumes a large amount of memory resources.
  • an index in the index file uses the first N bytes of the actual key value of each file as its key value, and each index points to one or more files in the data file;
  • the offset corresponding to the key value is the offset of the first file among the one or more files pointed to by the key value;
  • the size corresponding to the key value is the size of the first file among the one or more files pointed to by the key value;
  • N is a positive integer.
  • the offset and size fields in the index file are aligned to 512 bytes.
  • generating the index file for indexing each file in the data file further includes:
  • storing the indexes of the index file hierarchically according to a key value prefix, wherein the key value of an index stored in the layer corresponding to a key value prefix is the short key value obtained by truncating that prefix, and the byte length of the key value prefix is less than N.
  • the offset of an index in the index file is an intra-layer offset, that is, the offset within the address range of the layer the index belongs to, and the number of bytes of the intra-layer offset is determined by the address space of the largest layer.
  • the method further includes mapping all of the files in the data file into a Bloom filter, so that when a file in the data file is to be read, the Bloom filter can be searched quickly to determine whether the file may exist.
  • A computer-readable storage medium provided by an embodiment of the present invention stores a computer program; when the program is executed by a processor, the steps of the foregoing method are implemented.
  • A computer device provided by an embodiment of the present invention includes a memory, a processor, and a computer program stored in the memory and runnable on the processor; the processor implements the steps of the foregoing method when it executes the program.
  • a data file storage module configured to store a data file, wherein the data file is obtained by storing each file in alphabetical order of the files' actual key values;
  • an index file generating module configured to generate an index file for indexing each file in the data file, wherein an index in the index file uses the first N bytes of the actual key value of each file as its key value, each index points to one or more files in the data file, the offset corresponding to the key value is the offset of the first file among the one or more files pointed to by the key value, the size corresponding to the key value is the size of the first file among those files, and N is a positive integer.
  • the above device also has the following features:
  • the index file generating module is further configured to store the indexes of the index file hierarchically according to a key value prefix, wherein the key value of an index stored in the layer corresponding to a key value prefix is the short key value obtained by truncating that prefix, and the byte length of the key value prefix is less than N.
  • the above device also has the following features:
  • the offset of an index in the index file is an intra-layer offset, that is, the offset within the address range of the layer the index belongs to, and the number of bytes of the intra-layer offset is determined by the address space of the largest layer.
  • the above device also has the following features:
  • the device also includes:
  • a mapping module configured to map all the files in the data file into a Bloom filter, so that when a file in the data file is to be read, the Bloom filter is searched to determine whether the file may exist.
  • the method for reading a file in a file storage and indexing device includes:
  • reading the file when a file whose key value is consistent with the actual key value is matched.
  • querying the index file, according to the first N bytes of the actual key value of the file to be read, for the index corresponding to those N bytes includes:
  • In the embodiments, each file is stored in alphabetical order of the files' actual key values to obtain a data file, and an index file for indexing each file in the data file is generated. An index in the index file uses the first N bytes of the actual key value of each file as its key value, and each index points to one or more files in the data file. The offset corresponding to the key value is the offset of the first file among the one or more files pointed to by the key value, and the size corresponding to the key value is the size of that first file. This solves the problem that the indexing scheme adopted by the Haystack system consumes a large amount of memory resources, and reduces the memory consumption of the indexing system.
  • FIG. 1 is a flow chart of a file storage and indexing method in accordance with an embodiment of the present invention
  • FIG. 2 is a structural block diagram of a file storage and indexing apparatus according to an embodiment of the present invention
  • FIG. 3 is a flowchart of a method of reading a file in a file storage and indexing device according to an embodiment of the present invention
  • FIG. 4 is a schematic diagram of a file storage and index structure in accordance with a preferred embodiment of the present invention.
  • FIG. 5 is a flow chart of a method of reading a file in accordance with a preferred embodiment of the present invention.
  • FIG. 6, FIG. 7, and FIG. 8 are schematic diagrams of index hierarchy according to a preferred embodiment of the present invention.
  • FIGS. 9 and 10 are diagrams showing a comparison of memory consumption of an indexing scheme in accordance with a preferred embodiment of the present invention.
  • FIG. 1 is a flowchart of a file storage and indexing method according to an embodiment of the present invention. As shown in Figure 1, the process includes the following steps:
  • Step S101: store each file in alphabetical order of the files' actual key values to obtain a data file;
  • Step S102: generate an index file for indexing each file in the data file, wherein an index in the index file uses the first N bytes of the actual key value of each file as its key value, each index points to one or more files in the data file, the offset corresponding to the key value is the offset of the first file among the one or more files pointed to by the key value, the size corresponding to the key value is the size of that first file, and N is a positive integer.
  • Because an index keeps only the first N bytes of the actual key value, the size of the index file is reduced. At the same time, such an index no longer points to a single file but to the one or more files whose actual key values share the same first N bytes. So that a file can still be located from the offset in the index, files are written to the data file in alphabetical order of their actual key values.
  • In this way, the one or more files whose actual key values share the same first N bytes are stored contiguously, and a single offset indicates where they are stored.
  • Compared with the Haystack system of the related art, this scheme occupies less memory, which solves the problem that the indexing scheme adopted by the Haystack system consumes a large amount of memory resources and reduces the memory consumption of the indexing system.
  • With this scheme, an index no longer leads directly to one file but to a contiguous set of files. When a particular file must be read, the files in the set are matched one by one against the actual key value of that file until the desired file is found.
  • For example, with 512-byte alignment a file size of 1024 bytes is expressed as 2 units of 512 bytes; where 1024 previously had to be stored, only 2 now needs to be stored, which saves at least one byte;
  • the number of bytes required for the offset and size fields can be calculated from the actual size of the entire data file, further reducing the number of bytes occupied by the index.
  • The key values stored in the index file may still share common key value prefixes. Therefore, the indexes in the index file can also be stored in layers according to the key value prefix, where the key value of an index stored in the layer corresponding to a key value prefix is the short key value obtained by truncating that prefix, and the byte length of the key value prefix is less than N. The more indexes a layer contains, the fewer bytes the layered index file occupies relative to the original index file.
  • The offset of the index within each layer can be further optimized to reduce the number of bytes.
  • The offset of an index in the index file is an intra-layer offset, that is, the offset within the address range of the layer the index belongs to, and the number of bytes of the intra-layer offset is determined by the address space of the largest layer. Since the address space of the largest layer must be smaller than the size of the entire data file, the intra-layer offset occupies fewer bytes than the original offset, which ranges over the entire data file.
  • The Bloom filter is a binary vector data structure that has good space and time efficiency and is used to test whether an element is a member of a set. If the test result is yes, the element is not necessarily in the set; but if the test result is no, the element is definitely not in the set.
  • The advantage of the Bloom filter is that its insertion and query times are constant, and because it does not store the elements themselves, it also offers good security.
  • All files in the data file are also mapped into the Bloom filter, so that when a file in the data file is to be read, a quick lookup in the Bloom filter determines whether the file may exist.
  • the value of N is preferably 4.
  • a software product stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and including a number of instructions for causing a terminal device (which may be a cell phone, a computer, a server, a network device, or the like) to perform the methods described in the various embodiments of the present invention.
  • a file storage and indexing device is also provided, which is used to implement the above embodiments and preferred embodiments; what has already been described is not repeated.
  • As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function.
  • Although the apparatus described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
  • FIG. 2 is a structural block diagram of a file storage and indexing apparatus according to an embodiment of the present invention. As shown in FIG. 2, the apparatus includes: a data file storage module 21 and an index file generating module 22, wherein
  • a data file storage module 21 configured to store data files, wherein the data files are obtained by storing the files in alphabetical order according to actual key values of the files;
  • the index file generating module 22 is coupled to the data file storage module 21 and is configured to generate an index file for indexing each file in the data file, wherein an index in the index file uses the first N bytes of the actual key value of each file as its key value, each index points to one or more files in the data file, the offset corresponding to the key value is the offset of the first file among the one or more files pointed to by the key value, the size corresponding to the key value is the size of that first file, and N is a positive integer.
  • the index file generating module is further configured to store the indexes of the index file hierarchically according to a key value prefix, wherein the key value of an index stored in the layer corresponding to a key value prefix is the short key value obtained by truncating that prefix, and the byte length of the key value prefix is less than N.
  • the offset of an index in the index file is the intra-layer offset within the address range of the layer the index belongs to, and the number of bytes of the intra-layer offset is determined by the address space of the largest layer.
  • the file storage and indexing device further includes a mapping module configured to map all the files in the data file into a Bloom filter, so that when a file in the data file is to be read, the Bloom filter is searched to determine whether the file may exist.
  • each of the above modules may be implemented by software or hardware.
  • For the latter, this may be achieved in, but is not limited to, the following ways: the modules are all located in the same processor, or the modules are located in multiple processors.
  • the value of N is preferably 4.
  • FIG. 3 is a flowchart of a method of reading a file in a file storage and indexing device according to an embodiment of the present invention. As shown in FIG. 3, the process includes the following steps:
  • Step S301: query the index file, according to the first N bytes of the actual key value of the file to be read, for the index corresponding to those N bytes;
  • Step S302: according to the actual key value, match the file among the one or more files pointed to by the index corresponding to the first N bytes of the actual key value;
  • Step S303: when a file whose key value is consistent with the actual key value is matched, read the file.
  • In step S301, before the index is queried, the Bloom filter can be used to determine whether the file to be read may exist. If it may exist, the index file is queried, according to the first N bytes of the actual key value of the file to be read, for the index corresponding to those N bytes; otherwise, reading of the file is terminated.
  • the value of N is preferably 4.
  • FIG. 4 is a schematic diagram of a file storage and index structure according to a preferred embodiment of the present invention. As shown in FIG. 4, the layered index file is stored in memory, with indexes that share the same key value prefix grouped into one layer. The index file is used to locate small files. The data file is stored on disk, and each file needle is one small file.
  • FIG. 5 is a flowchart of a method of reading a file according to a preferred embodiment of the present invention. It shows the detailed flow of locating a small file by matching the index prefix, then reading the full key value to check whether the key values match and, if they do not, continuing with the next file needle.
  • the file storage and indexing scheme provided by the preferred embodiment includes the following steps: Step 1: prefix compression optimization, which reduces the space occupied by the key value, the offset, and the size;
  • the index file stores only the first four bytes of the key value, not the full key value;
  • the offset and size fields in the index file are aligned to 512 bytes, saving 1 byte, and the number of bytes used for the offset and size is calculated from the actual size of the entire data file.
  • Step 2: store the file needles in order so that small files can be located; the file needles in the data file are stored in alphabetical order of their key values.
  • Because the index file keeps only the first four bytes of each key value, if several small files share the same first four bytes and their file needles are not stored in order, the scattered needles cannot all be found from a single offset. For example, suppose the key value of the file the user reads is 0xabcdefacee. Since the index file keeps only the first four bytes, only the prefix 0xabcdefac can be matched, and the offset to read from cannot be determined.
  • Storing the file needles in order solves this problem. For example, suppose the key value of the file the user reads is 0xabcdefacbb; the prefix is 0xabcdefac, and the offset points to the file needle 0xabcdefacaa, so the first match misses and the search continues with the next file needle.
  • Step 3: index layering optimization
  • the indexes that share the same key value prefix can be grouped into one layer.
  • the layering principle is to keep the number of files in each layer at about 64 where possible, and the layering level is chosen according to the number of file needles to be stored in a layer.
  • the layering level can be chosen as needed; an example of layering levels is given below:
  • Level 1: use the first byte of the file needle key value for layering
  • Level 2: use the first two bytes of the file needle key value for layering
  • the number of bytes of the key value prefix used for layering is less than the byte length of the key in the index.
  • before optimization, the offset ranges over the address space of the entire data file.
  • after layering, the layer stores its own offset in the entire data file, and each index under the layer needs only its offset within the layer; the number of bytes for that offset can be calculated from the address space of the largest layer.
  • accesses to files that do not exist are also avoided by means of the Bloom filter.
  • Existing files are mapped into an in-memory Bloom filter, so existence can be checked with a quick lookup alone.
  • the time complexity is O(k), where k is the number of bits required for an element.
  • the false positive rate is 1%. If k is increased by 4.8, the false positive rate is reduced to 0.1%.
  • the horizontal axis represents the number of files
  • the vertical axis represents the memory size required for the index file
  • the short dashed line represents the memory consumption of the conventional Haystack
  • the long dashed line represents the memory consumption after the prefix compression by the embodiment of the present invention.
  • the horizontal axis represents the number of files
  • the vertical axis represents the memory size required for the index file
  • the short dashed line represents the memory consumption of the conventional Haystack
  • the long dashed line represents the memory consumption after the prefix compression by the embodiment of the present invention
  • the solid line represents the memory consumption after prefix compression and index layering by the embodiment of the present invention.
  • the memory consumption of more than 9 GB before optimization is further reduced to just over 4 GB, saving about half of the memory.
  • the overall performance for small files is significantly improved: the number of requests per second (RPS) more than doubled, and the input/output (IO) utilization of the machine nearly doubled.
  • the minimum memory unit is optimized, and fragmentation is reduced by 80%.
  • This embodiment provides a storage medium.
  • the above storage medium may be configured to store program code for performing the following steps:
  • Step S101: store each file in alphabetical order of the files' actual key values to obtain a data file;
  • Step S102: generate an index file for indexing each file in the data file, wherein an index in the index file uses the first N bytes of the actual key value of each file as its key value, each index points to one or more files in the data file, the offset corresponding to the key value is the offset of the first file among the one or more files pointed to by the key value, the size corresponding to the key value is the size of that first file, and N is a positive integer.
  • the foregoing storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium that can store program code.
  • Embodiments of the present invention also provide a storage medium.
  • the above storage medium may be configured to store program code for performing the following steps:
  • Step S301 querying an index corresponding to the first N bytes of the actual key value in the index file according to the first N bytes of the actual key value of the file to be read;
  • Step S302 according to the actual key value, matching the file in one or more files pointed to by the index corresponding to the first N bytes of the actual key value;
  • Step S303: when a file whose key value is consistent with the actual key value is matched, read the file.
  • Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storing information, such as computer-readable instructions, data structures, program modules, or other data.
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical disc storage, magnetic cartridges, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • Communication media typically contain computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and can include any information delivery media.
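
The Bloom filter pre-check that this section places before the index lookup in step S301 can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration: the class name `BloomFilter`, the use of SHA-256 with double hashing to derive bit positions, and the sizes 1024 bits and 7 hash functions. The source does not say how its filter is built.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter; parameters and hashing scheme are illustrative assumptions."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Derive num_hashes bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key: bytes) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


stored_keys = [bytes.fromhex(h) for h in ("abcdefacaa", "abcdefacbb", "abcdefacee")]
bloom = BloomFilter(num_bits=1024, num_hashes=7)
for key in stored_keys:
    bloom.add(key)

# A read would consult the filter first and stop early on a definite miss.
for key in stored_keys:
    print(key.hex(), bloom.might_contain(key))
```

A `False` answer lets the reader skip both the index lookup and the disk access; a `True` answer still has to be confirmed by matching the full key value, because false positives are possible.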

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Provided herein are a file storage and indexing method, apparatus, media, device and a method for reading files, wherein said file storage and indexing method comprises: storing each file according to the alphabetical order of actual key values of files, and obtaining a data file; generating an index file which is used for indexing each file in the data file, wherein an index in the index file uses first N bytes of an actual key value of each file as a key value, each index pointing to one or more files in the data file, while an offset value corresponding to the key value is an offset value of the first file in one or more files to which the key value points, and a size value corresponding to the key value is a size value of the first file of one or more files to which the key value points. Solved herein is the problem wherein memory resource consumption of an indexing solution used by a Haystack system is large, thereby reducing consumption of memory resources by an indexing system.
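
As an illustration of the method summarized in the abstract, the following Python sketch builds a prefix index over files stored in key order and reads a file back by scanning its prefix group. It is a sketch under stated assumptions, not the patent's implementation: the on-disk record layout (fixed-length key, 4-byte big-endian size, payload), the constant `KEY_LEN`, and the function names `build` and `read` are all hypothetical. The sample keys reuse the 0xabcdefac prefix that appears in the description.

```python
import struct

N = 4          # prefix length; the description calls 4 the preferred value
KEY_LEN = 5    # fixed full-key length, an assumption of this sketch


def build(files):
    """files: dict mapping a full key (KEY_LEN bytes) to its payload bytes."""
    data = bytearray()
    index = {}
    for key in sorted(files):                 # byte-wise ("alphabetical") key order
        payload = files[key]
        offset = len(data)
        data += key + struct.pack(">I", len(payload)) + payload
        # Only the first file of each prefix group gets an index entry.
        index.setdefault(key[:N], (offset, len(payload)))
    return bytes(data), index


def read(data, index, key):
    entry = index.get(key[:N])
    if entry is None:
        return None
    offset, _first_size = entry               # size of the first file in the group
    header = KEY_LEN + 4
    while offset + header <= len(data):
        stored_key = data[offset:offset + KEY_LEN]
        (size,) = struct.unpack(">I", data[offset + KEY_LEN:offset + header])
        if stored_key[:N] != key[:N]:
            return None                       # walked past the prefix group
        if stored_key == key:
            start = offset + header
            return data[start:start + size]
        offset += header + size               # try the next file in the group
    return None


files = {
    bytes.fromhex("abcdefacee"): b"E",
    bytes.fromhex("abcdefacaa"): b"A",
    bytes.fromhex("abcdefacbb"): b"B",
    bytes.fromhex("0102030405"): b"Z",
}
data, index = build(files)
print(len(index))                                      # 2 index entries for 4 files
print(read(data, index, bytes.fromhex("abcdefacbb")))  # b'B'
```

The index holds one entry per distinct prefix rather than one per file, which is where the memory saving comes from; the cost is a short sequential scan inside the prefix group on each read.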

Description

File storage and indexing method, apparatus, medium, device, and method for reading a file
[Corrected under Rule 26, 19.01.2018]
This application claims priority to Chinese Patent Application No. 201611221215.1, entitled "File Storage and Indexing Method, Apparatus, and Method for Reading Files", filed with the Chinese Patent Office on December 26, 2016, the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present invention relate to, but are not limited to, the field of file storage and indexing, and in particular to a file storage and indexing method, apparatus, medium, and device, and a method for reading a file.
Background
Internet data is growing explosively. Applications such as social networks, mobile communications, online video, and e-commerce often generate massive numbers of small files, on the order of hundreds of millions, billions, or even tens of billions. Because of the great challenges in metadata management, access performance, and storage efficiency, the massive-small-file problem has become a recognized difficulty in the industry.
Some well-known Internet companies have proposed solutions for massive numbers of small files. For example, the social networking site Facebook, which stores more than 60 billion images, launched the Haystack system to provide storage customized and optimized for massive numbers of small images. Other small-file schemes include Taobao's TFS. The core idea of these systems is to append small files to a data file and, at the same time, to generate an index file through which each small file is located.
The Haystack solution adopted by Facebook is described below.
Haystack handles small files by combining them: the data of the small files is appended in turn to a data file and an index file is generated. The index is used to look up a small file's offset and size in the data file, and the file is then read.
(1) Haystack's data file: each small file is encapsulated into a file needle that contains the file's key value, size, data, and other information. All small files are appended to the data file in the order in which they were written.
(2) Haystack's index file: the index file stores the key value of each file needle, together with the needle's offset, size, and other information in the data file. The program loads the index into memory at startup and locates the offset and size in the data file by looking up the in-memory index.
(3) Read requests use the index: the index file is loaded into memory, the index is searched to find the offset and size of the file to be read, and the data is read out.
(4) Write requests use the index: each write adds one file; the file's data is appended at the end of the data file as file needle n, and an index is generated and added as the index record for file needle n.
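
The append-and-index layout described in items (1) to (4) can be sketched in a few lines of Python. This is a toy model, not Facebook's actual format: the class name `FullKeyStore`, the in-memory `bytearray` standing in for the on-disk data file, and the sample keys are assumptions made for illustration.

```python
class FullKeyStore:
    """Toy append-only store with one full-key index entry per file."""

    def __init__(self):
        self.data = bytearray()   # stands in for the on-disk data file
        self.index = {}           # full key -> (offset, size), held in memory

    def write(self, key: bytes, payload: bytes) -> None:
        offset = len(self.data)
        self.data += payload      # append in write order
        self.index[key] = (offset, len(payload))

    def read(self, key: bytes):
        entry = self.index.get(key)
        if entry is None:
            return None
        offset, size = entry
        return bytes(self.data[offset:offset + size])


store = FullKeyStore()
store.write(b"key-1", b"avatar bytes")
store.write(b"key-2", b"thumbnail bytes")
print(store.read(b"key-2"))       # b'thumbnail bytes'
```

Every stored file costs one in-memory entry holding its full key, so the index grows linearly with the number of files regardless of how small they are.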
As the description above shows, a characteristic of Facebook's Haystack is that the full key value of every file is loaded into memory for file location. When the machine has enough memory, Facebook's full 8-byte key values can all be loaded into memory, but in practice there are two problems:
(1) storage server memory is not very large, generally 32 GB to 64 GB;
(2) the size of the key value of a small file is difficult to control; generally the MD5 or SHA1 of the file content is chosen as the key value of the file.
Suppose a storage server has twelve 4 TB disks and about 32 GB of memory, and needs to store about 1 billion files, such as avatars and thumbnails, of about 4 KB each. The key value of a file is its MD5; with the offset and size fields added, the index information for one small file occupies 28 bytes on average. In this case the index occupies nearly 30 GB of memory while the data occupies only 4 TB of disk. Memory consumption is nearly 100%, and disk consumption is only 8%.
由此可见,Haystack系统采用的索引方案对内存资源消耗巨大,并且内存资源限制了磁盘资源的利用率,因此,想要获得更大的磁盘资源的利用率需要额外增加内存资源的大量投入。It can be seen that the indexing scheme adopted by the Haystack system consumes a large amount of memory resources, and the memory resources limit the utilization of disk resources. Therefore, in order to obtain a larger utilization of disk resources, an excessive increase in memory resources is required.
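The figures in the example above can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 GB = 10^9 bytes) and takes the 28-byte entry size from the text; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the storage-server example.
NUM_FILES = 10**9                 # about 1 billion small files
FILE_BYTES = 4 * 10**3            # about 4 KB per file (decimal units assumed)
ENTRY_BYTES = 28                  # 16-byte MD5 key value plus offset and size fields
MEMORY_BYTES = 32 * 10**9         # about 32 GB of memory
DISK_BYTES = 12 * 4 * 10**12      # twelve 4 TB disks

index_bytes = NUM_FILES * ENTRY_BYTES   # memory needed by the full-key index
data_bytes = NUM_FILES * FILE_BYTES     # disk needed by the files themselves
memory_used = index_bytes / MEMORY_BYTES
disk_used = data_bytes / DISK_BYTES

print(f"index: {index_bytes / 10**9:.0f} GB ({memory_used:.0%} of memory)")
print(f"data:  {data_bytes / 10**12:.0f} TB ({disk_used:.0%} of disk)")
```

Under these assumptions the index alone takes most of the 32 GB of memory while the disks stay more than 90% empty, which is the imbalance the text describes.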
Summary of the Invention
The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of protection of the claims.
Embodiments of the present invention provide a file storage and indexing method, apparatus, medium, and device, and a method for reading a file, to at least solve the problem that the indexing scheme used by the Haystack system consumes a large amount of memory.
The file storage and indexing method provided by an embodiment of the present invention includes:
storing the files in alphabetical order of their actual key values to obtain a data file;
generating an index file for indexing the files in the data file, wherein each index in the index file uses the first N bytes of the actual key value of a file as its key value, each index points to one or more files in the data file, the offset corresponding to the key value is the offset of the first file among the one or more files to which the key value points, the size value corresponding to the key value is the size value of the first file among the one or more files to which the key value points, and N is a positive integer.
The above method also has the following feature:
the offset field and the size field in the index file are aligned to 512 bytes.
The above method also has the following feature:
generating the index file for indexing the files in the data file further includes:
storing the indexes of the index file in layers according to key value prefix, wherein the key value of an index stored in the layer corresponding to a key value prefix is a short key value with that key value prefix truncated, and the byte length of the key value prefix is less than N.
The above method also has the following feature:
the offset of an index in the index file is an intra-layer offset whose offset range is the layer in which the index is located, and the number of bytes of the intra-layer offset is determined according to the largest layer address space among the layers.
The above method also has the following feature:
the method further includes: mapping all files in the data file into a Bloom filter, so that when a file in the data file is to be read, a quick search of the Bloom filter determines whether the file to be read may exist.
A computer-readable storage medium provided by an embodiment of the present invention stores a computer program which, when executed by a processor, implements the steps of the above method.
A computer device provided by an embodiment of the present invention includes a memory, a processor, and a computer program stored in the memory and executable on the processor; the processor implements the steps of the above method when executing the program.
The file storage and indexing apparatus provided by an embodiment of the present invention includes:
a data file storage module, configured to store a data file, wherein the data file is obtained by storing the files in alphabetical order of their actual key values;
an index file generation module, configured to generate an index file for indexing the files in the data file, wherein each index in the index file uses the first N bytes of the actual key value of a file as its key value, each index points to one or more files in the data file, the offset corresponding to the key value is the offset of the first file among the one or more files to which the key value points, the size value corresponding to the key value is the size value of the first file among the one or more files to which the key value points, and N is a positive integer.
The above apparatus also has the following feature:
the index file generation module is further configured to store the indexes of the index file in layers according to key value prefix, wherein the key value of an index stored in the layer corresponding to a key value prefix is a short key value with that key value prefix truncated, and the byte length of the key value prefix is less than N.
The above apparatus also has the following feature:
the offset of an index in the index file is an intra-layer offset whose offset range is the layer in which the index is located, and the number of bytes of the intra-layer offset is determined according to the largest layer address space among the layers.
The above apparatus also has the following feature:
the apparatus further includes:
a mapping module, configured to map all files in the data file into a Bloom filter, so that when a file in the data file is to be read, a search of the Bloom filter determines whether the file to be read may exist.
The method provided by the present invention for reading a file in the file storage and indexing apparatus includes:
querying the index file, according to the first N bytes of the actual key value of the file to be read, for the index corresponding to those first N bytes;
matching files, according to the actual key value, among the one or more files to which the index corresponding to the first N bytes of the actual key value points;
reading the file when a file whose key value is identical to the actual key value is matched.
The above method also has the following feature:
querying the index file, according to the first N bytes of the actual key value of the file to be read, for the index corresponding to those first N bytes includes:
determining, according to the Bloom filter, whether the file to be read may exist; if the result is that it may exist, querying the index file, according to the first N bytes of the actual key value of the file to be read, for the index corresponding to those first N bytes; otherwise, terminating the read.
According to embodiments of the present invention, the files are stored in alphabetical order of their actual key values to obtain a data file, and an index file for indexing the files in the data file is generated, wherein each index in the index file uses the first N bytes of the actual key value of a file as its key value, each index points to one or more files in the data file, the offset corresponding to the key value is the offset of the first file among the one or more files to which the key value points, and the size value corresponding to the key value is the size value of the first file among the one or more files to which the key value points. This solves the problem that the indexing scheme used by the Haystack system consumes a large amount of memory and reduces the memory consumption of the indexing system.
Brief Description of the Drawings
The drawings described here provide a further understanding of the embodiments of the present invention and form a part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the embodiments of the present invention and do not unduly limit them. In the drawings:
FIG. 1 is a flowchart of a file storage and indexing method according to an embodiment of the present invention;
FIG. 2 is a structural block diagram of a file storage and indexing apparatus according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for reading a file in the file storage and indexing apparatus according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a file storage and index structure according to a preferred embodiment of the present invention;
FIG. 5 is a flowchart of a method for reading a file according to a preferred embodiment of the present invention;
FIG. 6, FIG. 7, and FIG. 8 are schematic diagrams of index layering according to a preferred embodiment of the present invention;
FIG. 9 and FIG. 10 are schematic diagrams comparing the memory consumption of the indexing scheme of a preferred embodiment of the present invention with that of the related art.
Detailed Description
The embodiments of the present invention are further described below with reference to the drawings and specific implementations.
Embodiment 1
This embodiment provides a file storage and indexing method. FIG. 1 is a flowchart of the file storage and indexing method according to an embodiment of the present invention. As shown in FIG. 1, the procedure includes the following steps:
Step S101: store the files in alphabetical order of their actual key values to obtain a data file;
Step S102: generate an index file for indexing the files in the data file, wherein each index in the index file uses the first N bytes of the actual key value of a file as its key value, each index points to one or more files in the data file, the offset value corresponding to the key value is the offset value of the first file among the one or more files to which the key value points, the size value corresponding to the key value is the size value of the first file among the one or more files to which the key value points, and N is a positive integer.
In the above steps, the index no longer stores the actual key value of a file but only its first N bytes, which reduces the size of the index file. At the same time, such an index no longer points to a single file; it points to the one or more files whose actual key values share the same first N bytes. So that a file can still be located from the offset in the index, the files are stored in the data file in alphabetical order of their actual key values. The one or more files whose actual key values share the same first N bytes are therefore stored together in one contiguous region, and a single offset suffices to indicate their storage location. Consequently, once the index file generated in step S102 is loaded into memory, it occupies less memory than the index of the related-art Haystack system, which solves the problem that the Haystack indexing scheme consumes a large amount of memory and reduces the memory consumption of the indexing system.
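Steps S101 and S102 can be illustrated with a minimal sketch. It assumes that each file is described only by its actual key value and its size, that files are laid out back to back, and that "alphabetical order" means byte-wise key order; the function name `build_index` and the tuple layout are hypothetical and not taken from this document.

```python
N = 4  # number of leading key bytes kept in the index (the preferred value)

def build_index(files):
    """files: iterable of (actual_key: bytes, size: int) pairs.

    Returns (layout, index). layout lists (key, offset, size) in storage order;
    index maps each N-byte prefix to the (offset, size) of the first file
    stored under that prefix.
    """
    layout, index, offset = [], {}, 0
    for key, size in sorted(files):            # S101: store in key order
        layout.append((key, offset, size))
        # S102: one index per distinct prefix, pointing at the first file only
        index.setdefault(key[:N], (offset, size))
        offset += size
    return layout, index

files = [
    (bytes.fromhex("abcdefacbb"), 10),
    (bytes.fromhex("abcdefacaa"), 20),
    (bytes.fromhex("abcdefadcc"), 5),
]
layout, index = build_index(files)
```

Three files produce only two indexes here, because the first two key values share their first four bytes and are stored next to each other.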
When the index file generated in step S102 is used to look up a file, the index no longer leads directly to one specific file but to a contiguous set of files. To read one particular file exactly, the files in that set are matched one by one against the file's actual key value until the desired file is found.
The offset field and the size field in the index file are aligned to 512 bytes. For example, a file of 1024 bytes aligned to 512 bytes gives 1024/512 = 2, so the file size can be represented by 2; when the size 2 is read from the index, multiplying it by 512 bytes gives the file size of 1024 bytes. Where 1024 previously had to be stored, only the number 2 now needs to be stored, which saves at least one byte. In addition, the number of bytes needed for the offset field and the size field can be calculated from the actual size of the whole data file, which further reduces the number of bytes occupied by the index.
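The effect of 512-byte alignment on field width can be sketched as follows. The helper names `to_blocks` and `field_width` are hypothetical, and rounding sizes up to whole blocks assumes that files are padded to 512-byte boundaries in the data file, which this document does not state explicitly.

```python
BLOCK = 512  # alignment unit in bytes

def to_blocks(num_bytes):
    """Express a byte count in 512-byte blocks, rounding up."""
    return (num_bytes + BLOCK - 1) // BLOCK

def field_width(max_value):
    """Smallest number of whole bytes that can hold max_value."""
    return max(1, (max_value.bit_length() + 7) // 8)

# The 1024-byte file from the text: 2 bytes to store 1024, 1 byte to store 2.
size_raw = field_width(1024)
size_aligned = field_width(to_blocks(1024))

# An offset into a 4 TB data file (decimal units assumed).
offset_raw = field_width(4 * 10**12)
offset_aligned = field_width(to_blocks(4 * 10**12))
```

Under these assumptions the size field of the 1024-byte file shrinks from 2 bytes to 1, and an offset field covering a 4 TB data file shrinks from 6 bytes to 5.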
To further reduce the number of bytes occupied by the index, note that the key values stored in the index file may still share repeated key value prefixes. The indexes in the index file can therefore be stored in layers according to key value prefix, wherein the key value of an index stored in the layer corresponding to a key value prefix is a short key value with that prefix truncated, and the byte length of the key value prefix is less than N. The more indexes a layer contains, the smaller the layered index file becomes relative to the original index file.
Once the index file is stored in layers, the offset of each index within a layer can be optimized further to reduce the number of bytes. Optionally, the offset of an index in the index file is an intra-layer offset whose offset range is the layer in which the index is located, and the number of bytes of the intra-layer offset is determined according to the largest layer address space among the layers. Because the largest layer address space is necessarily smaller than the whole data file, an intra-layer offset also occupies fewer bytes than an original offset whose range is the whole data file.
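Layering and intra-layer offsets can be sketched together. The example assumes a flat index that maps 4-byte key value prefixes to offsets, layered on the first byte; `layer_index`, `field_width`, and the dictionary layout are hypothetical, and the width of the intra-layer offset is estimated here from the largest intra-layer offset rather than from the full layer address space.

```python
def field_width(max_value):
    """Smallest number of whole bytes that can hold max_value."""
    return max(1, (max_value.bit_length() + 7) // 8)

def layer_index(index, prefix_len=1):
    """Group a flat {key: offset} index into layers keyed by key[:prefix_len].

    Each layer stores its prefix once, a base offset into the data file, and
    entries that map the shortened key to an offset relative to that base.
    """
    layers = {}
    for key in sorted(index):                  # key order equals storage order
        head, tail = key[:prefix_len], key[prefix_len:]
        layer = layers.setdefault(head, {"base": index[key], "entries": {}})
        layer["entries"][tail] = index[key] - layer["base"]
    return layers

index = {
    bytes.fromhex("ab000001"): 0,
    bytes.fromhex("ab000002"): 300,
    bytes.fromhex("cd000001"): 70_000,
    bytes.fromhex("cd000009"): 70_200,
}
layers = layer_index(index)
largest = max(max(layer["entries"].values()) for layer in layers.values())
global_width = field_width(max(index.values()))   # offset over the whole file
layer_width = field_width(largest)                 # offset within one layer
```

In this toy index each entry keeps a 3-byte key value instead of 4 bytes, and the offset field shrinks from 3 bytes to 2, at the cost of storing one prefix and one base offset per layer.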
A Bloom filter is a binary vector data structure with good space and time efficiency that is used to test whether an element is a member of a set. If the test result is yes, the element is not necessarily in the set; if the test result is no, the element is definitely not in the set. The advantages of a Bloom filter are that its insertion and query times are constant and that it answers queries without storing the elements themselves, which gives it good security. In embodiments of the present invention, one index points to multiple files, so it is worthwhile to use a Bloom filter: a quick check of whether a file may exist avoids the resources and time wasted on queries for files that do not exist. Optionally, in this embodiment all files in the data file are also mapped into a Bloom filter, so that when a file in the data file is to be read, a quick search of the Bloom filter determines whether the file to be read may exist.
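The behaviour described above can be illustrated with a minimal Bloom filter. This is a sketch only: the class name, the use of salted SHA-256 digests as hash functions, and the chosen parameters are assumptions made for illustration and are not prescribed by this document.

```python
import hashlib

class BloomFilter:
    """Bit-vector membership test: 'no' is definite, 'yes' means 'maybe'."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions by salting SHA-256 with a counter.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(4, "big") + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

stored = BloomFilter(num_bits=1024, num_hashes=7)
for key in (bytes.fromhex("abcdefacaa"), bytes.fromhex("abcdefacbb")):
    stored.add(key)   # map every stored file's actual key value into the filter
```

A read would first call `stored.might_contain(key)` and stop immediately on `False`; only a `True` result, which may occasionally be a false positive, proceeds to the index lookup.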
In embodiments of the present invention, N is preferably 4. From the description of the above implementations, those skilled in the art will clearly understand that the method according to the above embodiments can be implemented by software together with a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the better implementation. Based on this understanding, the part of the technical solution of the present invention that is essential, or that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.
Embodiment 2
This embodiment also provides a file storage and indexing apparatus, which is used to implement the above embodiments and preferred implementations; what has already been explained is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
FIG. 2 is a structural block diagram of the file storage and indexing apparatus according to an embodiment of the present invention. As shown in FIG. 2, the apparatus includes a data file storage module 21 and an index file generation module 22, wherein:
the data file storage module 21 is configured to store a data file, wherein the data file is obtained by storing the files in alphabetical order of their actual key values;
the index file generation module 22 is coupled to the data file storage module 21 and is configured to generate an index file for indexing the files in the data file, wherein each index in the index file uses the first N bytes of the actual key value of a file as its key value, each index points to one or more files in the data file, the offset corresponding to the key value is the offset of the first file among the one or more files to which the key value points, the size value corresponding to the key value is the size value of the first file among the one or more files to which the key value points, and N is a positive integer.
The index file generation module is further configured to store the indexes of the index file in layers according to key value prefix, wherein the key value of an index stored in the layer corresponding to a key value prefix is a short key value with that key value prefix truncated, and the byte length of the key value prefix is less than N.
The offset of an index in the index file is an intra-layer offset whose offset range is the layer in which the index is located, and the number of bytes of the intra-layer offset is determined according to the largest layer address space among the layers.
The file storage and indexing apparatus further includes a mapping module, configured to map all files in the data file into a Bloom filter, so that when a file in the data file is to be read, a search of the Bloom filter determines whether the file to be read may exist.
It should be noted that each of the above modules may be implemented in software or hardware. In the latter case this may be done in, but is not limited to, the following ways: the modules are all located in the same processor, or the modules are located in multiple processors respectively.
In embodiments of the present invention, N is preferably 4.
Embodiment 3
This embodiment provides a method for reading a file in the file storage and indexing apparatus described above. FIG. 3 is a flowchart of the method for reading a file in the file storage and indexing apparatus according to an embodiment of the present invention. As shown in FIG. 3, the procedure includes the following steps:
Step S301: query the index file, according to the first N bytes of the actual key value of the file to be read, for the index corresponding to those first N bytes;
Step S302: match files, according to the actual key value, among the one or more files to which the index corresponding to the first N bytes of the actual key value points;
Step S303: when a file whose key value is identical to the actual key value is matched, read that file.
Optionally, in step S301, before the index is queried, the Bloom filter may be used to determine whether the file to be read may exist. If the result is that it may exist, the index file is queried, according to the first N bytes of the actual key value of the file to be read, for the index corresponding to those first N bytes; otherwise, the read is terminated.
In embodiments of the present invention, N is preferably 4.
Embodiment 4
To make the description of the embodiments of the present invention clearer, a preferred embodiment is described and explained below.
This preferred embodiment provides a file storage and index structure and method. FIG. 4 is a schematic diagram of the file storage and index structure according to a preferred embodiment of the present invention. As shown in FIG. 4, the layer file is stored in memory and groups indexes with the same key value prefix into one layer. The index file is used to locate small files. The data file is stored on disk, and each needle in it is one small file.
FIG. 5 is a flowchart of a method for reading a file according to a preferred embodiment of the present invention. FIG. 5 shows the detailed procedure: the index prefix is matched to locate the position of the small file, the complete key value is then read to check whether the file's key value matches, and if it does not match, the search continues sequentially with the next needle.
The file storage and indexing scheme provided by this preferred embodiment includes the following steps:
Step 1: prefix compression optimization, reducing the space occupied by the key value, offset, and size;
(1) Data file organization:
Similar to Facebook's Haystack, the system writes multiple small files into one data file, and each needle stores information such as the key value, size, and data.
(2) Index file organization:
1) The index file stores only the first four bytes of the key value, not the complete key value;
2) The offset and size fields in the index file are aligned to 512 bytes, which saves 1 byte; the number of bytes used for the offset and size is calculated from the actual size of the whole data file.
Step 2: store needles in order so that small files can be located; the needles in the data file are stored in alphabetical order of key value.
Because the index file keeps only the first four bytes of each key value, if several small files share the same first four bytes and their needles are not stored in order, a single offset cannot locate all of the scattered needles. For example, the key value of the file the user reads is 0xabcdefacee, but because the index file keeps only the first four bytes of the key value, only the prefix 0xabcdefac can be matched, and the exact offset to read cannot be determined.
In this preferred embodiment, the above problem is solved by storing needles in order. For example, the key value of the file the user reads is 0xabcdefacbb, which matches the prefix 0xabcdefac; the offset points to the needle 0xabcdefacaa, so the first match misses.
Using the size stored in the needle's header, the position of 0xabcdefacbb can be located, the correct needle is matched, and the data is read and returned to the user.
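The lookup in this example can be sketched end to end. The needle layout used here (a header holding a 1-byte key length and a 4-byte data length, followed by the key value and the data) is a hypothetical format chosen for illustration, as are the function names `pack` and `read`; the document only states that the needle header stores the size.

```python
import struct

N = 4                           # the index keeps only the first N key bytes
HEADER = struct.Struct(">BI")   # hypothetical header: key length, data length

def pack(files):
    """Lay out needles in key order; index the first needle of each prefix."""
    data, index = bytearray(), {}
    for key in sorted(files):
        index.setdefault(key[:N], len(data))
        data += HEADER.pack(len(key), len(files[key])) + key + files[key]
    return bytes(data), index

def read(data, index, key):
    """Return the file stored under key, or None if no needle matches."""
    pos = index.get(key[:N])
    if pos is None:
        return None
    while pos < len(data):
        key_len, data_len = HEADER.unpack_from(data, pos)
        start = pos + HEADER.size
        found = data[start:start + key_len]
        if found[:N] != key[:N]:
            return None                      # left the run sharing this prefix
        if found == key:
            return data[start + key_len:start + key_len + data_len]
        pos = start + key_len + data_len     # header sizes give the next needle
    return None

files = {
    bytes.fromhex("abcdefacaa"): b"first",
    bytes.fromhex("abcdefacbb"): b"second",
    bytes.fromhex("abcdefadcc"): b"third",
}
data, index = pack(files)
result = read(data, index, bytes.fromhex("abcdefacbb"))
```

The read for 0xabcdefacbb starts at the needle 0xabcdefacaa, misses, skips ahead by the sizes in that needle's header, and matches on the second needle. A read for a key value such as 0xabcdefacee walks the same run and returns `None` once the prefix changes.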
Step 3: index layering optimization;
(1) Layering scheme
Referring to FIG. 6, indexes whose key values share the same prefix can be grouped into one layer. The layering principle is to keep the number of needles in each layer at about 64 where possible and to choose the layering level according to the number of needles to be stored. The layering level can be determined as needed; an example of layering levels is given below:
Level 0: no layering;
Level 1: layer on the first byte of the needle key value;
Level 2: layer on the first two bytes of the needle key value;
The number of bytes of the key value prefix used for layering is less than the byte length of the key value in the index.
(2) Layering reduces the number of bytes occupied by the key value
Referring to FIG. 7, with layering only one copy of a repeated prefix is stored, which saves key value bytes.
(3) Layering reduces the number of bytes occupied by the offset
Referring to FIG. 8, before optimization the offset range of an offset is the address space of the whole data file. After optimization, the offset of a layer is an offset within the whole data file, whereas the offset of an index under that layer only needs to be an offset within the layer's region of the data file; the number of bytes required can be calculated from the largest layer address space.
In addition, this preferred embodiment uses a Bloom filter to avoid accesses to files that do not exist. The existing files are mapped into a Bloom filter in memory, and a quick search is enough to rule out files that do not exist. The time complexity is O(k), where k is the number of bits required per element. Experience shows that when k is 9.6 the false-positive rate is 1%, and if k is increased by a further 4.8 the false-positive rate falls to 0.1%.
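The 9.6 and 4.8 figures agree with the standard Bloom filter sizing formula. The sketch below assumes the textbook approximation in which the number of hash functions is chosen optimally, so that the bits needed per stored element are -ln(p) / (ln 2)^2 for a target false-positive rate p; the function names are illustrative.

```python
import math

def bits_per_element(false_positive_rate):
    """Bits per stored element for a target false-positive rate,
    assuming the optimal number of hash functions."""
    return -math.log(false_positive_rate) / math.log(2) ** 2

def optimal_hashes(bits):
    """Optimal number of hash functions for a given bits-per-element budget."""
    return bits * math.log(2)

one_percent = bits_per_element(0.01)       # about 9.6 bits per element
one_permille = bits_per_element(0.001)     # about 14.4 bits per element
extra = one_permille - one_percent         # about 4.8 additional bits
hashes = optimal_hashes(one_percent)       # between 6 and 7 hash functions
```

In this standard analysis the quoted figures are bits per stored element, and a lookup evaluates roughly 0.69 times that many hash functions. For the 1 billion files of the earlier example, a 1% filter would need on the order of 1.2 GB of memory under the same assumptions.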
下面将以Haystack为参考说明本发明优选实施例的有益效果。Advantageous effects of the preferred embodiment of the present invention will be described below with reference to Haystack.
(1)通过前缀压缩,带来的内存节省对比(1) Comparison of memory savings brought by prefix compression
参考图9,横轴表示文件数,纵轴表示索引文件需要的内存大小,短虚线表示传统的Haystack的内存消耗量,长虚线表示通过本发明实施例进行前缀压缩后的内存消耗量。从图9可以看出在文件数量为10亿的情况下,使用facabook的Haystack消耗的内存为26G多,使用本优选实施例提供的压缩前缀的索引方案消耗的内存为9G多,内存使用降低了2/3。Referring to Fig. 9, the horizontal axis represents the number of files, the vertical axis represents the memory size required for the index file, the short dashed line represents the memory consumption of the conventional Haystack, and the long dashed line represents the memory consumption after the prefix compression by the embodiment of the present invention. It can be seen from FIG. 9 that in the case where the number of files is 1 billion, the memory used by the Haystack of the facabook is more than 26G, and the indexing scheme using the compression prefix provided by the preferred embodiment consumes more than 9G of memory, and the memory usage is reduced. 2/3.
(2)再次通过索引分层,带来的内存节省对比(2) again through the index layering, the resulting memory savings comparison
参考图10,横轴表示文件数,纵轴表示索引文件需要的内存大小,短虚线表示传统的Haystack的内存消耗量,长虚线表示通过本发明实施例进行前缀压缩后的内存消耗量,实线表示通过本发明实施例进行前缀压缩并索引分层后的内存消耗量。从图10可以看出,在进行索引分层后,从优化之前的9G多内存消耗,进一步降低到4G多,又节省了1半的内存消耗。Referring to FIG. 10, the horizontal axis represents the number of files, the vertical axis represents the memory size required for the index file, the short dashed line represents the memory consumption of the conventional Haystack, and the long dashed line represents the memory consumption after the prefix compression by the embodiment of the present invention, the solid line The memory consumption after prefix compression and index stratification is performed by the embodiment of the present invention. As can be seen from FIG. 10, after index layering, the 9G multi-memory consumption before optimization is further reduced to more than 4G, and one-half memory consumption is saved.
After trialing the file storage and indexing scheme of this preferred embodiment, the overall performance for small files improved significantly: requests per second (RPS) more than doubled, and the machine's input/output (IO) utilization fell by nearly half. In addition, because the minimum storage unit was optimized, fragmentation fell by 80%. With this system we can provide users with faster read and write services while reducing the cluster's resource consumption.
Embodiment 5
This embodiment provides software for executing the technical solutions described in the above embodiments and preferred implementations.
Embodiment 6
This embodiment provides a storage medium. In this embodiment, the storage medium may be configured to store program code for performing the following steps:
Step S101: store the files in alphabetical order of their actual key values to obtain a data file;
Step S102: generate an index file for indexing the files in the data file, wherein each index in the index file uses the first N bytes of the files' actual key values as its key value, each index points to one or more files in the data file, the offset corresponding to a key value is the offset of the first file among the one or more files that the key value points to, the size value corresponding to a key value is the size of that first file, and N is a positive integer.
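Steps S101 and S102 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the embodiment's actual on-disk format: the record layout (a 6-byte header holding key length and content length, then the key, then the content), the value of `N`, and the names `HEADER` and `build` are all hypothetical, and the "size" stored in the index is taken to be the size of the whole record.

```python
import struct

N = 3                          # number of leading key bytes used as the index key (assumed)
HEADER = struct.Struct(">HI")  # assumed record header: key length, content length

def build(files):
    """Build (data_file_bytes, index) from a {key: content} mapping."""
    data = bytearray()
    index = {}
    # S101: append files in lexicographic order of their actual keys.
    for key in sorted(files):
        content = files[key]
        record = HEADER.pack(len(key), len(content)) + key + content
        # S102: one index entry per N-byte prefix, pointing at the first
        # file of the group (its offset and its size).
        index.setdefault(key[:N], (len(data), len(record)))
        data += record
    return bytes(data), index

files = {b"img_002": b"BB", b"img_001": b"A", b"doc_1": b"CCC"}
data, index = build(files)
print(index)
```

Here the two `img_` files share a single index entry, which is how truncating keys to N bytes shrinks the index: the number of entries depends on the number of distinct prefixes rather than on the number of files.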
Optionally, in this embodiment, the storage medium may include, but is not limited to, a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, an optical disc, or any other medium capable of storing program code.
Optionally, for specific examples of this embodiment, refer to the examples described in the above embodiments and optional implementations; they are not repeated here.
Embodiment 7
An embodiment of the present invention further provides a storage medium. In this embodiment, the storage medium may be configured to store program code for performing the following steps:
Step S301: using the first N bytes of the actual key value of the file to be read, look up in the index file the index corresponding to those first N bytes;
Step S302: using the actual key value, search for the matching file among the one or more files pointed to by that index;
Step S303: when a file whose key value is identical to the actual key value is found, read that file.
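The read path of steps S301 to S303 can be sketched in the same spirit. The record layout (6-byte header with key length and content length, then key, then content), the value of `N`, and the names `pack` and `read_file` are assumptions made for this illustration only; the description does not fix an on-disk format.

```python
import struct

N = 3                          # number of leading key bytes used as the index key (assumed)
HEADER = struct.Struct(">HI")  # assumed record header: key length, content length

def pack(key, content):
    """Serialize one file as header + key + content (assumed layout)."""
    return HEADER.pack(len(key), len(content)) + key + content

def read_file(data, index, actual_key):
    """Return the content stored under actual_key, or None if it is absent."""
    # S301: look up the index entry for the first N bytes of the key.
    entry = index.get(actual_key[:N])
    if entry is None:
        return None
    pos, _size = entry
    # S302: scan the files in the group, comparing full keys.
    while pos < len(data):
        key_len, content_len = HEADER.unpack_from(data, pos)
        key_start = pos + HEADER.size
        key = data[key_start:key_start + key_len]
        if key[:N] != actual_key[:N]:
            break  # left the group that shares this prefix
        body = key_start + key_len
        if key == actual_key:
            # S303: full key matches, so read the file.
            return data[body:body + content_len]
        pos = body + content_len
    return None

# Build a tiny sorted data file and its prefix index for the demonstration.
records = [(b"doc_1", b"CCC"), (b"img_001", b"A"), (b"img_002", b"BB")]
index, chunks, offset = {}, [], 0
for k, c in records:
    rec = pack(k, c)
    index.setdefault(k[:N], (offset, len(rec)))
    chunks.append(rec)
    offset += len(rec)
data = b"".join(chunks)

print(read_file(data, index, b"img_002"))
```

Because the data file is sorted by key, all files sharing a prefix are contiguous, so the scan in S302 only has to walk forward from the offset stored in the index.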
Those of ordinary skill in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit and scope, and that all such modifications and substitutions fall within the scope of the claims.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods disclosed above, and the functional modules/units of the systems and apparatuses, may be implemented as software, firmware, hardware, or suitable combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned above does not necessarily correspond to a division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed jointly by several physical components. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically contain computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
Industrial Applicability
The present disclosure solves the problem that the indexing scheme used by the Haystack system consumes a large amount of memory, and reduces the index system's consumption of memory resources.

Claims (13)

  1. A file storage and indexing method, comprising:
    storing files in alphabetical order of their actual key values to obtain a data file;
    generating an index file for indexing the files in the data file, wherein each index in the index file uses the first N bytes of the files' actual key values as its key value, each index points to one or more files in the data file, the offset corresponding to the key value is the offset of the first file among the one or more files that the key value points to, the size value corresponding to the key value is the size of the first file among the one or more files that the key value points to, and N is a positive integer.
  2. The method according to claim 1, wherein the offset field and the size field in the index file are aligned to 512 bytes.
  3. The method according to claim 1, wherein generating the index file for indexing the files in the data file further comprises:
    storing the indexes of the index file in layers according to key value prefixes, wherein the key value of an index stored in the layer corresponding to a key value prefix is a short key value with the key value prefix truncated, and wherein the byte length of the key value prefix is less than N.
  4. The method according to claim 3, wherein
    the offset of an index in the index file is an intra-layer offset whose range is the layer in which the index is located, and the number of bytes of the intra-layer offset is determined according to the largest layer address space among the layers.
  5. The method according to any one of claims 1 to 4, wherein the method further comprises:
    mapping all files in the data file into a Bloom filter, so that when a file in the data file is read, whether the file to be read may exist is determined by quickly searching the Bloom filter.
  6. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method according to any one of claims 1 to 5.
  7. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 5.
  8. A file storage and indexing apparatus, comprising:
    a data file storage module configured to store a data file, wherein the data file is obtained by storing files in alphabetical order of their actual key values;
    an index file generation module configured to generate an index file for indexing the files in the data file, wherein each index in the index file uses the first N bytes of the files' actual key values as its key value, each index points to one or more files in the data file, the offset corresponding to the key value is the offset of the first file among the one or more files that the key value points to, the size value corresponding to the key value is the size of the first file among the one or more files that the key value points to, and N is a positive integer.
  9. The apparatus according to claim 8, wherein the index file generation module is further configured to store the indexes of the index file in layers according to key value prefixes, wherein the key value of an index stored in the layer corresponding to a key value prefix is a short key value with the key value prefix truncated, and wherein the byte length of the key value prefix is less than N.
  10. The apparatus according to claim 9, wherein
    the offset of an index in the index file is an intra-layer offset whose range is the layer in which the index is located, and the number of bytes of the intra-layer offset is determined according to the largest layer address space among the layers.
  11. The apparatus according to any one of claims 8 to 10, wherein the apparatus further comprises:
    a mapping module configured to map all files in the data file into a Bloom filter, so that when a file in the data file is read, whether the file to be read may exist is determined by searching the Bloom filter.
  12. [Corrected under Rule 26, 01.02.2018]
    A method of reading a file in the file storage and indexing apparatus according to any one of claims 8 to 11, comprising:
    using the first N bytes of the actual key value of the file to be read, querying the index file for the index corresponding to the first N bytes of the actual key value;
    using the actual key value, matching a file among the one or more files pointed to by the index corresponding to the first N bytes of the actual key value;
    when a file whose key value is identical to the actual key value is matched, reading that file.
  13. [Corrected under Rule 26, 01.02.2018]
    The method according to claim 11, wherein querying the index file for the index corresponding to the first N bytes of the actual key value, using the first N bytes of the actual key value of the file to be read, comprises:
    determining, according to the Bloom filter, whether the file to be read may exist; if the result is that the file may exist, querying the index file for the index corresponding to the first N bytes of the actual key value using the first N bytes of the actual key value of the file to be read; otherwise, terminating the read.
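The Bloom filter check recited in the claims can be illustrated with a minimal sketch. The claims do not specify the filter's size, its hash functions, or its interface, so the class `BloomFilter`, its parameters, the use of salted SHA-256 digests, and the helper `read_with_filter` are all assumptions made for this illustration; the dictionary `store` merely stands in for the index and data file lookup.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests (assumed scheme).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(i.to_bytes(1, "big") + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

def read_with_filter(bloom, lookup, key):
    """Consult the filter first; only query the index if the key may exist."""
    if not bloom.might_contain(key):
        return None  # definitely absent: terminate the read without touching the index
    return lookup(key)

store = {b"img_001": b"A", b"img_002": b"BB"}  # stand-in for index + data file
bloom = BloomFilter()
for k in store:
    bloom.add(k)

print(read_with_filter(bloom, store.get, b"img_002"))
print(read_with_filter(bloom, store.get, b"missing"))
```

A negative answer from the filter is always correct, so the read can stop immediately; a positive answer may be a false positive, which is why the index lookup and full key comparison still follow.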
PCT/CN2017/117967 2016-12-26 2017-12-22 File storage and indexing method, apparatus, media, device and method for reading files WO2018121430A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201611221215.1 2016-12-26
CN201611221215.1A CN106874348B (en) 2016-12-26 2016-12-26 File storage and index method and device and file reading method

Publications (1)

Publication Number Publication Date
WO2018121430A1 (en) 2018-07-05

Family

ID=59164487

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/117967 WO2018121430A1 (en) 2016-12-26 2017-12-22 File storage and indexing method, apparatus, media, device and method for reading files

Country Status (2)

Country Link
CN (1) CN106874348B (en)
WO (1) WO2018121430A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110825940A (en) * 2019-09-24 2020-02-21 武汉智美互联科技有限公司 Network data packet storage and query method
CN111639076A (en) * 2020-05-14 2020-09-08 民生科技有限责任公司 Cross-platform efficient key value storage method
CN112748866A (en) * 2019-10-31 2021-05-04 北京沃东天骏信息技术有限公司 Method and device for processing incremental index data
CN115827573A (en) * 2023-02-16 2023-03-21 麒麟软件有限公司 Linux-based key-value graphic data storage and use method
CN117271440A (en) * 2023-11-21 2023-12-22 深圳市云希谷科技有限公司 File information storage method, reading method and related equipment based on freeRTOS

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106874348B (en) * 2016-12-26 2020-06-16 贵州白山云科技股份有限公司 File storage and index method and device and file reading method
CN110209489B (en) * 2018-02-28 2020-07-31 贵州白山云科技股份有限公司 Memory management method and device suitable for memory page structure
CN109614411B (en) * 2018-11-19 2022-03-04 杭州复杂美科技有限公司 Data storage method, device and storage medium
CN110502472A (en) * 2019-08-09 2019-11-26 西藏宁算科技集团有限公司 A kind of the cloud storage optimization method and its system of large amount of small documents
CN113312313B (en) * 2021-01-29 2023-09-29 淘宝(中国)软件有限公司 Data query method, nonvolatile storage medium and electronic device
CN112765113B (en) * 2021-01-31 2024-04-09 云知声智能科技股份有限公司 Index compression method, index compression device, computer readable storage medium and electronic equipment

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1227413A1 (en) * 2001-01-25 2002-07-31 Telefonaktiebolaget L M Ericsson (Publ) Method for optimised locating of indexed records of static data with different length
CN103810246A (en) * 2013-12-27 2014-05-21 北京天融信软件有限公司 Index building method and device and index query method and device
CN103914483A (en) * 2013-01-07 2014-07-09 深圳市腾讯计算机系统有限公司 File storage method and device and file reading method and device
CN104572670A (en) * 2013-10-15 2015-04-29 方正国际软件(北京)有限公司 Small file storage, query and deletion method and system
CN105069048A (en) * 2015-07-23 2015-11-18 东方网力科技股份有限公司 Small file storage method, query method and device
CN106874348A (en) * 2016-12-26 2017-06-20 贵州白山云科技有限公司 File is stored and the method for indexing means, device and reading file

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8862555B1 (en) * 2011-05-16 2014-10-14 Trend Micro Incorporated Methods and apparatus for generating difference files
CN102779180B (en) * 2012-06-29 2015-09-09 华为技术有限公司 The operation processing method of data-storage system, data-storage system
CN103870492B (en) * 2012-12-14 2017-08-04 腾讯科技(深圳)有限公司 A kind of date storage method and device based on key row sequence
CN105117417B (en) * 2015-07-30 2018-04-17 西安交通大学 A kind of memory database Trie tree indexing means for reading optimization

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110825940A (en) * 2019-09-24 2020-02-21 武汉智美互联科技有限公司 Network data packet storage and query method
CN110825940B (en) * 2019-09-24 2023-08-22 武汉智美互联科技有限公司 Network data packet storage and query method
CN112748866A (en) * 2019-10-31 2021-05-04 北京沃东天骏信息技术有限公司 Method and device for processing incremental index data
CN111639076A (en) * 2020-05-14 2020-09-08 民生科技有限责任公司 Cross-platform efficient key value storage method
CN111639076B (en) * 2020-05-14 2023-12-22 民生科技有限责任公司 Cross-platform efficient key value storage method
CN115827573A (en) * 2023-02-16 2023-03-21 麒麟软件有限公司 Linux-based key-value graphic data storage and use method
CN117271440A (en) * 2023-11-21 2023-12-22 深圳市云希谷科技有限公司 File information storage method, reading method and related equipment based on freeRTOS
CN117271440B (en) * 2023-11-21 2024-02-06 深圳市云希谷科技有限公司 File information storage method, reading method and related equipment based on freeRTOS

Also Published As

Publication number Publication date
CN106874348B (en) 2020-06-16
CN106874348A (en) 2017-06-20


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17885665

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17885665

Country of ref document: EP

Kind code of ref document: A1