CN114691681A - Data processing method and device, electronic equipment and readable storage medium - Google Patents

Data processing method and device, electronic equipment and readable storage medium

Info

Publication number
CN114691681A
CN114691681A (application CN202210287639.7A)
Authority
CN
China
Prior art keywords
data
file
data block
storage
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210287639.7A
Other languages
Chinese (zh)
Inventor
韦万
黄俊深
周嘉祺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Pingkai Star Beijing Technology Co ltd
Original Assignee
Pingkai Star Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pingkai Star Beijing Technology Co ltd filed Critical Pingkai Star Beijing Technology Co ltd
Priority to CN202210287639.7A
Publication of CN114691681A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/221 Column-oriented storage; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24552 Database cache management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/0643 Management of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/0644 Management of space entities, e.g. partitions, extents, pools
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655 Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0656 Data buffering arrangements

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application provides a data processing method and device, electronic equipment and a computer readable storage medium, and relates to the technical field of data storage. The method comprises the following steps: acquiring a data storage request, wherein the data storage request comprises data to be stored, and caching the data to be stored into a memory in column group form; when the cached data in the memory reaches a preset size, compressing the cached data to obtain a data block; searching the free storage space of each data file in the space management file, and determining, from the data files and according to the size of the data block, a target data file for storing the data block and a storage position of the data block in the target data file; and moving the data block to the storage position for storage. According to the embodiment of the application, excessive write IO generated by a columnar storage system in a frequently updated state is avoided, and a large number of small files are avoided.

Description

Data processing method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of data storage technologies, and in particular, to a data processing method, an apparatus, an electronic device, and a readable storage medium.
Background
In the field of analytical databases, in order to speed up analysis of data, columnar storage systems are often used to store the data. In a columnar storage system, the data of a table is organized by column, for example with each column stored in its own data file. Therefore, only the relevant columns need to be read during a query, and reading of irrelevant column data is avoided.
When a batch of updated data is written into the columnar storage system, the number of IO operations required equals the number of columns. The existing data updating process therefore suffers from excessively high IOPS and produces many files, which degrades the write capability of the columnar storage system and may even cause a Write Stall.
Disclosure of Invention
Embodiments of the present application provide a data processing method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product, which can solve the above problems in the prior art. The technical scheme is as follows:
according to an aspect of an embodiment of the present application, there is provided a data processing method, including:
acquiring a data storage request, wherein the data storage request comprises data to be stored, and caching the data to be stored into a memory in column group form;
when the cached data in the memory reaches a preset size, compressing the cached data to obtain a data block;
searching the free storage space of each data file from the space management file, and determining a target data file for storing the data block and a storage position of the data block in the target data file from each data file by combining the size of the data block;
moving the data block to a storage position for storage;
wherein each data file is located in a persistent storage device.
As an optional implementation, compressing the cache data to obtain the data block includes:
serializing and compressing each column group in the memory to obtain corresponding subdata blocks;
and summarizing all the subdata blocks to obtain the data blocks.
As an optional implementation, after the data block is moved to the storage location for storage, the method further includes:
determining storage-related meta-information for the data block;
storing the meta information in a first meta information file in the memory.
As an optional implementation, the meta information includes at least one of the following information:
a data block identifier of the data block and a corresponding first offset;
a file identification of the target data file;
the sub data block identifier of each sub data block and the corresponding second offset;
the first offset is used for representing the storage position of the corresponding data block in the target data file; the second offset is used to represent the relative position of the corresponding sub-block of data in the storage location.
As an optional implementation, after the data block is moved to the storage location for storage, the method further includes:
receiving a data reading request, wherein the data reading request comprises a target sub data block identifier of a target sub data block and a target data block identifier of a target data block corresponding to the target sub data block;
searching target meta-information comprising a target data block identifier in a first meta-information file;
reading the target sub data block from the target data file to the memory according to the file identifier of the target data file in the target meta information, the first offset corresponding to the target data block identifier and the second offset corresponding to the target sub data block identifier;
and decompressing the target subdata block in the memory, and decoding to obtain a column group serving as a reading result.
As an optional implementation manner, the method further includes: storing the meta information in a second meta information file in a persistent storage device.
As an alternative embodiment, the space management file is an order-preserving data structure;
wherein, the nodes in the data structure are used for representing the data files; the node information of the node comprises a first offset and a residual space of the corresponding data file; each node is ordered by a corresponding offset.
According to another aspect of embodiments of the present application, there is provided a data processing apparatus including:
the cache module is used for acquiring a data storage request, wherein the data storage request comprises data to be stored, and caching the data to be stored into the memory in column group form;
the compression module is used for compressing the cached data to obtain a data block when the cached data in the memory reaches a preset size;
the storage position determining module is used for searching the free storage space of each data file from the space management file, and determining a target data file for storing the data block and the storage position of the data block in the target data file from each data file by combining the size of the data block;
the storage module is used for moving the data block to a storage position for storage;
wherein each data file is located in a persistent storage device.
As an alternative embodiment, the compression module comprises:
the column group compression submodule is used for serializing and compressing each column group in the memory to obtain a corresponding subdata block;
and the summarizing submodule is used for summarizing all the subdata blocks to obtain the data blocks.
As an optional implementation, the data processing apparatus further comprises:
a meta-information determining module for determining storage-related meta-information of the data block;
the first storage module is used for storing the meta information in a first meta information file in the memory.
As an optional implementation, the meta information includes at least one of the following information:
a data block identifier of the data block and a corresponding first offset;
a file identification of the target data file;
the sub data block identifier of each sub data block and the corresponding second offset;
the first offset is used for representing the storage position of the corresponding data block in the target data file; the second offset is used to indicate the relative position of the corresponding sub-block of data in the storage location.
As an optional implementation manner, the data processing apparatus further includes:
a read request receiving module, configured to receive a data read request, where the data read request includes a target sub data block identifier of a target sub data block and a target data block identifier of a target data block corresponding to the target sub data block;
the meta-information searching module is used for searching target meta-information comprising a target data block identifier in the first meta-information file;
the sub data block searching module is used for reading the target sub data block from the target data file to the memory according to the file identifier of the target data file in the target meta information, the first offset corresponding to the target data block identifier and the second offset corresponding to the target sub data block identifier;
and the decompression module is used for decompressing the target subdata block in the memory and decoding the target subdata block to obtain a column group serving as a reading result.
As an optional implementation, the data processing apparatus includes:
and the second storage module is used for storing the meta information in a second meta information file in the persistent storage device.
As an alternative embodiment, the space management file is an order-preserving data structure;
wherein, the nodes in the data structure are used for representing the data files; the node information of the node comprises a first offset and a residual space of the corresponding data file; each node is ordered by a corresponding offset.
According to another aspect of an embodiment of the present application, there is provided an electronic apparatus including: a memory, a processor and a computer program stored on the memory, the processor executing the computer program to implement the steps of the above method.
According to yet another aspect of embodiments of the present application, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above method.
The technical scheme provided by the embodiment of the application has the following beneficial effects:
by obtaining a data storage request, wherein the data storage request comprises data to be stored, caching the data to be stored in a column group mode, compressing the cached data to obtain data blocks when the cached data in a memory reaches a preset size, searching a free storage space of each data file from a space management file, determining a target data file for storing the data blocks and storage positions of the data blocks in the target data file by combining the size of the data blocks, and moving the data blocks to the storage positions for storage, excessive write IO (input/output) generated by a column type storage system in a frequently updated state is avoided, and a large number of small files are avoided.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
FIG. 1 is a schematic diagram of a columnar storage system for writing data according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a data processing system according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a data processing method according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of determining a target data file according to an embodiment of the present application;
FIG. 5 is a block diagram of a data processing system according to another embodiment of the present application;
fig. 6 is a schematic diagram of a data reading flow of a data processing method according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below in conjunction with the drawings in the present application. It should be understood that the embodiments set forth below in connection with the drawings are exemplary descriptions for explaining technical solutions of the embodiments of the present application, and do not limit the technical solutions of the embodiments of the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the terms "comprises" and/or "comprising," when used in this specification in connection with embodiments of the present application, specify the presence of stated features, information, data, steps, operations, elements, and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components, and/or groups thereof, as embodied in the art. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein indicates at least one of the items defined by the term, e.g., "a and/or B" may be implemented as "a", or as "B", or as "a and B".
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, a schematic diagram of writing data in a columnar storage system is exemplarily shown. The columnar storage system includes 3 column groups, namely ColA, ColB and ColC. Whenever new data needs to be written into the system, a new data structure is created at the end of each of the 3 column groups, and when the new data is written into the newly created data structures, a row of data Row3 is added to the columnar storage system.
In the prior art, a new file is generated each time data is updated, and under a high update rate a large number of small files are generated, which slows down query performance; in order to avoid too many small files, data merging needs to be performed continuously in the background, which consumes a large amount of CPU and IO and affects system stability.
The present application provides a data processing method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product, which are intended to solve the above technical problems in the prior art.
The technical solutions of the embodiments of the present application and the technical effects produced by the technical solutions of the present application will be described below through descriptions of several exemplary embodiments. It should be noted that the following embodiments may be referred to, referred to or combined with each other, and the description of the same terms, similar features, similar implementation steps and the like in different embodiments is not repeated.
The data processing method provided by the present application can be applied to the data processing system of fig. 2, which includes the terminal 110 and the server 120. It should be understood that the data storage system provided by the embodiment of the present application may be a background data storage system of the target application, and the server 120 in the data storage system may communicate with the terminal device 110 running the target application.
Illustratively, after the data to be stored is generated by the target application running on the terminal device 110, the terminal device 110 sends a data storage request to the server 120, where the data storage request includes the data to be stored. The server 120 includes an internal memory and a persistent storage device, such as a disk, a Solid State Drive (SSD) or an NVMe solid state drive (NVMe SSD), i.e., a device that, unlike the internal memory, stores data persistently.
The server 120 caches the data to be stored in the memory in column group form. When the cached data in the memory reaches a preset size, the server 120 compresses the cached data to obtain a data block, and searches the space management file for the free storage space of each data file on the persistent storage device. It should be understood that the space management file is a file that records the free storage space of each data file, and a data file is a file used for storing the database. The server 120 further determines, according to the size of the data block, a target data file for storing the data block and a storage location of the data block in the target data file, moves the data block to the storage location for storage, and stores the storage-related meta information of the data block in a first meta information file in the memory, so that the target data can be obtained quickly when data is read.
The meta information in the embodiment of the present application includes at least one of the following: a data block identifier of the data block and a corresponding first offset; a file identifier of the target data file; and the sub data block identifier of each sub data block and the corresponding second offset. The first offset is used for representing the storage position of the corresponding data block in the target data file; the second offset is used to represent the relative position of the corresponding sub data block in the storage position.
When a target application running on the terminal device 110 needs to read data, the terminal device 110 sends a data reading request to the server 120, where the data reading request includes a target sub data block identifier of a target sub data block and a target data block identifier of a target data block corresponding to the target sub data block. The server 120 searches the first meta-information file for target meta-information including the identification of the target data block; reading target sub data blocks from a target data file of the persistent storage device to a memory according to a file identifier of the target data file in the target meta information, a first offset corresponding to the target data block identifier and a second offset corresponding to the target sub data block identifier; and decompressing the target subdata block in the memory, and decoding to obtain a column group serving as a reading result.
Referring to fig. 3, a schematic flow chart of a data processing method according to an embodiment of the present application is exemplarily shown, and as shown in the drawing, the method includes:
s101, a data storage request is obtained, the data storage request comprises data to be stored, and the data to be stored are cached in a memory in a column group mode.
In the embodiment of the application, the storage request for the data to be stored is processed in the memory: the data to be stored is parsed into a columnar storage form and cached in the memory. For example, if the memory contains 3 column groups and each column group contains 3 pieces of sub-data at a certain moment, then when a piece of data to be stored is subsequently obtained, it is split into 3 pieces of sub-data, and the 3 pieces of sub-data are appended to the ends of the 3 column groups in sequence.
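The following minimal sketch (in Python, with hypothetical names such as MemTable and append_row that are not taken from the patent) illustrates this caching step: an incoming row is split into its column values and each value is appended to the tail of the corresponding column array.

```python
# Hypothetical sketch of caching rows into per-column arrays (column groups).
class MemTable:
    def __init__(self, column_names):
        # One array per column group; a write only appends to array tails.
        self.columns = {name: [] for name in column_names}

    def append_row(self, row):
        # Split the row into sub-data and append each piece to the end
        # of its column group, in order.
        for name, value in zip(self.columns, row):
            self.columns[name].append(value)

    def cached_rows(self):
        # Number of cached rows, used to decide when the preset size is reached.
        return len(next(iter(self.columns.values()), []))


mt = MemTable(["ColA", "ColB", "ColC"])
mt.append_row(("a3", "b3", "c3"))  # corresponds to adding Row3 in fig. 1
```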
S102, when the data cached in the memory reaches a preset size, compressing the cached data to obtain a data block.
In the embodiment of the application, a data volume threshold is set for the data cached in the memory, and when the cached data reaches the data volume threshold, the data is compressed. In this way, not too much data needs to be accumulated, and the timeliness of data updating is guaranteed.
S103, searching the free storage space of each data file from the space management file, and determining a target data file for storing the data block and the storage position of the data block in the target data file from each data file according to the size of the data block.
It should be noted that the space management file in the embodiment of the present application may be located in the memory, so as to improve the efficiency of obtaining the free storage space of each data file, while each data file is located in a persistent storage device, such as a magnetic disk, a Solid State Drive (SSD) or an NVMe solid state drive (NVMe SSD). The space management file is used to record the free storage space of each data file, and the size of the free storage space determines the size of the data block that the data file can subsequently store. Therefore, after the free storage space of each data file is acquired from the space management file, a data file capable of accommodating the data block can be selected according to the size of the data block, and when multiple data files can accommodate the data block, the target data file may be determined randomly.
And S104, moving the data block to a storage position for storage.
According to the data processing method, the data storage request is obtained, the data to be stored are cached in a column group mode, when the cached data in the memory reach the preset size, the cached data are compressed to obtain the data blocks, the free storage space of each data file is searched from the space management file, the size of each data block is combined, the target data file used for storing the data blocks and the storage position of each data block in the target data file are determined, the data blocks are moved to the storage position to be stored, excessive write IO (input/output) of a column type storage system in a frequently updated state is avoided, and a large number of small files are avoided.
On the basis of the foregoing embodiments, as an optional embodiment, compressing the cache data to obtain the data block includes:
serializing and compressing each column group in the memory to obtain corresponding subdata blocks;
and summarizing all the subdata blocks to obtain the data blocks.
Taking fig. 1 as an example, fig. 1 has 3 column groups in total, and 3 sub data blocks can be obtained by serializing and compressing each column group. The 3 sub data blocks are then packaged to obtain the data block.
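As a hedged illustration only (the patent does not fix the serialization or compression codec; pickle and zlib are stand-ins), each column group is serialized and compressed into a sub data block, and the sub data blocks are concatenated into one data block (Page) together with the per-column offsets needed later for reading a single column:

```python
import pickle
import zlib

def build_page(columns):
    """Serialize and compress each column group into a sub data block, then
    concatenate the sub data blocks into one data block (Page).
    Returns the page bytes and the relative offset/length of each column."""
    body = b""
    column_offsets = {}
    for name, values in columns.items():
        sub_block = zlib.compress(pickle.dumps(values))
        # Relative position of this sub data block inside the Page (second offset).
        column_offsets[name] = (len(body), len(sub_block))
        body += sub_block
    return body, column_offsets


page_bytes, col_offsets = build_page({"ColA": ["a1", "a2"], "ColB": ["b1", "b2"]})
```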
On the basis of the above embodiments, as an alternative embodiment, after the data block is moved to the storage location for storage, the method further includes:
determining storage-related meta-information for the data block;
specifically, the meta information in the embodiment of the present application may include the following information:
the data block identification of the data block and a corresponding first offset, wherein the first offset is used for representing the storage position of the corresponding data block in the target data file;
a file identification of the target data file;
the sub-data block identifier of each sub-data block and a corresponding second offset, wherein the second offset is used for indicating the relative position of the corresponding sub-data block in the storage location.
Storing the meta information in a first meta information file in the memory. The meta information is cached in the memory, which facilitates quick access and ensures reading efficiency. Specifically, in the embodiment of the present application, the data block identifier of the data block may be used as the key, and the other information in the meta information except the data block identifier may be used as the value, to store the meta information.
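A small sketch of this key-value layout, assuming a plain in-memory dictionary keyed by the data block identifier (the field names are illustrative, not taken from the patent):

```python
# First meta information file cached in memory: data block identifier -> value.
page_directory = {}

def record_page_meta(page_id, file_id, first_offset, page_size, column_offsets):
    # The data block identifier is the key; all other meta information is the value.
    page_directory[page_id] = {
        "file_id": file_id,                # file identifier of the target data file
        "first_offset": first_offset,      # position of the data block in that file
        "page_size": page_size,            # size of the data block
        "column_offsets": column_offsets,  # second offsets of the sub data blocks
    }

record_page_meta(page_id=7, file_id=2, first_offset=400, page_size=200,
                 column_offsets={"ColA": (0, 64), "ColB": (64, 64)})
```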
On the basis of the above embodiments, as an optional embodiment, the space management file is an order-preserving data structure, for example a red-black tree (a self-balancing binary search tree). Nodes in the data structure are used to represent the data files; the node information of a node comprises the first offset and the remaining space of the corresponding data file; and the nodes are ordered by their offsets.
Referring to fig. 4, which schematically illustrates a flowchart of determining a target data file according to an embodiment of the present application, the red-black tree shown in the figure includes 3 nodes: the first offset of node 1 is 200 bytes and its remaining space is 50 bytes; the first offset of node 2 is 0 bytes and its remaining space is 100 bytes; the first offset of node 3 is 400 bytes and its remaining space is 300 bytes. If the size of the data block is 200 bytes, the tree is traversed downwards from the root node (node 1); it is determined that the remaining space of node 3 is sufficient to store the data block, so the data block is moved to the data file corresponding to node 3, and the first offset and remaining space of that data file are updated in the space management file. Correspondingly, the first offset recorded by node 3 becomes 600 (400+200) and the remaining space becomes 100 (300-200).
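The allocation walk above can be sketched as follows. Python's standard library has no red-black tree, so this illustrative version keeps the free-space entries in a plain list ordered by offset; the bookkeeping (find an entry with enough remaining space, advance its offset, shrink its remaining space) follows the worked example, but the code and names are assumptions, not the patented implementation.

```python
# Free-space entries corresponding to nodes 2, 1 and 3 of fig. 4,
# ordered by offset: (first offset, remaining space).
free_space = [(0, 100), (200, 50), (400, 300)]

def allocate(space_map, block_size):
    """Find a region large enough for the data block, return its offset,
    and update that region's offset and remaining space in place."""
    for i, (offset, remaining) in enumerate(space_map):
        if remaining >= block_size:
            space_map[i] = (offset + block_size, remaining - block_size)
            return offset
    raise RuntimeError("no data file has enough free space")

position = allocate(free_space, 200)
print(position)    # 400: the block is stored at offset 400 of that data file
print(free_space)  # [(0, 100), (200, 50), (600, 100)]
```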
Referring to fig. 5, which schematically shows a structural diagram of a data processing system according to another embodiment of the present application, as shown in the figure, the overall structure of the system is divided into three parts, including a memory table MemTable101, a data block storage module BlobStore102, and a metadata storage module PageMetaStore 103:
MemTable 101 is a memory structure used to store the most recently written data. The data in the MemTable has been parsed and is stored in a columnar format (called column groups). The MemTable data is persisted (Flush) to BlobStore 102 when it reaches a predetermined size.
Before Flush, the data of MemTable 101 is first serialized and compressed to obtain a data block, which is called a Page. Each Page has a unique ID, i.e., the data block identifier Page ID. It should be noted that the size of a Page in the embodiment of the present application is determined by the size of the MemTable, so the Page size is not fixed.
As an alternative embodiment, MemTable 101 adopts a columnar data form, i.e., an array is used to store the data of each column, with one array corresponding to one column (a column array). Writing data only requires appending (Append) to the end of the arrays. MemTable 101 includes two parts, an Immutable (a non-modifiable table, which is a temporary form during the Flush phase of a Mutable) and a Mutable (a modifiable table, into which update data can be written normally). Only the Mutable can be written into; the Immutable is the intermediate state of a Mutable during the Flush process.
BlobStore 102 is used to actually store the content of Pages. BlobStore 102 stores data using data files (BlobFile) 1022 on a persistent storage device, and the free space (Free Space) of each BlobFile 1022 is recorded by a space management file (SpaceMap) 1021. BlobStore 102 may contain multiple pairs of SpaceMap 1021 and BlobFile 1022, and a BlobFile 1022 has an upper capacity limit.
As an alternative embodiment, the SpaceMap 1021 adopts the data structure of a red-black tree (a self-balancing binary search tree). Each node in the tree records a piece of free space; the node information comprises the Offset and the remaining space Length, and the nodes are sorted by Offset.
PageMetaStore 103 stores the meta information (PageMeta) of each Page. The PageMeta is cached in memory using a first meta information file (PageDirectory) 1031 to facilitate fast access, while the PageMeta is also persistently stored in a second meta information file (PageMeta WAL File) 1032 on a persistent storage device.
As an alternative embodiment, the PageDirectory1031 uses a data structure of a hash table to store meta information, so as to facilitate fast obtaining meta information corresponding to a Page according to a Page id.
The processing flow of the data processing system shown in fig. 5 includes:
s201, processing the write request of the data to be stored in the memory, and caching the data to be stored in the MemTable 101.
A MemTable is internally subdivided into two parts: a Mutable and an Immutable. When the Mutable has been filled with written data, or the upper layer is manually invoked, the current Mutable is converted into an Immutable and a new Mutable is generated for future writes; the just-generated Immutable is then serialized and compressed into a Page.
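A minimal sketch of this Mutable/Immutable switch, with illustrative names and a simple lock standing in for whatever concurrency control the real system uses:

```python
import threading

class DoubleBufferedMemTable:
    """Writes go to the Mutable; a flush freezes it into an Immutable and
    installs a fresh Mutable for future writes (illustrative sketch only)."""
    def __init__(self, column_names):
        self.column_names = list(column_names)
        self.mutable = {name: [] for name in self.column_names}
        self.immutable = None
        self._lock = threading.Lock()

    def write_row(self, row):
        with self._lock:
            for name, value in zip(self.column_names, row):
                self.mutable[name].append(value)

    def freeze(self):
        # Triggered when the Mutable is full or the upper layer requests a flush.
        with self._lock:
            self.immutable = self.mutable
            self.mutable = {name: [] for name in self.column_names}
        # The returned Immutable is then serialized and compressed into a Page.
        return self.immutable
```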
Since a feature of columnar storage is that it supports reading only part of the columns, an index of each column is required within a Page for fast retrieval of the Offset of a particular column.
S202, according to the size of the Page, applying for proper Free Space from the SpaceMap1021 in the BlobStore102, and generating the PageMeta.
The contents of the PageMeta include the BlobFile ID, a first offset (Page Offset), the data block size (Page Size), a second offset (Column Offset) for each column in the Page, and so on.
S203, the Page is written into the specific position corresponding to the BlobFile 1022.
S204, the PageMeta is written into the PageMeta WAL File1032 of the persistent storage device.
S205, the PageMeta is cached in the PageDirectory 1031 in memory. Because the meta information is also stored in the persistent storage device, it can be restored to the in-memory PageDirectory during startup, and the PageID, Offset and Size information in the PageMeta WAL File can also be used to restore the space management file SpaceMap, providing a backup for coping with a system restart.
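A restart-recovery sketch under the same assumptions as the earlier snippets (the record layout and names are hypothetical): replaying the persisted PageMeta records rebuilds both the in-memory PageDirectory and a view of the space occupied in each data file, from which the SpaceMap can be restored.

```python
def recover(wal_records, file_capacity):
    """Rebuild the in-memory PageDirectory and a per-file used-space view from
    persisted PageMeta records of the form (page_id, file_id, offset, size)."""
    page_directory = {}
    used = {}  # file_id -> list of (offset, size) regions occupied by Pages
    for page_id, file_id, offset, size in wal_records:
        page_directory[page_id] = {"file_id": file_id,
                                   "first_offset": offset,
                                   "page_size": size}
        used.setdefault(file_id, []).append((offset, size))
    # Everything in a file that is not covered by `used` (up to file_capacity)
    # is free space; a real SpaceMap would be rebuilt from that information.
    return page_directory, used

directory, used_regions = recover([(7, 2, 400, 200)], file_capacity=1024)
```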
Referring to fig. 6, a data reading flow of the data processing method according to the embodiment of the present application is exemplarily shown, including:
s301, receiving a data reading request, wherein the data reading request comprises a target sub data block identifier of a target sub data block and a target data block identifier of a target data block corresponding to the target sub data block;
s302, target meta-information including the target data block identification is searched in the first meta-information file.
When the first meta-information file is stored, the target data block identifier can be used as the key and the other meta-information can be stored as the value; in the data reading stage, the key corresponding to the target data block identifier can be searched for in the first meta-information file, so that the target meta-information can be quickly located.
S303, reading the target sub data block from the target data file to the memory according to the file identifier of the target data file in the target meta information, the first offset corresponding to the target data block identifier and the second offset corresponding to the target sub data block identifier.
Specifically, the position of the target data file is determined from the persistent storage device according to a file identifier of the target data file in the target meta information, the storage position of the target data in the persistent storage device is further determined according to a first offset corresponding to the target data block identifier, the storage position of the target sub data block is further determined according to a second offset corresponding to the target sub data block identifier, and the target sub data block is read from the target data file to the memory according to the storage position.
S304, decompressing the target sub data block in the memory, and decoding the target sub data block into a column group as a reading result.
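Putting the read steps together in one hedged sketch (it reuses the pickle-plus-zlib encoding assumed in the write sketch, and the file naming is illustrative): look up the target meta information by data block identifier, seek to the Page's first offset plus the column's second offset, read the sub data block, then decompress and decode it into a column group.

```python
import pickle
import zlib

def read_column(page_directory, page_id, column_name, data_dir="."):
    """Read one sub data block (a column group) of one Page."""
    meta = page_directory[page_id]                   # S302: locate target meta info
    rel_offset, length = meta["column_offsets"][column_name]
    path = f"{data_dir}/blob_{meta['file_id']}.dat"  # illustrative file naming
    with open(path, "rb") as f:
        f.seek(meta["first_offset"] + rel_offset)    # S303: first + second offset
        sub_block = f.read(length)
    return pickle.loads(zlib.decompress(sub_block))  # S304: decompress and decode
```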
An embodiment of the present application provides a data processing apparatus, and as shown in fig. 7, the data processing apparatus may include: a cache module 101, a compression module 102, a storage location determination module 103, and a storage module 104, wherein,
the cache module 101 is configured to obtain a data storage request, where the data storage request includes data to be stored, and cache the data to be stored in a memory in a column group form;
the compression module 102 is configured to compress cached data to obtain a data block when the cached data in the memory reaches a preset size;
a storage location determining module 103, configured to search a free storage space of each data file from the space management file, and determine, from each data file, a target data file for storing the data block and a storage location of the data block in the target data file in combination with the size of the data block;
the storage module 104 is configured to move the data block to a storage location for storage;
wherein each data file is located in a persistent storage device.
The apparatus of the embodiment of the present application may execute the method provided by the embodiment of the present application, and the implementation principle is similar, the actions executed by the modules in the apparatus of the embodiments of the present application correspond to the steps in the method of the embodiments of the present application, and for the detailed functional description of the modules of the apparatus, reference may be specifically made to the description in the corresponding method shown in the foregoing, and details are not repeated here.
As an alternative embodiment, the compression module comprises:
the column group compression submodule is used for serializing and compressing each column group in the memory to obtain a corresponding subdata block;
and the summarizing submodule is used for summarizing all the subdata blocks to obtain the data blocks.
As an optional implementation, the data processing apparatus further comprises:
a meta-information determining module for determining storage-related meta-information for the data block;
the first storage module is used for storing the meta information in a first meta information file in the memory.
As an optional implementation, the meta information includes at least one of the following information:
a data block identifier of the data block and a corresponding first offset;
a file identification of the target data file;
the sub data block identifier of each sub data block and the corresponding second offset;
the first offset is used for representing the storage position of the corresponding data block in the target data file; the second offset is used to indicate the relative position of the corresponding sub-block of data in the storage location.
As an optional implementation, the data processing apparatus further comprises:
a read request receiving module, configured to receive a data read request, where the data read request includes a target sub data block identifier of a target sub data block and a target data block identifier of a target data block corresponding to the target sub data block;
the meta-information searching module is used for searching target meta-information comprising a target data block identifier in the first meta-information file;
the sub data block searching module is used for reading the target sub data block from the target data file to the memory according to the file identifier of the target data file in the target meta information, the first offset corresponding to the target data block identifier and the second offset corresponding to the target sub data block identifier;
and the decompression module is used for decompressing the target sub-data block in the memory and decoding the target sub-data block to obtain a column group serving as a reading result.
As an optional implementation, the data processing apparatus includes:
and the second storage module is used for storing the meta information in a second meta information file in the persistent storage device.
As an optional implementation manner, the space management file is an order-preserving data structure;
wherein, the nodes in the data structure are used for representing the data files; the node information of the node comprises a first offset and a residual space of the corresponding data file; each node is ordered by a corresponding offset.
In an embodiment of the present application, there is provided an electronic device, including a memory, a processor, and a computer program stored in the memory, where the processor executes the computer program to implement the steps of the data processing method, and compared with the related art, the steps of: by obtaining a data storage request, wherein the data storage request comprises data to be stored, caching the data to be stored in a column group mode, compressing the cached data to obtain data blocks when the cached data in a memory reaches a preset size, searching a free storage space of each data file from a space management file, determining a target data file for storing the data blocks and a storage position of the data blocks in the target data file by combining the size of the data blocks, and moving the data blocks to the storage position for storage, excessive write IO (input/output) generated by a column type storage system in a frequently updated state is avoided, and a large number of small files are avoided.
In an alternative embodiment, an electronic device is provided, as shown in fig. 8, the electronic device 4000 shown in fig. 8 comprising: a processor 4001 and a memory 4003. Processor 4001 is coupled to memory 4003, such as via bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, and the transceiver 4004 may be used for data interaction between the electronic device and other electronic devices, such as transmission of data and/or reception of data. In addition, the number of transceivers 4004 is not limited to one in practical applications, and the structure of the electronic device 4000 does not constitute a limitation on the embodiments of the present application.
The Processor 4001 may be a CPU (Central Processing Unit), a general-purpose Processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other Programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 4001 may also be a combination that performs a computational function, including, for example, a combination of one or more microprocessors, a combination of a DSP and a microprocessor, or the like.
Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 8, but this is not intended to represent only one bus or type of bus.
The Memory 4003 may be a ROM (Read Only Memory) or other types of static storage devices that can store static information and instructions, a RAM (Random Access Memory) or other types of dynamic storage devices that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical Disc storage, optical Disc storage (including Compact Disc, laser Disc, optical Disc, digital versatile Disc, blu-ray Disc, etc.), a magnetic Disc storage medium, other magnetic storage devices, or any other medium that can be used to carry or store a computer program and that can be Read by a computer, without limitation.
The memory 4003 is used for storing computer programs for executing the embodiments of the present application, and is controlled by the processor 4001 to execute. The processor 4001 is used to execute computer programs stored in the memory 4003 to implement the steps shown in the foregoing method embodiments.
Embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, and when being executed by a processor, the computer program may implement the steps and the corresponding content of the foregoing method embodiments.
Embodiments of the present application further provide a computer program product, which includes a computer program, and when the computer program is executed by a processor, the steps and the corresponding contents of the foregoing method embodiments can be implemented.
The terms "first," "second," "third," "fourth," "1," "2," and the like in the description and in the claims of the present application and in the above-described drawings (if any) are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used are interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in other sequences than illustrated or otherwise described herein.
It should be understood that, although each operation step is indicated by an arrow in the flowchart of the embodiment of the present application, the implementation order of the steps is not limited to the order indicated by the arrow. In some implementation scenarios of the embodiments of the present application, the implementation steps in the flowcharts may be performed in other sequences as desired, unless explicitly stated otherwise herein. In addition, some or all of the steps in each flowchart may include multiple sub-steps or multiple stages based on an actual implementation scenario. Some or all of these sub-steps or stages may be performed at the same time, or each of these sub-steps or stages may be performed at different times, respectively. In a scenario where execution times are different, an execution sequence of the sub-steps or the phases may be flexibly configured according to requirements, which is not limited in the embodiment of the present application.
The foregoing is only an optional implementation manner of a part of implementation scenarios in this application, and it should be noted that, for those skilled in the art, other similar implementation means based on the technical idea of this application are also within the protection scope of the embodiments of this application without departing from the technical idea of this application.

Claims (10)

1. A data processing method, comprising:
acquiring a data storage request, wherein the data storage request comprises data to be stored, and caching the data to be stored into a memory in column group form;
when the data cached in the memory reaches a preset size, compressing the cached data to obtain a data block;
searching free storage space of each data file from a space management file, and determining a target data file for storing the data block and a storage position of the data block in the target data file from each data file by combining the size of the data block;
moving the data block to the storage position for storage;
wherein each data file is located in a persistent storage device.
2. The data processing method of claim 1, wherein compressing the cached data to obtain a data block comprises:
serializing and compressing each column group in the memory to obtain corresponding subdata blocks;
and summarizing all the subdata blocks to obtain the data blocks.
3. The data processing method of claim 2, wherein after moving the data block to the storage position for storage, the method further comprises:
determining storage-related meta-information for the data block;
and storing the meta information in a first meta information file in the memory.
4. The data processing method of claim 3, wherein the meta-information comprises at least one of the following information:
a data block identifier of the data block and a corresponding first offset;
a file identifier of the target data file;
the sub data block identifier of each sub data block and a corresponding second offset;
wherein the first offset is used for representing the storage position of the corresponding data block in the target data file; the second offset is used to represent the relative position of the corresponding sub-block of data in the storage location.
5. The data processing method of claim 4, wherein after moving the data block to the storage position for storage, the method further comprises:
receiving a data reading request, wherein the data reading request comprises a target sub data block identifier of a target sub data block and a target data block identifier of a target data block corresponding to the target sub data block;
searching target meta-information comprising the target data block identifier in the first meta-information file;
reading the target sub data block from the target data file to the memory according to a file identifier of the target data file in the target meta information, a first offset corresponding to the target data block identifier and a second offset corresponding to the target sub data block identifier;
and decompressing the target sub data block in the memory, and decoding to obtain a column group serving as a reading result.
6. The data processing method of claim 3, further comprising:
storing the meta information in a second meta information file in the persistent storage device.
7. The data processing method of claim 1, wherein the space management file is an order-preserving data structure;
wherein nodes in the data structure are used to characterize the data file; the node information of the node comprises a first offset and a residual space of the corresponding data file; and sequencing the nodes by corresponding offset.
8. A data processing apparatus, comprising:
the cache module is used for acquiring a data storage request, wherein the data storage request comprises data to be stored, and caching the data to be stored into a memory in a column group form;
the compression module is used for compressing the cached data to obtain a data block when the cached data in the memory reaches a preset size;
a storage location determining module, configured to search a free storage space of each data file from a space management file, and determine, in combination with the size of the data block, a target data file for storing the data block and a storage location of the data block in the target data file from each data file;
the storage module is used for moving the data block to the storage position for storage;
and all the data files are positioned in the persistent storage equipment.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to implement the steps of the method of any of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202210287639.7A 2022-03-22 2022-03-22 Data processing method and device, electronic equipment and readable storage medium Pending CN114691681A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210287639.7A CN114691681A (en) 2022-03-22 2022-03-22 Data processing method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210287639.7A CN114691681A (en) 2022-03-22 2022-03-22 Data processing method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN114691681A true CN114691681A (en) 2022-07-01

Family

ID=82138685

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210287639.7A Pending CN114691681A (en) 2022-03-22 2022-03-22 Data processing method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN114691681A (en)

Similar Documents

Publication Publication Date Title
US9977802B2 (en) Large string access and storage
US8838551B2 (en) Multi-level database compression
US11537578B2 (en) Paged column dictionary
CN107577436B (en) Data storage method and device
US10678779B2 (en) Generating sub-indexes from an index to compress the index
CN110764706A (en) Storage system, data management method, and storage medium
US11221999B2 (en) Database key compression
CN111125033B (en) Space recycling method and system based on full flash memory array
CN113901279B (en) Graph database retrieval method and device
CN111949710A (en) Data storage method, device, server and storage medium
CN104424219A (en) Method and equipment of managing data documents
CN111324665A (en) Log playback method and device
CN111611250A (en) Data storage device, data query method, data query device, server and storage medium
CN116257523A (en) Column type storage indexing method and device based on nonvolatile memory
CN107423425B (en) Method for quickly storing and inquiring data in K/V format
CN111831691B (en) Data reading and writing method and device, electronic equipment and storage medium
CN115114232A (en) Method, device and medium for enumerating historical version objects
CN112306957A (en) Method and device for acquiring index node number, computing equipment and storage medium
CN113253932B (en) Read-write control method and system for distributed storage system
CN114115734A (en) Data deduplication method, device, equipment and storage medium
CN114896250B (en) Key value separated key value storage engine index optimization method and device
CN114297196B (en) Metadata storage method and device, electronic equipment and storage medium
CN114691681A (en) Data processing method and device, electronic equipment and readable storage medium
RU2656721C1 (en) Method of the partially matching large objects storage organization
CN113312414B (en) Data processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination