CN114691681A - Data processing method and device, electronic equipment and readable storage medium - Google Patents

Data processing method and device, electronic equipment and readable storage medium

Info

Publication number
CN114691681A
CN114691681A (application CN202210287639.7A)
Authority
CN
China
Prior art keywords
data
file
data block
storage
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210287639.7A
Other languages
Chinese (zh)
Inventor
韦万
黄俊深
周嘉祺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Pingkai Star Beijing Technology Co ltd
Original Assignee
Pingkai Star Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pingkai Star Beijing Technology Co ltd filed Critical Pingkai Star Beijing Technology Co ltd
Priority to CN202210287639.7A
Publication of CN114691681A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/221 Column-oriented storage; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24552 Database cache management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/0643 Management of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/0644 Management of space entities, e.g. partitions, extents, pools
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655 Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0656 Data buffering arrangements

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application provides a data processing method and device, electronic equipment and a computer readable storage medium, and relates to the technical field of data storage. The method comprises the following steps: acquiring a data storage request, wherein the data storage request comprises data to be stored, and caching the data to be stored into a memory in column group form; when the cached data in the memory reaches a preset size, compressing the cached data to obtain a data block; searching the free storage space of each data file in the space management file, and determining, from the data files and according to the size of the data block, a target data file for storing the data block and a storage position of the data block in the target data file; and moving the data block to the storage position for storage. According to the embodiment of the application, excessive write IO generated by a columnar storage system in a frequently updated state is avoided, and a large number of small files are avoided.

Description

Data processing method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of data storage technologies, and in particular, to a data processing method, an apparatus, an electronic device, and a readable storage medium.
Background
In the field of analytical databases, in order to speed up analysis of data, columnar storage systems are often used to store the data. In a columnar storage system, the data of a table is organized by column, for example with each column stored in its own data file. Therefore, only the relevant columns need to be read during a query, and reading of irrelevant column data is avoided.
When a batch of updated data is written into the columnar storage system, the number of IO operations required equals the number of columns. The existing data updating process therefore suffers from excessively high IOPS and produces many files, which degrades the write capability of the columnar storage system and may even cause a Write Stall.
Disclosure of Invention
Embodiments of the present application provide a data processing method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product, which can solve the above problems in the prior art. The technical scheme is as follows:
according to an aspect of an embodiment of the present application, there is provided a data processing method, including:
acquiring a data storage request, wherein the data storage request comprises data to be stored, and caching the data to be stored into a memory in column group form;
when the cached data in the memory reaches a preset size, compressing the cached data to obtain a data block;
searching the free storage space of each data file from the space management file, and determining a target data file for storing the data block and a storage position of the data block in the target data file from each data file by combining the size of the data block;
moving the data block to a storage position for storage;
wherein each data file is located in a persistent storage device.
As an optional implementation, compressing the cache data to obtain the data block includes:
serializing and compressing each column group in the memory to obtain corresponding subdata blocks;
and summarizing all the subdata blocks to obtain the data blocks.
As an optional implementation, after the data block is moved to the storage location for storage, the method further includes:
determining storage-related meta-information for the data block;
storing the meta information in a first meta information file in the memory.
As an optional implementation, the meta information includes at least one of the following information:
a data block identifier of the data block and a corresponding first offset;
a file identification of the target data file;
the sub data block identifier of each sub data block and the corresponding second offset;
the first offset is used for representing the storage position of the corresponding data block in the target data file; the second offset is used to represent the relative position of the corresponding sub-block of data in the storage location.
As an optional implementation, after the data block is moved to the storage location for storage, the method further includes:
receiving a data reading request, wherein the data reading request comprises a target sub data block identifier of a target sub data block and a target data block identifier of a target data block corresponding to the target sub data block;
searching target meta-information comprising a target data block identifier in a first meta-information file;
reading the target sub data block from the target data file to the memory according to the file identifier of the target data file in the target meta information, the first offset corresponding to the target data block identifier and the second offset corresponding to the target sub data block identifier;
and decompressing the target subdata block in the memory, and decoding to obtain a column group serving as a reading result.
As an optional implementation manner, the method further includes: storing the meta information in a second meta information file in a persistent storage device.
As an alternative embodiment, the space management file is an order-preserving data structure;
wherein, the nodes in the data structure are used for representing the data files; the node information of the node comprises a first offset and a residual space of the corresponding data file; each node is ordered by a corresponding offset.
According to another aspect of embodiments of the present application, there is provided a data processing apparatus including:
the cache module is used for acquiring a data storage request, wherein the data storage request comprises data to be stored, and caching the data to be stored into the memory in column group form;
the compression module is used for compressing the cached data to obtain a data block when the cached data in the memory reaches a preset size;
the storage position determining module is used for searching the free storage space of each data file from the space management file, and determining a target data file for storing the data block and the storage position of the data block in the target data file from each data file by combining the size of the data block;
the storage module is used for moving the data block to a storage position for storage;
wherein each data file is located in a persistent storage device.
As an alternative embodiment, the compression module comprises:
the column group compression submodule is used for serializing and compressing each column group in the memory to obtain a corresponding subdata block;
and the summarizing submodule is used for summarizing all the subdata blocks to obtain the data blocks.
As an optional implementation, the data processing apparatus further comprises:
a meta-information determining module for determining storage-related meta-information of the data block;
the first storage module is used for storing the meta information in a first meta information file in the memory.
As an optional implementation, the meta information includes at least one of the following information:
a data block identifier of the data block and a corresponding first offset;
a file identification of the target data file;
the sub data block identifier of each sub data block and the corresponding second offset;
the first offset is used for representing the storage position of the corresponding data block in the target data file; the second offset is used to indicate the relative position of the corresponding sub-block of data in the storage location.
As an optional implementation manner, the data processing apparatus further includes:
a read request receiving module, configured to receive a data read request, where the data read request includes a target sub data block identifier of a target sub data block and a target data block identifier of a target data block corresponding to the target sub data block;
the meta-information searching module is used for searching target meta-information comprising a target data block identifier in the first meta-information file;
the sub data block searching module is used for reading the target sub data block from the target data file to the memory according to the file identifier of the target data file in the target meta information, the first offset corresponding to the target data block identifier and the second offset corresponding to the target sub data block identifier;
and the decompression module is used for decompressing the target subdata block in the memory and decoding the target subdata block to obtain a column group serving as a reading result.
As an optional implementation, the data processing apparatus includes:
and the second storage module is used for storing the meta information in a second meta information file in the persistent storage device.
As an alternative embodiment, the space management file is an order-preserving data structure;
wherein, the nodes in the data structure are used for representing the data files; the node information of the node comprises a first offset and a residual space of the corresponding data file; each node is ordered by a corresponding offset.
According to another aspect of an embodiment of the present application, there is provided an electronic apparatus including: a memory, a processor and a computer program stored on the memory, the processor executing the computer program to implement the steps of the above method.
According to yet another aspect of embodiments of the present application, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above method.
The technical scheme provided by the embodiment of the application has the following beneficial effects:
by obtaining a data storage request, wherein the data storage request comprises data to be stored, caching the data to be stored in a column group mode, compressing the cached data to obtain data blocks when the cached data in a memory reaches a preset size, searching a free storage space of each data file from a space management file, determining a target data file for storing the data blocks and storage positions of the data blocks in the target data file by combining the size of the data blocks, and moving the data blocks to the storage positions for storage, excessive write IO (input/output) generated by a column type storage system in a frequently updated state is avoided, and a large number of small files are avoided.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
FIG. 1 is a schematic diagram of a columnar storage system for writing data according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a data processing system according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a data processing method according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of determining a target data file according to an embodiment of the present application;
FIG. 5 is a block diagram of a data processing system according to another embodiment of the present application;
fig. 6 is a schematic diagram of a data reading flow of a data processing method according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below in conjunction with the drawings in the present application. It should be understood that the embodiments set forth below in connection with the drawings are exemplary descriptions for explaining technical solutions of the embodiments of the present application, and do not limit the technical solutions of the embodiments of the present application.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the terms "comprises" and/or "comprising," when used in this specification in connection with embodiments of the present application, specify the presence of stated features, information, data, steps, operations, elements, and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components, and/or groups thereof, as embodied in the art. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein indicates at least one of the items defined by the term, e.g., "a and/or B" may be implemented as "a", or as "B", or as "a and B".
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, a schematic diagram of writing data in a columnar storage system is exemplarily shown. The columnar storage system includes 3 column groups, namely ColA, ColB and ColC. Whenever new data needs to be written into the system, a new data structure is created at the end of each of the 3 column groups, and when the new data is written into the newly created data structures, a row of data Row3 is added to the columnar storage system.
In the prior art, a new file is generated each time data is updated, and under a high update rate a large number of small files are generated, which slows down query performance; in order to avoid too many small files, data merging needs to be performed continuously in the background, which consumes a large amount of CPU and IO and affects system stability.
The present application provides a data processing method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product, which are intended to solve the above technical problems in the prior art.
The technical solutions of the embodiments of the present application and the technical effects produced by the technical solutions of the present application will be described below through descriptions of several exemplary embodiments. It should be noted that the following embodiments may be referred to, referred to or combined with each other, and the description of the same terms, similar features, similar implementation steps and the like in different embodiments is not repeated.
The data processing method provided by the present application can be applied to the data processing system of fig. 2, which includes the terminal 110 and the server 120. It should be understood that the data storage system provided by the embodiment of the present application may be a background data storage system of the target application, and the server 120 in the data storage system may communicate with the terminal device 110 running the target application.
Illustratively, after the data to be stored is generated by the target application running on the terminal device 110, the terminal device 110 sends a data storage request to the server 120, where the data storage request includes the data to be stored. The server 120 includes an internal memory and a persistent storage device, such as a disk, a Solid State Drive (SSD) or an NVMe solid state drive (NVMe SSD), i.e., a device that, unlike the internal memory, stores data persistently.
The server 120 caches the data to be stored in the memory in column group form. When the cached data in the memory reaches a preset size, the server 120 compresses the cached data to obtain a data block, and searches the space management file for the free storage space of each data file on the persistent storage device. It should be understood that the space management file is a file that records the free storage space of each data file, and a data file is a file used for storing the database. The server 120 further determines, according to the size of the data block, a target data file for storing the data block and a storage location of the data block in the target data file, moves the data block to the storage location for storage, and stores the storage-related meta information of the data block in a first meta information file in the memory, so that the target data can be obtained quickly when data is read.
The meta information in the embodiment of the present application includes at least one of the following: a data block identifier of the data block and a corresponding first offset; a file identifier of the target data file; and the sub data block identifier of each sub data block and the corresponding second offset. The first offset is used for representing the storage position of the corresponding data block in the target data file; the second offset is used to represent the relative position of the corresponding sub data block in the storage position.
When a target application running on the terminal device 110 needs to read data, the terminal device 110 sends a data reading request to the server 120, where the data reading request includes a target sub data block identifier of a target sub data block and a target data block identifier of a target data block corresponding to the target sub data block. The server 120 searches the first meta-information file for target meta-information including the identification of the target data block; reading target sub data blocks from a target data file of the persistent storage device to a memory according to a file identifier of the target data file in the target meta information, a first offset corresponding to the target data block identifier and a second offset corresponding to the target sub data block identifier; and decompressing the target subdata block in the memory, and decoding to obtain a column group serving as a reading result.
Referring to fig. 3, a schematic flow chart of a data processing method according to an embodiment of the present application is exemplarily shown, and as shown in the drawing, the method includes:
s101, a data storage request is obtained, the data storage request comprises data to be stored, and the data to be stored are cached in a memory in a column group mode.
In the embodiment of the application, the storage request for the data to be stored is processed in the memory: the data to be stored is parsed into a columnar storage form and cached in the memory. For example, if the memory contains 3 column groups and each column group contains 3 pieces of sub-data at a certain moment, then when a piece of data to be stored is subsequently obtained, it is split into 3 pieces of sub-data, and the 3 pieces of sub-data are appended to the ends of the 3 column groups in sequence.
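The following minimal sketch (in Python, with hypothetical names such as MemTable and append_row that are not taken from the patent) illustrates this caching step: an incoming row is split into its column values and each value is appended to the tail of the corresponding column array.

```python
# Hypothetical sketch of caching rows into per-column arrays (column groups).
class MemTable:
    def __init__(self, column_names):
        # One array per column group; a write only appends to array tails.
        self.columns = {name: [] for name in column_names}

    def append_row(self, row):
        # Split the row into sub-data and append each piece to the end
        # of its column group, in order.
        for name, value in zip(self.columns, row):
            self.columns[name].append(value)

    def cached_rows(self):
        # Number of cached rows, used to decide when the preset size is reached.
        return len(next(iter(self.columns.values()), []))


mt = MemTable(["ColA", "ColB", "ColC"])
mt.append_row(("a3", "b3", "c3"))  # corresponds to adding Row3 in fig. 1
```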
S102, when the data cached in the memory reaches a preset size, compressing the cached data to obtain a data block.
In the embodiment of the application, a data volume threshold is set for the data cached in the memory, and when the cached data reaches the data volume threshold, the data is compressed. In this way, not too much data needs to be accumulated, and the timeliness of data updating is guaranteed.
S103, searching the free storage space of each data file from the space management file, and determining a target data file for storing the data block and the storage position of the data block in the target data file from each data file according to the size of the data block.
It should be noted that the space management file in the embodiment of the present application may be located in the memory, so as to improve the efficiency of obtaining the free storage space of each data file, while each data file is located in a persistent storage device, such as a magnetic disk, a Solid State Drive (SSD) or an NVMe solid state drive (NVMe SSD). The space management file is used to record the free storage space of each data file, and the size of the free storage space determines the size of the data block that the data file can subsequently store. Therefore, after the free storage space of each data file is acquired from the space management file, a data file capable of accommodating the data block can be selected according to the size of the data block, and when multiple data files can accommodate the data block, the target data file may be determined randomly.
And S104, moving the data block to a storage position for storage.
According to the data processing method, the data storage request is obtained, the data to be stored are cached in a column group mode, when the cached data in the memory reach the preset size, the cached data are compressed to obtain the data blocks, the free storage space of each data file is searched from the space management file, the size of each data block is combined, the target data file used for storing the data blocks and the storage position of each data block in the target data file are determined, the data blocks are moved to the storage position to be stored, excessive write IO (input/output) of a column type storage system in a frequently updated state is avoided, and a large number of small files are avoided.
On the basis of the foregoing embodiments, as an optional embodiment, compressing the cache data to obtain the data block includes:
serializing and compressing each column group in the memory to obtain corresponding subdata blocks;
and summarizing all the subdata blocks to obtain the data blocks.
Taking fig. 1 as an example, fig. 1 has 3 column groups in total, and 3 sub data blocks can be obtained by serializing and compressing each column group. The 3 sub data blocks are then packaged to obtain the data block.
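As a hedged illustration only (the patent does not fix the serialization or compression codec; pickle and zlib are stand-ins), each column group is serialized and compressed into a sub data block, and the sub data blocks are concatenated into one data block (Page) together with the per-column offsets needed later for reading a single column:

```python
import pickle
import zlib

def build_page(columns):
    """Serialize and compress each column group into a sub data block, then
    concatenate the sub data blocks into one data block (Page).
    Returns the page bytes and the relative offset/length of each column."""
    body = b""
    column_offsets = {}
    for name, values in columns.items():
        sub_block = zlib.compress(pickle.dumps(values))
        # Relative position of this sub data block inside the Page (second offset).
        column_offsets[name] = (len(body), len(sub_block))
        body += sub_block
    return body, column_offsets


page_bytes, col_offsets = build_page({"ColA": ["a1", "a2"], "ColB": ["b1", "b2"]})
```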
On the basis of the above embodiments, as an alternative embodiment, after the data block is moved to the storage location for storage, the method further includes:
determining storage-related meta-information for the data block;
specifically, the meta information in the embodiment of the present application may include the following information:
the data block identification of the data block and a corresponding first offset, wherein the first offset is used for representing the storage position of the corresponding data block in the target data file;
a file identification of the target data file;
the sub-data block identifier of each sub-data block and a corresponding second offset, wherein the second offset is used for indicating the relative position of the corresponding sub-data block in the storage location.
Storing the meta information in a first meta information file in the memory. The meta information is cached in the memory, which facilitates quick access and ensures reading efficiency. Specifically, in the embodiment of the present application, the data block identifier of the data block may be used as the key, and the other information in the meta information except the data block identifier may be used as the value, to store the meta information.
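A small sketch of this key-value layout, assuming a plain in-memory dictionary keyed by the data block identifier (the field names are illustrative, not taken from the patent):

```python
# First meta information file cached in memory: data block identifier -> value.
page_directory = {}

def record_page_meta(page_id, file_id, first_offset, page_size, column_offsets):
    # The data block identifier is the key; all other meta information is the value.
    page_directory[page_id] = {
        "file_id": file_id,                # file identifier of the target data file
        "first_offset": first_offset,      # position of the data block in that file
        "page_size": page_size,            # size of the data block
        "column_offsets": column_offsets,  # second offsets of the sub data blocks
    }

record_page_meta(page_id=7, file_id=2, first_offset=400, page_size=200,
                 column_offsets={"ColA": (0, 64), "ColB": (64, 64)})
```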
On the basis of the above embodiments, as an optional embodiment, the space management file is an order-preserving data structure, for example a red-black tree (a self-balancing binary search tree). Nodes in the data structure are used to represent the data files; the node information of a node comprises the first offset and the remaining space of the corresponding data file; and the nodes are ordered by their offsets.
Referring to fig. 4, which schematically illustrates a flowchart of determining a target data file according to an embodiment of the present application, the red-black tree shown in the figure includes 3 nodes: the first offset of node 1 is 200 bytes and its remaining space is 50 bytes; the first offset of node 2 is 0 bytes and its remaining space is 100 bytes; the first offset of node 3 is 400 bytes and its remaining space is 300 bytes. If the size of the data block is 200 bytes, the tree is traversed downwards from the root node (node 1); it is determined that the remaining space of node 3 is sufficient to store the data block, so the data block is moved to the data file corresponding to node 3, and the first offset and remaining space of that data file are updated in the space management file. Correspondingly, the first offset recorded by node 3 becomes 600 (400+200) and the remaining space becomes 100 (300-200).
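The allocation walk above can be sketched as follows. Python's standard library has no red-black tree, so this illustrative version keeps the free-space entries in a plain list ordered by offset; the bookkeeping (find an entry with enough remaining space, advance its offset, shrink its remaining space) follows the worked example, but the code and names are assumptions, not the patented implementation.

```python
# Free-space entries corresponding to nodes 2, 1 and 3 of fig. 4,
# ordered by offset: (first offset, remaining space).
free_space = [(0, 100), (200, 50), (400, 300)]

def allocate(space_map, block_size):
    """Find a region large enough for the data block, return its offset,
    and update that region's offset and remaining space in place."""
    for i, (offset, remaining) in enumerate(space_map):
        if remaining >= block_size:
            space_map[i] = (offset + block_size, remaining - block_size)
            return offset
    raise RuntimeError("no data file has enough free space")

position = allocate(free_space, 200)
print(position)    # 400: the block is stored at offset 400 of that data file
print(free_space)  # [(0, 100), (200, 50), (600, 100)]
```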
Referring to fig. 5, which schematically shows a structural diagram of a data processing system according to another embodiment of the present application, as shown in the figure, the overall structure of the system is divided into three parts, including a memory table MemTable101, a data block storage module BlobStore102, and a metadata storage module PageMetaStore 103:
MemTable 101 is a memory structure used to store the most recently written data. The data in the MemTable has been parsed and is stored in a columnar format (called column groups). The MemTable data is persisted (Flush) to BlobStore 102 when it reaches a predetermined size.
Before Flush, the data of MemTable 101 is first serialized and compressed to obtain a data block, which is called a Page. Each Page has a unique ID, i.e., the data block identifier Page ID. It should be noted that the size of a Page in the embodiment of the present application is determined by the size of the MemTable, so the Page size is not fixed.
As an alternative embodiment, MemTable 101 adopts a columnar data form, i.e., an array is used to store the data of each column, with one array corresponding to one column (a column array). Writing data only requires appending (Append) to the end of the arrays. MemTable 101 includes two parts, an Immutable (a non-modifiable table, which is a temporary form during the Flush phase of a Mutable) and a Mutable (a modifiable table, into which update data can be written normally). Only the Mutable can be written into; the Immutable is the intermediate state of a Mutable during the Flush process.
BlobStore 102 is used to actually store the content of Pages. BlobStore 102 stores data using data files (BlobFile) 1022 on a persistent storage device, and the free space (Free Space) of each BlobFile 1022 is recorded by a space management file (SpaceMap) 1021. BlobStore 102 may contain multiple pairs of SpaceMap 1021 and BlobFile 1022, and a BlobFile 1022 has an upper capacity limit.
As an alternative embodiment, the SpaceMap 1021 adopts the data structure of a red-black tree (a self-balancing binary search tree). Each node in the tree records a piece of free space; the node information comprises the Offset and the remaining space Length, and the nodes are sorted by Offset.
PageMetaStore 103 stores the meta information (PageMeta) of each Page. The PageMeta is cached in memory using a first meta information file (PageDirectory) 1031 to facilitate fast access, while the PageMeta is also persistently stored in a second meta information file (PageMeta WAL File) 1032 on a persistent storage device.
As an alternative embodiment, the PageDirectory1031 uses a data structure of a hash table to store meta information, so as to facilitate fast obtaining meta information corresponding to a Page according to a Page id.
The processing flow of the data processing system shown in fig. 5 includes:
s201, processing the write request of the data to be stored in the memory, and caching the data to be stored in the MemTable 101.
A MemTable is internally subdivided into two parts: a Mutable and an Immutable. When the Mutable has been filled with written data, or the upper layer is manually invoked, the current Mutable is converted into an Immutable and a new Mutable is generated for future writes; the just-generated Immutable is then serialized and compressed into a Page.
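A minimal sketch of this Mutable/Immutable switch, with illustrative names and a simple lock standing in for whatever concurrency control the real system uses:

```python
import threading

class DoubleBufferedMemTable:
    """Writes go to the Mutable; a flush freezes it into an Immutable and
    installs a fresh Mutable for future writes (illustrative sketch only)."""
    def __init__(self, column_names):
        self.column_names = list(column_names)
        self.mutable = {name: [] for name in self.column_names}
        self.immutable = None
        self._lock = threading.Lock()

    def write_row(self, row):
        with self._lock:
            for name, value in zip(self.column_names, row):
                self.mutable[name].append(value)

    def freeze(self):
        # Triggered when the Mutable is full or the upper layer requests a flush.
        with self._lock:
            self.immutable = self.mutable
            self.mutable = {name: [] for name in self.column_names}
        # The returned Immutable is then serialized and compressed into a Page.
        return self.immutable
```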
Since a feature of columnar storage is that it supports reading only part of the columns, an index of each column is required within a Page for fast retrieval of the Offset of a particular column.
S202, according to the size of the Page, applying for proper Free Space from the SpaceMap1021 in the BlobStore102, and generating the PageMeta.
The contents of the PageMeta include the BlobFile ID, a first offset (Page Offset), the data block size (Page Size), a second offset (Column Offset) for each column in the Page, and so on.
S203, the Page is written into the specific position corresponding to the BlobFile 1022.
S204, the PageMeta is written into the PageMeta WAL File1032 of the persistent storage device.
S205, the PageMeta is cached in the PageDirectory 1031 in memory. Because the meta information is also stored in the persistent storage device, it can be restored to the in-memory PageDirectory during startup, and the PageID, Offset and Size information in the PageMeta WAL File can also be used to restore the space management file SpaceMap, providing a backup for coping with a system restart.
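A restart-recovery sketch under the same assumptions as the earlier snippets (the record layout and names are hypothetical): replaying the persisted PageMeta records rebuilds both the in-memory PageDirectory and a view of the space occupied in each data file, from which the SpaceMap can be restored.

```python
def recover(wal_records, file_capacity):
    """Rebuild the in-memory PageDirectory and a per-file used-space view from
    persisted PageMeta records of the form (page_id, file_id, offset, size)."""
    page_directory = {}
    used = {}  # file_id -> list of (offset, size) regions occupied by Pages
    for page_id, file_id, offset, size in wal_records:
        page_directory[page_id] = {"file_id": file_id,
                                   "first_offset": offset,
                                   "page_size": size}
        used.setdefault(file_id, []).append((offset, size))
    # Everything in a file that is not covered by `used` (up to file_capacity)
    # is free space; a real SpaceMap would be rebuilt from that information.
    return page_directory, used

directory, used_regions = recover([(7, 2, 400, 200)], file_capacity=1024)
```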
Referring to fig. 6, a data reading flow of the data processing method according to the embodiment of the present application is exemplarily shown, including:
s301, receiving a data reading request, wherein the data reading request comprises a target sub data block identifier of a target sub data block and a target data block identifier of a target data block corresponding to the target sub data block;
s302, target meta-information including the target data block identification is searched in the first meta-information file.
When the first meta-information file is stored, the target data block identifier can be used as the key and the other meta-information can be stored as the value; in the data reading stage, the key corresponding to the target data block identifier can be searched for in the first meta-information file, so that the target meta-information can be quickly located.
S303, reading the target sub data block from the target data file to the memory according to the file identifier of the target data file in the target meta information, the first offset corresponding to the target data block identifier and the second offset corresponding to the target sub data block identifier.
Specifically, the position of the target data file is determined from the persistent storage device according to a file identifier of the target data file in the target meta information, the storage position of the target data in the persistent storage device is further determined according to a first offset corresponding to the target data block identifier, the storage position of the target sub data block is further determined according to a second offset corresponding to the target sub data block identifier, and the target sub data block is read from the target data file to the memory according to the storage position.
S304, decompressing the target sub data block in the memory, and decoding the target sub data block into a column group as a reading result.
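Putting the read steps together in one hedged sketch (it reuses the pickle-plus-zlib encoding assumed in the write sketch, and the file naming is illustrative): look up the target meta information by data block identifier, seek to the Page's first offset plus the column's second offset, read the sub data block, then decompress and decode it into a column group.

```python
import pickle
import zlib

def read_column(page_directory, page_id, column_name, data_dir="."):
    """Read one sub data block (a column group) of one Page."""
    meta = page_directory[page_id]                   # S302: locate target meta info
    rel_offset, length = meta["column_offsets"][column_name]
    path = f"{data_dir}/blob_{meta['file_id']}.dat"  # illustrative file naming
    with open(path, "rb") as f:
        f.seek(meta["first_offset"] + rel_offset)    # S303: first + second offset
        sub_block = f.read(length)
    return pickle.loads(zlib.decompress(sub_block))  # S304: decompress and decode
```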
An embodiment of the present application provides a data processing apparatus, and as shown in fig. 7, the data processing apparatus may include: a cache module 101, a compression module 102, a storage location determination module 103, and a storage module 104, wherein,
the cache module 101 is configured to obtain a data storage request, where the data storage request includes data to be stored, and cache the data to be stored in a memory in a column group form;
the compression module 102 is configured to compress cached data to obtain a data block when the cached data in the memory reaches a preset size;
a storage location determining module 103, configured to search a free storage space of each data file from the space management file, and determine, from each data file, a target data file for storing the data block and a storage location of the data block in the target data file in combination with the size of the data block;
the storage module 104 is configured to move the data block to a storage location for storage;
wherein each data file is located in a persistent storage device.
The apparatus of the embodiment of the present application may execute the method provided by the embodiment of the present application, and the implementation principle is similar, the actions executed by the modules in the apparatus of the embodiments of the present application correspond to the steps in the method of the embodiments of the present application, and for the detailed functional description of the modules of the apparatus, reference may be specifically made to the description in the corresponding method shown in the foregoing, and details are not repeated here.
As an alternative embodiment, the compression module comprises:
the column group compression submodule is used for serializing and compressing each column group in the memory to obtain a corresponding subdata block;
and the summarizing submodule is used for summarizing all the subdata blocks to obtain the data blocks.
As an optional implementation, the data processing apparatus further comprises:
a meta-information determining module for determining storage-related meta-information for the data block;
the first storage module is used for storing the meta information in a first meta information file in the memory.
As an optional implementation, the meta information includes at least one of the following information:
a data block identifier of the data block and a corresponding first offset;
a file identification of the target data file;
the sub data block identifier of each sub data block and the corresponding second offset;
the first offset is used for representing the storage position of the corresponding data block in the target data file; the second offset is used to indicate the relative position of the corresponding sub-block of data in the storage location.
As an optional implementation, the data processing apparatus further comprises:
a read request receiving module, configured to receive a data read request, where the data read request includes a target sub data block identifier of a target sub data block and a target data block identifier of a target data block corresponding to the target sub data block;
the meta-information searching module is used for searching target meta-information comprising a target data block identifier in the first meta-information file;
the sub data block searching module is used for reading the target sub data block from the target data file to the memory according to the file identifier of the target data file in the target meta information, the first offset corresponding to the target data block identifier and the second offset corresponding to the target sub data block identifier;
and the decompression module is used for decompressing the target sub-data block in the memory and decoding the target sub-data block to obtain a column group serving as a reading result.
As an optional implementation, the data processing apparatus includes:
and the second storage module is used for storing the meta information in a second meta information file in the persistent storage device.
As an optional implementation manner, the space management file is an order-preserving data structure;
wherein, the nodes in the data structure are used for representing the data files; the node information of the node comprises a first offset and a residual space of the corresponding data file; each node is ordered by a corresponding offset.
In an embodiment of the present application, there is provided an electronic device, including a memory, a processor, and a computer program stored in the memory, where the processor executes the computer program to implement the steps of the data processing method, and compared with the related art, the steps of: by obtaining a data storage request, wherein the data storage request comprises data to be stored, caching the data to be stored in a column group mode, compressing the cached data to obtain data blocks when the cached data in a memory reaches a preset size, searching a free storage space of each data file from a space management file, determining a target data file for storing the data blocks and a storage position of the data blocks in the target data file by combining the size of the data blocks, and moving the data blocks to the storage position for storage, excessive write IO (input/output) generated by a column type storage system in a frequently updated state is avoided, and a large number of small files are avoided.
In an alternative embodiment, an electronic device is provided, as shown in fig. 8, the electronic device 4000 shown in fig. 8 comprising: a processor 4001 and a memory 4003. Processor 4001 is coupled to memory 4003, such as via bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, and the transceiver 4004 may be used for data interaction between the electronic device and other electronic devices, such as transmission of data and/or reception of data. In addition, the number of transceivers 4004 is not limited to one in practical applications, and the structure of the electronic device 4000 does not constitute a limitation on the embodiments of the present application.
The Processor 4001 may be a CPU (Central Processing Unit), a general-purpose Processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other Programmable logic device, a transistor logic device, a hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor 4001 may also be a combination that performs a computational function, including, for example, a combination of one or more microprocessors, a combination of a DSP and a microprocessor, or the like.
Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 8, but this is not intended to represent only one bus or type of bus.
The Memory 4003 may be a ROM (Read Only Memory) or other types of static storage devices that can store static information and instructions, a RAM (Random Access Memory) or other types of dynamic storage devices that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical Disc storage, optical Disc storage (including Compact Disc, laser Disc, optical Disc, digital versatile Disc, blu-ray Disc, etc.), a magnetic Disc storage medium, other magnetic storage devices, or any other medium that can be used to carry or store a computer program and that can be Read by a computer, without limitation.
The memory 4003 is used for storing computer programs for executing the embodiments of the present application, and is controlled by the processor 4001 to execute. The processor 4001 is used to execute computer programs stored in the memory 4003 to implement the steps shown in the foregoing method embodiments.
Embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, and when being executed by a processor, the computer program may implement the steps and the corresponding content of the foregoing method embodiments.
Embodiments of the present application further provide a computer program product, which includes a computer program, and when the computer program is executed by a processor, the steps and the corresponding contents of the foregoing method embodiments can be implemented.
The terms "first," "second," "third," "fourth," "1," "2," and the like in the description and in the claims of the present application and in the above-described drawings (if any) are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used are interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in other sequences than illustrated or otherwise described herein.
It should be understood that, although each operation step is indicated by an arrow in the flowchart of the embodiment of the present application, the implementation order of the steps is not limited to the order indicated by the arrow. In some implementation scenarios of the embodiments of the present application, the implementation steps in the flowcharts may be performed in other sequences as desired, unless explicitly stated otherwise herein. In addition, some or all of the steps in each flowchart may include multiple sub-steps or multiple stages based on an actual implementation scenario. Some or all of these sub-steps or stages may be performed at the same time, or each of these sub-steps or stages may be performed at different times, respectively. In a scenario where execution times are different, an execution sequence of the sub-steps or the phases may be flexibly configured according to requirements, which is not limited in the embodiment of the present application.
The foregoing is only an optional implementation manner of a part of implementation scenarios in this application, and it should be noted that, for those skilled in the art, other similar implementation means based on the technical idea of this application are also within the protection scope of the embodiments of this application without departing from the technical idea of this application.

Claims (10)

1. A data processing method, comprising:
acquiring a data storage request, wherein the data storage request comprises data to be stored, and caching the data to be stored into a memory in column group form;
when the data cached in the memory reaches a preset size, compressing the cached data to obtain a data block;
searching free storage space of each data file from a space management file, and determining a target data file for storing the data block and a storage position of the data block in the target data file from each data file by combining the size of the data block;
moving the data block to the storage position for storage;
wherein each data file is located in a persistent storage device.
2. The data processing method of claim 1, wherein compressing the cached data to obtain a data block comprises:
serializing and compressing each column group in the memory to obtain corresponding subdata blocks;
and summarizing all the subdata blocks to obtain the data blocks.
3. The data processing method of claim 2, wherein after moving the data block to the storage position for storage, the method further comprises:
determining storage-related meta-information for the data block;
and storing the meta information in a first meta information file in the memory.
4. The data processing method of claim 3, wherein the meta-information comprises at least one of the following information:
a data block identifier of the data block and a corresponding first offset;
a file identifier of the target data file;
the sub data block identifier of each sub data block and a corresponding second offset;
wherein the first offset is used for representing the storage position of the corresponding data block in the target data file; the second offset is used to represent the relative position of the corresponding sub-block of data in the storage location.
5. The data processing method of claim 4, wherein after moving the data block to the storage position for storage, the method further comprises:
receiving a data reading request, wherein the data reading request comprises a target sub data block identifier of a target sub data block and a target data block identifier of a target data block corresponding to the target sub data block;
searching target meta-information comprising the target data block identifier in the first meta-information file;
reading the target sub data block from the target data file to the memory according to a file identifier of the target data file in the target meta information, a first offset corresponding to the target data block identifier and a second offset corresponding to the target sub data block identifier;
and decompressing the target sub data block in the memory, and decoding to obtain a column group serving as a reading result.
6. The data processing method of claim 3, further comprising:
storing the meta information in a second meta information file in the persistent storage device.
7. The data processing method of claim 1, wherein the space management file is an order-preserving data structure;
wherein nodes in the data structure are used to characterize the data file; the node information of the node comprises a first offset and a residual space of the corresponding data file; and sequencing the nodes by corresponding offset.
8. A data processing apparatus, comprising:
the cache module is used for acquiring a data storage request, wherein the data storage request comprises data to be stored, and caching the data to be stored into a memory in a column group form;
the compression module is used for compressing the cached data to obtain a data block when the cached data in the memory reaches a preset size;
a storage location determining module, configured to search a free storage space of each data file from a space management file, and determine, in combination with the size of the data block, a target data file for storing the data block and a storage location of the data block in the target data file from each data file;
the storage module is used for moving the data block to the storage position for storage;
and all the data files are positioned in the persistent storage equipment.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to implement the steps of the method of any of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202210287639.7A 2022-03-22 2022-03-22 Data processing method and device, electronic equipment and readable storage medium Pending CN114691681A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210287639.7A CN114691681A (en) 2022-03-22 2022-03-22 Data processing method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210287639.7A CN114691681A (en) 2022-03-22 2022-03-22 Data processing method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN114691681A true CN114691681A (en) 2022-07-01

Family

ID=82138685

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210287639.7A Pending CN114691681A (en) 2022-03-22 2022-03-22 Data processing method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN114691681A (en)

Similar Documents

Publication Publication Date Title
US9977802B2 (en) Large string access and storage
US8838551B2 (en) Multi-level database compression
US11537578B2 (en) Paged column dictionary
CN107577436B (en) Data storage method and device
US10678779B2 (en) Generating sub-indexes from an index to compress the index
CN110764706A (en) Storage system, data management method, and storage medium
US11221999B2 (en) Database key compression
CN111125033B (en) Space recycling method and system based on full flash memory array
CN113901279B (en) Graph database retrieval method and device
CN111949710A (en) Data storage method, device, server and storage medium
CN104424219A (en) Method and equipment of managing data documents
CN111324665A (en) Log playback method and device
CN111611250A (en) Data storage device, data query method, data query device, server and storage medium
CN116257523A (en) Column type storage indexing method and device based on nonvolatile memory
CN107423425B (en) Method for quickly storing and inquiring data in K/V format
CN111831691B (en) Data reading and writing method and device, electronic equipment and storage medium
CN115114232A (en) Method, device and medium for enumerating historical version objects
CN112306957A (en) Method and device for acquiring index node number, computing equipment and storage medium
CN113253932B (en) Read-write control method and system for distributed storage system
CN114115734A (en) Data deduplication method, device, equipment and storage medium
CN114896250B (en) Key value separated key value storage engine index optimization method and device
CN114297196B (en) Metadata storage method and device, electronic equipment and storage medium
CN114691681A (en) Data processing method and device, electronic equipment and readable storage medium
RU2656721C1 (en) Method of the partially matching large objects storage organization
CN113312414B (en) Data processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination