CN117348793A - Data reading method, data loading device and communication system - Google Patents

Data reading method, data loading device and communication system

Info

Publication number
CN117348793A
Authority
CN
China
Prior art keywords
data
loading device
file
query condition
reading
Prior art date
Legal status
Pending
Application number
CN202211165071.8A
Other languages
Chinese (zh)
Inventor
何洋
赵玥
王锋
罗先强
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to PCT/CN2023/087573 (published as WO2024001413A1)
Publication of CN117348793A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/0608 Saving storage space on storage systems
    • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0638 Organizing or formatting or addressing of data
    • G06F 3/064 Management of blocks
    • G06F 3/0644 Management of space entities, e.g. partitions, extents, pools
    • G06F 3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F 3/0671 In-line storage system

Abstract

The application provides a data reading method, a data loading device, and a communication system. In the method, the data loading device caches, in advance and in an asynchronous manner, the data blocks that satisfy a query condition, that is, the data blocks are pre-read. When a data reading device subsequently requests these data blocks, the data loading device can obtain them locally and return them to the data reading device, which increases the reading speed of the data blocks and allows them to be returned to the data reading device quickly.

Description

Data reading method, data loading device and communication system
The present application claims priority to Chinese patent application No. 202210749451.X, entitled "method for accelerated prefetching of file in line", filed with the China National Intellectual Property Administration on June 28, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of wireless communications technologies, and in particular, to a data reading method, a data loading device, and a communication system.
Background
In the big data age, big data offline analysis systems such as data warehouses and data lakes are widely used in various production environments. Driven by the characteristics of online analytical processing (OLAP) queries, columnar storage formats such as the optimized row columnar (ORC) format have emerged. Columnar storage can significantly improve query performance, offers a higher compression ratio, and saves storage space.
For a file stored in a columnar format, how to quickly read a data block in the file is a problem that needs to be solved.
Disclosure of Invention
The application provides a data reading method, a data loading device and a communication system, which are used for improving the reading speed of a data block.
In a first aspect, embodiments of the present application provide a data reading method, which may be performed by a data loading device. The method comprises the following steps: the method comprises the steps that a data loading device receives a first request message sent by a first data reading device, wherein the first request message comprises file information and a first query condition, the file information is used for indicating a target file, and the first query condition is used for indicating the characteristics of a data block to be read in the target file; the data loading device reads a plurality of data blocks meeting the first query condition in the target file from a memory according to the file information; the data loading device caches the plurality of data blocks.
According to this scheme, the data blocks that satisfy the query condition are cached in the data loading device in advance in an asynchronous manner, that is, the data blocks are pre-read. When the first data reading device subsequently requests these data blocks, the data loading device can obtain them locally and return them to the first data reading device, which increases the reading speed of the data blocks and allows them to be returned to the first data reading device quickly.
In a possible implementation method, the data loading device receives a second request message sent by the first data reading device, where the second request message is used to read a data block in the target file that meets the first query condition; the data loading device converts a first data block in the plurality of data blocks into a second data block conforming to the format requirement of the first data reading device; the data loading device sends the second data block to the first data reading device.
According to this scheme, the data loading device obtains the pre-read data block locally, which facilitates fast reading of the data block.
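Purely as an illustration of the message structures described above, the following minimal Java sketch models the first and second request messages; the class and field names (FirstRequest, SecondRequest, fileInfo, queryCondition) are hypothetical and are not defined by this application.

```java
// Hypothetical message types for the first aspect; names and fields are illustrative only.
public final class RequestMessages {
    /** First request: asks the data loading device to pre-read the matching data blocks. */
    public record FirstRequest(String fileInfo, String queryCondition) { }

    /** Second request: asks for the next pre-read data block of the same target file. */
    public record SecondRequest(String fileInfo, String queryCondition) { }

    public static void main(String[] args) {
        FirstRequest prefetch = new FirstRequest("/warehouse/orders.orc", "column5 > 100");
        SecondRequest read = new SecondRequest("/warehouse/orders.orc", "column5 > 100");
        System.out.println(prefetch + " then " + read);
    }
}
```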
In a possible implementation method, the data loading device receives a third request message sent by a second data reading device, where the third request message is used to read a data block in the target file that meets a second query condition, and the second query condition is used to indicate a feature of the data block to be read in the target file; the data loading device converts a third data block in the plurality of data blocks into a fourth data block which meets the format requirement of the second data reading device, wherein the third data block meets the second query condition; the data loading means sends the fourth data block to the second data reading means.
According to this scheme, the data blocks pre-read by the data loading device can be shared by the first data reading device and the second data reading device, and the data loading device only needs to cache one copy of each shared data block, which saves cache space.
In a possible implementation method, the data loading device reads index data of the data blocks in the target file from the memory or the data loading device according to the file information, wherein the index data is used for indicating characteristics of the data blocks in the target file; and the data loading device reads the plurality of data blocks meeting the first query condition in the target file from the memory according to the first query condition and the index data.
According to the scheme, the data loading device can accurately acquire the characteristics of each data block by reading the index data of the data block in the target file, and further can accurately acquire a plurality of data blocks meeting the first query condition.
In a possible implementation method, the data loading device includes a data processor; and the reading, by the data loading device according to the file information, of the index data of the data block in the target file from the memory includes: the data loading device reads the index data from the high-performance layer of the memory according to the file information.
According to the scheme, the index data of the data block are stored in the high-performance layer, so that the speed of reading the index data is improved, and the reading speed of the data block is improved.
In a possible implementation method, the data loading device includes a computing storage device; and the reading, by the data loading device according to the file information, of the index data of the data block in the target file from the data loading device includes: the data loading device reads the index data from the high-performance layer of the computing storage device according to the file information.
According to the scheme, the index data of the data block are stored in the high-performance layer, so that the speed of reading the index data is improved, and the reading speed of the data block is improved.
In a second aspect, an embodiment of the present application provides a data loading device, where the device has a function of implementing any implementation method of the first aspect. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the function described above.
In a third aspect, embodiments of the present application provide a data loading device, including a processor and an interface circuit, where the processor is configured to communicate with other devices through the interface circuit and perform any implementation method of the first aspect. There may be one or more processors.
In a fourth aspect, an embodiment of the present application provides a data loading device, including a processor coupled to a memory, where the processor is configured to invoke a program stored in the memory to perform any implementation method of the first aspect. The memory may be located inside or outside the device. There may be one or more processors.
In a fifth aspect, embodiments of the present application provide a data loading apparatus, including a processor and a memory; the memory is configured to store computer instructions that, when executed by the apparatus, cause the apparatus to perform any of the implementations of the first aspect described above.
In a sixth aspect, embodiments of the present application provide a data loading apparatus, including a unit or means (means) for performing the steps of any implementation method in the first aspect.
In a seventh aspect, embodiments of the present application further provide a computer readable storage medium having instructions stored therein that, when executed on a data loading device, cause any implementation method of the first aspect described above to be performed.
In an eighth aspect, embodiments of the present application further provide a computer program product comprising a computer program or instructions which, when executed by data loading means, cause any implementation of the above described first aspect to be performed.
In a ninth aspect, embodiments of the present application further provide a chip system, including: a processor configured to perform any implementation method of the first aspect.
In a tenth aspect, embodiments of the present application further provide a communication system, including a first data reading device and a data loading device. The first data reading device is used for sending a first request message to the data loading device, wherein the first request message comprises file information and first query conditions, the file information is used for indicating a target file, and the first query conditions are used for indicating the characteristics of a data block to be read in the target file; the data loading device is used for receiving the first request message; reading a plurality of data blocks meeting the first query condition in the target file from a memory according to the file information; and buffering the plurality of data blocks.
In a possible implementation method, the data loading device is further configured to receive a second request message sent by the first data reading device, where the second request message is used to read a data block in the target file that meets the first query condition; converting a first data block of the plurality of data blocks into a second data block conforming to the format requirement of the first data reading device; the second data block is sent to the first data reading device.
In a possible implementation method, the system further includes a second data reading device; the second data reading device is used for sending a third request message to the data loading device, wherein the third request message is used for reading a data block meeting a second query condition in the target file, and the second query condition is used for indicating the characteristics of the data block to be read in the target file; the data loading device is further configured to receive the third request message; converting a third data block of the plurality of data blocks into a fourth data block conforming to the format requirement of the second data reading device, the third data block satisfying the second query condition; the fourth data block is sent to the second data reading device.
In a possible implementation method, the data loading device is specifically configured to read, from the memory or the data loading device, index data of a data block in the target file according to the file information, where the index data is used to indicate a feature of the data block in the target file; and reading the plurality of data blocks meeting the first query condition in the target file from the memory according to the first query condition and the index data.
In one possible implementation, the data loading device includes a data processor; and the data loading device, when reading the index data of the data block in the target file from the memory according to the file information, is specifically configured to read the index data from the high-performance layer of the memory according to the file information.
In one possible implementation, the data loading device includes a computing storage device; and the data loading device, when reading the index data of the data block in the target file from the data loading device according to the file information, is specifically configured to read the index data from the high-performance layer of the computing storage device according to the file information.
Drawings
FIG. 1 is a schematic diagram of a comparison of column-wise storage and row-wise storage;
FIG. 2 is a schematic layout of the ORC file format;
FIG. 3 is a schematic diagram of a system for storing and reading files according to an embodiment of the present application;
FIG. 4 is a flow chart of a data reading method according to an embodiment of the present application;
FIG. 5 is a flow chart of a data reading method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a system for storing and reading files according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a system for storing and reading files according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a data loading device according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a data loading device according to an embodiment of the present application.
Detailed Description
FIG. 1 is a schematic diagram of a comparison of column-wise storage and row-wise storage. Column storage differs from row storage mainly in file layout: row storage stores each data record, which consists of multiple columns, contiguously, whereas column storage stores each column of the records separately and contiguously. Column storage can bring greater optimization to OLAP queries.
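As a toy illustration only (not taken from the application), the following Java sketch lays the same three records out in row order and in column order to show the difference in file layout.

```java
// Toy illustration of row-oriented versus column-oriented layout for three records
// with columns (id, name, amount). Purely didactic; not part of the application.
import java.util.List;

public class RowVsColumnLayout {
    record Rec(int id, String name, double amount) { }

    public static void main(String[] args) {
        List<Rec> recs = List.of(new Rec(1, "a", 9.5), new Rec(2, "b", 3.0), new Rec(3, "c", 7.2));

        // Row storage: each record (all of its columns) is stored contiguously.
        StringBuilder rowLayout = new StringBuilder();
        for (Rec r : recs) {
            rowLayout.append(r.id()).append(',').append(r.name()).append(',').append(r.amount()).append(" | ");
        }

        // Column storage: each column of all records is stored contiguously.
        StringBuilder colLayout = new StringBuilder();
        for (Rec r : recs) colLayout.append(r.id()).append(',');
        colLayout.append(" | ");
        for (Rec r : recs) colLayout.append(r.name()).append(',');
        colLayout.append(" | ");
        for (Rec r : recs) colLayout.append(r.amount()).append(',');

        System.out.println("row layout:    " + rowLayout);
        System.out.println("column layout: " + colLayout);
    }
}
```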
FIG. 2 is a schematic layout of the ORC file format. An ORC file is divided by rows into a plurality of blocks (stripes); in addition to these blocks, the file contains a file footer (FileFooter) and file description information (PostScript). In this embodiment of the present application, "block" is also referred to as "data block"; the two terms have the same meaning and are used interchangeably below.
The file footer contains the file's layout information, the number of rows, column statistics, and the like.
The file description information provides the information necessary for interpreting the file, including the length of the file footer, the compression type, the file version, and the like.
The interior of a block is stored by column, with each column stored contiguously. One block includes index data (Index Data), row data (Row Data), and a block footer (Stripe Footer).
The index data includes a plurality of row group indexes (row group indexes), where one row group index indicates one row group (row group), and one row group includes a plurality of rows in the block. That is, a block is divided into a plurality of row groups, each row group is identified by a row group index, and each row group is stored in a column-wise manner.
The row data refers to the plurality of rows stored in the columnar manner; one column (column) in the row data consists of a plurality of streams (streams), with each stream corresponding to one row group.
The block footer holds the metadata of the block, including the column encodings, the directory and locations of the streams, and block-level statistics.
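The layout described above can be summarized with the following hedged Java sketch; the type names mirror the ORC terms (stripe, index data, row data, stripe footer, file footer, postscript), but the fields shown are a simplification for illustration, not the actual ORC specification.

```java
import java.util.List;

// Simplified, illustrative model of the ORC layout described above.
// The fields are a didactic simplification, not the real ORC metadata.
public class OrcLayoutSketch {
    record RowGroupIndex(long firstRow, long min, long max) { }              // per-row-group statistics
    record Stream(String column, String kind, long offset, long length) { }  // streams make up a column in the row data
    record StripeFooter(List<String> columnEncodings, List<Stream> streamDirectory) { }
    record Stripe(List<RowGroupIndex> indexData, byte[] rowData, StripeFooter stripeFooter) { }
    record FileFooter(long numberOfRows, List<String> columnStatistics) { }
    record PostScript(long fileFooterLength, String compression, String version) { }
    record OrcFile(List<Stripe> stripes, FileFooter fileFooter, PostScript postScript) { }
}
```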
Fig. 3 is a schematic diagram of a system structure for storing and reading files according to an embodiment of the present application. The system includes a HOST (HOST) and a Hadoop distributed file system (Hadoop distributed file system, HDFS).
The HDFS is responsible for the persistent storage of data, ensures high data reliability, and provides a file read/write interface, but the HDFS is not aware of the internal layout of a file. For example, the HDFS stores data in a columnar format including, but not limited to, the ORC format.
The host is a compute engine; the parsing logic for the columnar format is integrated in the host, and the host parses the columnar format synchronously in the process of reading data. The host includes one or more client Java virtual machines (client Java virtual machine, client JVM). A client JVM includes an application (e.g., Spark or Flink), a record reading module (Record Reader), a filtering module (Filter), a cache module (Cache), and an HDFS input stream module (HDFS input stream). The HDFS input stream module is used to obtain data from the HDFS.
In the host shown in FIG. 3, different client JVMs may run different applications to process data of the same file. For example, the host includes a client JVM 1 and a client JVM 2, where the client JVM 1 runs the application Spark and the client JVM 2 runs the application Flink. Spark requests the data of file 1 from the HDFS through its record reading module, the HDFS provides the data of file 1 to Spark according to Spark's format requirement, and Spark caches the data of file 1 in the cache module of client JVM 1. Similarly, Flink requests the data of file 1 from the HDFS through its record reading module, the HDFS provides the data of file 1 to Flink according to Flink's format requirement, and Flink caches the data of file 1 in the cache module of client JVM 2. In this way, different applications in different client JVMs can obtain, store, and process data of the same file.
Fig. 4 is a flowchart of a data reading method according to an embodiment of the present application. The method is illustrated by way of example with respect to the system shown in FIG. 3 and the ORC file format shown in FIG. 2. The method comprises the following steps:
In step 401, an application program of the client JVM sends a first request message to the record reading module of the client JVM.
The first request message includes file information and a query condition, and the query condition indicates a condition satisfied by data to be queried. The file information includes, for example, a file name, a file path, and the like.
In step 402, the record reading module reads metadata from the HDFS according to the file information.
The metadata includes one or more of: the columnar storage type of the file, the compression ratio, the number of columns, and the type of each column (such as integer or character).
In one implementation, the metadata is obtained from the file footer and the file description information of the file.
In step 403, the record reading module reads index data of the data block in the file from the HDFS according to the metadata.
The index data is used for indicating statistical properties of the data block, such as maximum value, minimum value, average value and the like in each column.
In step 404, the record reading module sends index data of the data block in the file to the filtering module of the client JVM.
In step 405, the filtering module determines, according to the query condition and the index data, a data block that satisfies the query condition.
For example, the query condition is to query the data blocks whose column 5 contains data greater than 100, and the filtering module determines, according to the index data, that data block 1, data block 3, data block 4, data block 8, and so on satisfy the query condition.
After determining the data blocks that satisfy the query condition, the filtering module obtains, from the index data, the location information of these data blocks and then caches the location information in the cache module.
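A hedged sketch of this filtering step is given below: block-level index statistics (here reduced to a per-column minimum and maximum) are compared against the query condition "column 5 contains data greater than 100", and the locations of the matching blocks are kept for caching. All names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative filtering step: keep the blocks whose column-5 maximum exceeds 100,
// then remember their locations for later reads. All names are hypothetical.
public class BlockFilterSketch {
    record ColumnStats(long min, long max) { }
    record BlockIndex(int blockId, long offset, long length, Map<Integer, ColumnStats> statsByColumn) { }
    record BlockLocation(int blockId, long offset, long length) { }

    static List<BlockLocation> filter(List<BlockIndex> indexData, int column, long threshold) {
        List<BlockLocation> locations = new ArrayList<>();
        for (BlockIndex block : indexData) {
            ColumnStats stats = block.statsByColumn().get(column);
            // A block can contain matching rows only if its column maximum exceeds the threshold.
            if (stats != null && stats.max() > threshold) {
                locations.add(new BlockLocation(block.blockId(), block.offset(), block.length()));
            }
        }
        return locations;
    }
}
```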
The following steps 406 to 409 describe the process in which the application program subsequently requests data from the record reading module when it needs the data; this process may be performed cyclically.
In step 406, the application program sends a second request message to the record reading module.
The second request message is used for requesting data of a file, and the file refers to a file indicated by the file information in the first request message.
In step 407, the record reading module obtains the location information of the next data block.
As described in step 405, the filtering module has already stored the location information of the data blocks that satisfy the query condition in the cache module. Therefore, when the record reading module receives a second request message from the application program, it sequentially obtains, from the cache module, the location information of the next data block that satisfies the query condition.
In step 408, the record reading module reads the data block from the HDFS according to the location information of the next data block.
In step 409, the record reading module sends the data block to the application.
Through the steps 406 to 409, each time the application program sends the second request message, the record reading module may obtain a data block that satisfies the query condition and return the data block to the application program.
In the above example, when data block 1, data block 3, data block 4, data block 8, and so on satisfy the query condition, the cache module stores the location information of data block 1, data block 3, data block 4, data block 8, and so on. When the record reading module receives the second request message for the first time, it obtains data block 1 from the HDFS according to the location information of data block 1 obtained from the cache module and returns data block 1 to the application program. When the record reading module receives the second request message for the second time, it obtains data block 3 from the HDFS according to the location information of data block 3 obtained from the cache module and returns data block 3 to the application program, and so on; the remaining requests are handled in the same way.
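The cyclic interaction of steps 406 to 409 could look roughly like the sketch below, in which the record reading module consumes the cached block locations one by one; readBlock is a placeholder for an HDFS positioned read and is an assumption, not an actual HDFS API.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Rough sketch of steps 406 to 409: every request from the application returns the
// next block that satisfies the query condition. readBlock() stands in for an HDFS
// positioned read; it is not an actual HDFS API.
public class RecordReaderSketch {
    record BlockLocation(int blockId, long offset, long length) { }

    private final Deque<BlockLocation> cachedLocations = new ArrayDeque<>();

    void cacheLocation(BlockLocation location) {          // filled by the filtering step (step 405)
        cachedLocations.add(location);
    }

    byte[] nextBlock() {
        BlockLocation next = cachedLocations.poll();      // step 407: next cached location
        if (next == null) return null;                    // no more matching blocks
        return readBlock(next);                           // step 408: read the block from the HDFS
    }

    private byte[] readBlock(BlockLocation location) {
        // Placeholder: a real implementation would issue a positioned read of
        // location.length() bytes at location.offset() from the target file.
        return new byte[(int) location.length()];
    }
}
```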
According to the scheme, the data meeting the query conditions can be quickly queried, and efficient use of the data is facilitated.
However, how to further optimize the data reading remains to be solved.
Fig. 5 is a flowchart of a data reading method according to an embodiment of the present application. This method is illustrated by way of example in the ORC file format shown in fig. 2. The method comprises the following steps:
In step 501, a first data reading device sends a first request message to a data loading device. Accordingly, the data loading device receives the first request message.
The data loading device may be a data processor (data processing unit, DPU), a computing storage device, or another device; this is not limited in the embodiments of the present application.
The first data reading device refers to a device that needs to use data, i.e. a requester of the data. The first data reading device may be a HOST (HOST) or a client JVM within a HOST, for example.
The first request message includes file information and a first query condition. The file information is used to indicate a target file, and includes, for example, a file name and/or a file path, etc. The first query condition indicates a characteristic of a data block in the target file to be read. The target file may be a file in a columnar storage format, such as an ORC file, or the like.
In step 502, the data loading device reads a plurality of data blocks meeting the first query condition in the target file from the memory according to the file information.
The memory may be an HDFS.
In one implementation, a data loading device reads index data of a data block in a target file from a memory or the data loading device according to file information, where the index data is used to indicate characteristics of the data block in the target file, such as information of maximum value, minimum value or average value in each column of the data block. And then the data loading device determines a plurality of data blocks meeting the first query condition according to the first query condition and the index data, and the data loading device acquires the plurality of data blocks from the memory.
In one implementation, the data loading device includes a DPU, and the memory includes a high-performance layer and a capacity layer; the high-performance layer stores the index data of the data blocks in the file, and the capacity layer stores the data blocks of the target file. The DPU reads the index data from the high-performance layer of the memory according to the file information, then determines, according to the first query condition and the index data, the plurality of data blocks that satisfy the first query condition, and further obtains the plurality of data blocks from the capacity layer of the memory. In this scheme, storing the index data of the data blocks in the high-performance layer helps increase the speed of reading the index data and therefore the reading speed of the data blocks.
In another implementation, the data loading device includes a computing storage device, the computing storage device includes a high-performance layer, and the memory includes a capacity layer; the high-performance layer stores the index data of the data blocks in the file, and the capacity layer stores the data blocks of the target file. The computing storage device reads the index data from its high-performance layer according to the file information, then determines, according to the first query condition and the index data, the plurality of data blocks that satisfy the first query condition, and further obtains the plurality of data blocks from the capacity layer of the memory. In this scheme, storing the index data of the data blocks in the high-performance layer helps increase the speed of reading the index data and therefore the reading speed of the data blocks.
In step 503, the data loading device caches the plurality of read data blocks in the data loading device.
In one implementation, the data loading device may also cache, in the data loading device, the location information of the plurality of read data blocks in the memory.
Illustratively, the first query condition is: a data block in column 5 containing data greater than 100 is queried. The data loading device obtains index data of each data block of the target file according to the file information, and determines that the data block 1, the data block 3, the data block 4 and the data block 8 meet the first query condition according to the index data. The data loading device then obtains the data block 1, the data block 3, the data block 4 and the data block 8 from the memory and caches the data blocks locally. Optionally, the data loading device also caches the position information of the data block 1, the data block 3, the data block 4 and the data block 8 in the memory to the data loading device.
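Under the assumptions above (index data in a high-performance layer, block payloads in a capacity layer), the pre-read of steps 502 and 503 might be sketched as follows; HighPerformanceLayer, CapacityLayer, and the cache maps are hypothetical names, not interfaces defined by this application.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hedged sketch of steps 502 and 503: read the index data from a high-performance
// layer, select the blocks that satisfy the query condition, fetch them from the
// capacity layer, and cache the blocks and their locations locally.
// All type names are hypothetical.
public class PrefetchSketch {
    interface HighPerformanceLayer { List<BlockIndex> readIndex(String fileInfo); }
    interface CapacityLayer { byte[] readBlock(String fileInfo, long offset, long length); }
    record ColumnStats(long min, long max) { }
    record BlockIndex(int blockId, long offset, long length, Map<Integer, ColumnStats> stats) { }

    private final Map<Integer, byte[]> cachedBlocks = new ConcurrentHashMap<>();
    private final Map<Integer, BlockIndex> cachedLocations = new ConcurrentHashMap<>();

    void prefetch(HighPerformanceLayer fast, CapacityLayer bulk, String fileInfo, int column, long threshold) {
        for (BlockIndex block : fast.readIndex(fileInfo)) {                   // step 502: index from the fast layer
            ColumnStats stats = block.stats().get(column);
            if (stats != null && stats.max() > threshold) {                   // first query condition
                cachedLocations.put(block.blockId(), block);                  // optional: remember the location
                cachedBlocks.put(block.blockId(),
                        bulk.readBlock(fileInfo, block.offset(), block.length())); // step 503: cache the block
            }
        }
    }
}
```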
According to this scheme, the data blocks that satisfy the query condition are cached in the data loading device in advance in an asynchronous manner, that is, the data blocks are pre-read. When the first data reading device subsequently requests these data blocks, the data loading device can obtain them locally and return them to the first data reading device, which increases the reading speed of the data blocks and allows them to be returned to the first data reading device quickly.
In an implementation method, steps 504 to 506 below are further performed after step 503. Steps 504 to 506 describe the process in which the first data reading device requests data from the data loading device when it needs the data; this process may be performed cyclically.
In step 504, the first data reading device sends a second request message to the data loading device.
The second request message is used for reading a data block meeting the first query condition in a target file, and the target file is the same file as the target file indicated by the file information in the first request message.
In step 505, the data loading device converts a first data block of the plurality of data blocks into a second data block that meets the format requirements of the first data reading device.
Because the data loading device has already cached the plurality of data blocks that satisfy the first query condition, when the data loading device receives the second request message, it obtains, from its local cache, a first data block that satisfies the first query condition, where the first data block may be one or more data blocks. The data loading device then converts the first data block into a second data block that conforms to the format requirement of the first data reading device.
In step 506, the data loading device sends the second data block to the first data reading device.
Through the steps 504 to 506, each time the first data reading device sends the second request message, the data loading device may acquire one or more data blocks from the locally cached multiple data blocks, and return the acquired one or more data blocks to the first data reading device according to the format requirement of the first data reading device.
For example, in the above example, the data loading device has cached data block 1, data block 3, data block 4, and data block 8. When the data loading device receives the second request message for the first time, it obtains data block 1 from its local cache and returns data block 1 to the first data reading device according to the format requirement of the first data reading device; when the data loading device receives the second request message for the second time, it obtains data block 3 from its local cache and returns data block 3 to the first data reading device according to the format requirement of the first data reading device, and so on.
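A hedged sketch of steps 504 to 506: on each read request, the data loading device takes the next locally cached block and converts it to the format required by the requesting reader before returning it. FormatConverter and the readerFormat parameter are assumptions for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Rough sketch of steps 504 to 506: serve the next cached block, converted to the
// format required by the requesting data reading device. FormatConverter is a
// hypothetical interface, not something defined by this application.
public class ServeSketch {
    interface FormatConverter { byte[] convert(byte[] block, String readerFormat); }

    private final Deque<byte[]> cachedBlocks = new ArrayDeque<>();

    void cacheBlock(byte[] block) {                        // filled by the pre-read of steps 502 and 503
        cachedBlocks.add(block);
    }

    byte[] onSecondRequest(FormatConverter converter, String readerFormat) {
        byte[] first = cachedBlocks.poll();                // the next locally cached block
        if (first == null) return null;                    // nothing left that satisfies the query condition
        return converter.convert(first, readerFormat);     // step 505: convert; step 506: return to the reader
    }
}
```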
In one possible implementation method, the data loading device may periodically update the cached data blocks that satisfy the first query condition, so as to ensure that the data loading device has sufficient storage capacity. For example, the data loading device deletes data blocks that have not been read within a set time period, or deletes data blocks whose caching duration exceeds a set threshold. Under such an update policy, after the first data reading device sends the second request message, the data loading device may fail to obtain locally a data block that satisfies the first query condition. If the data loading device cannot obtain such a data block locally, it can, based on the cached location information of the plurality of data blocks that satisfy the first query condition, obtain the corresponding data block from the memory again and return it to the first data reading device, and the data loading device may cache the obtained data block again.
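One way to realize the update policy described above, stated purely as an assumption, is time-based eviction combined with a re-fetch path that falls back to the cached location information; the sketch below is not prescribed by this application.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hedged sketch of the cache update policy: evict entries whose last read or whose
// caching time is older than a threshold; on a miss, re-read the block from the
// memory using the still-cached location information. All names are illustrative.
public class BlockCacheSketch {
    record Entry(byte[] block, long cachedAtMillis, long lastReadMillis) { }
    record BlockLocation(long offset, long length) { }

    private final Map<Integer, Entry> entries = new ConcurrentHashMap<>();
    private final Map<Integer, BlockLocation> locations = new ConcurrentHashMap<>();

    void evictOlderThan(long maxAgeMillis, long now) {
        for (Iterator<Map.Entry<Integer, Entry>> it = entries.entrySet().iterator(); it.hasNext(); ) {
            Entry e = it.next().getValue();
            if (now - e.lastReadMillis() > maxAgeMillis || now - e.cachedAtMillis() > maxAgeMillis) {
                it.remove();                               // the location information is kept for a later re-fetch
            }
        }
    }

    byte[] get(int blockId, Function<BlockLocation, byte[]> reFetchFromMemory) {
        Entry cached = entries.get(blockId);
        if (cached != null) return cached.block();
        BlockLocation location = locations.get(blockId);
        if (location == null) return null;
        byte[] block = reFetchFromMemory.apply(location);  // cache miss: re-read the block from the memory
        long now = System.currentTimeMillis();
        entries.put(blockId, new Entry(block, now, now));  // cache the re-obtained block again
        return block;
    }
}
```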
In the embodiments of the application, the data loading device can provide the data-block pre-reading and returning service for a plurality of data reading devices. For example, the data loading device can provide the pre-reading and returning service for the first data reading device and the second data reading device at the same time. Through steps 501 to 503, the data loading device obtains, for the first data reading device, a plurality of data blocks that satisfy the first query condition and caches them locally. Through steps similar to steps 501 to 503, the data loading device obtains, for the second data reading device, a plurality of data blocks that satisfy a second query condition and caches them locally, where the second query condition is used to indicate the characteristics of the data blocks to be read in the target file. The second query condition may be the same as or different from the first query condition; that is, the first data reading device and the second data reading device query the same target file, but their query conditions may be the same or different. There may be overlap between the data blocks that satisfy the second query condition and the data blocks that satisfy the first query condition; for the overlapping data blocks, the data loading device only needs to cache one copy and does not need to cache them repeatedly. When the data loading device receives a third request message sent by the second data reading device, where the third request message is used to read the data blocks in the target file that satisfy the second query condition, and the data loading device has locally cached data blocks that satisfy the second query condition, the data loading device converts a third data block that satisfies the second query condition among the plurality of cached data blocks into a fourth data block that conforms to the format requirement of the second data reading device, and then sends the fourth data block to the second data reading device.
The following is described in connection with an example. It is assumed that, through the operations from step 501 to step 503, the data loading device locally caches a plurality of data blocks that satisfy the first query condition sent by the first data reading device, and specifically includes: data block 1, data block 3, data block 4, data block 8. Through operations similar to the above steps 501 to 503, the data loading device locally caches a plurality of data blocks that satisfy the second query condition sent by the second data reading device, and specifically includes: data block 4, data block 5, data block 8, data block 10. Wherein, the data block 4 is a repeated data block, the data loading device only caches one copy, and likewise, the data block 8 is a repeated data block, and the data loading device also caches one copy. The data blocks actually buffered by the data loading means therefore comprise: data block 1, data block 3, data block 4, data block 5, data block 8, data block 10. Subsequently, when the first data reading device requests the data loading device to read the data blocks meeting the first query condition, the data loading device returns the data blocks 1, 3, 4 and 8 to the first data reading device according to the format requirement of the first data reading device. Similarly, when the second data reading device requests the data loading device to read the data blocks meeting the second query condition, the data loading device returns the data blocks 4, 5, 8 and 10 to the second data reading device according to the format requirement of the second data reading device.
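The sharing in this example can be made concrete with the small sketch below: blocks needed by both readers are stored once, and each reader only keeps a view of the block identifiers it requested. Every name is illustrative.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative deduplication: the data loading device caches the union of the blocks
// needed by both readers (1, 3, 4, 5, 8, 10), not eight separate entries.
public class SharedCacheSketch {
    private final Map<Integer, byte[]> sharedBlocks = new ConcurrentHashMap<>();

    Set<Integer> cacheFor(int[] blockIds) {
        Set<Integer> view = new LinkedHashSet<>();
        for (int id : blockIds) {
            sharedBlocks.computeIfAbsent(id, k -> new byte[0]);   // each block is stored at most once
            view.add(id);
        }
        return view;
    }

    public static void main(String[] args) {
        SharedCacheSketch cache = new SharedCacheSketch();
        Set<Integer> reader1 = cache.cacheFor(new int[] {1, 3, 4, 8});    // first query condition
        Set<Integer> reader2 = cache.cacheFor(new int[] {4, 5, 8, 10});   // second query condition
        System.out.println(reader1 + " + " + reader2 + " -> "
                + cache.sharedBlocks.size() + " cached blocks");          // prints 6, not 8
    }
}
```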
According to this scheme, the data blocks pre-read by the data loading device can be shared by the first data reading device and the second data reading device, and the data loading device only needs to cache one copy of each shared data block, which saves cache space.
In the above description, two data reading apparatuses are taken as an example, and the number of data reading apparatuses is not limited in practical application.
Fig. 6 is a schematic diagram of a system structure for storing and reading files according to an embodiment of the present application. The system includes a host, a DPU, and an HDFS, which are specific examples of the data reading device (i.e., the first data reading device or the second data reading device), the data loading device, and the memory, respectively, in the embodiment of fig. 5.
Wherein the host is a computing engine, and the host comprises one or more client JVMs. The client JVM includes applications (e.g., spark, flink) and a read interface (reader interface). Wherein the read interface is for retrieving data from the DPU.
The DPU includes a read engine module (reader engine), a filtering module (filter), a data loading module (data loader), a data conversion module (data conversion), and a cache module (cache). In connection with the embodiment of FIG. 5, the read engine module is configured to perform step 501 and to trigger the data loading module to perform steps 502 and 503, and is also configured to perform step 504 and to trigger the filtering module to perform step 505. The filtering module is configured to determine, according to the index data and the query condition, the plurality of data blocks that satisfy the query condition. The cache module is configured to store the data blocks that satisfy the query condition and the index data of the data blocks; an eviction policy for the data blocks is configured in the cache module, so that data can be deleted from the cache module. The data conversion module is configured to convert a data block into data that conforms to the format requirement of the host.
The HDFS is responsible for the persistent storage of data, ensures high data reliability, and provides a file read/write interface, but the HDFS is not aware of the internal layout of a file. For example, the HDFS stores data in a columnar format including, but not limited to, the ORC format. The HDFS includes a high-performance layer and a capacity layer: the high-performance layer stores the index data of the data blocks in a file, and the capacity layer stores the data blocks of the file.
The system of FIG. 6 has the following advantages compared to the system of FIG. 3:
First, in the system of FIG. 6, the cache module does not need to be deployed in the client JVMs of the host; instead, the cache module and the data conversion module are deployed within the DPU. The DPU can read the data blocks of a file from the HDFS in advance and store them in its cache module. When a client JVM subsequently needs the data of the file, the DPU converts the format of the data blocks through the data conversion module according to the data format requirements of the different client JVMs and sends the format-converted data blocks to the corresponding client JVM. The system therefore stores one copy of the data in the DPU and shares it among the client JVMs, without storing the same data inside each client JVM as in the system of FIG. 3, and thus can reduce the storage capacity requirements on the client JVMs.
Second, in the system of FIG. 3, the application program in a client JVM is a process that is pulled up temporarily at run time, and its memory is released after the task ends, so the data in the cache module is also deleted, resulting in low data usage efficiency. In the system of FIG. 6, the cache module of the DPU can keep the data for a longer time, so the data usage efficiency is higher.
Third, in the system of FIG. 6, the index data of the data blocks is stored in the high-performance layer, which helps increase the speed of reading the index data and therefore the reading speed of the data blocks.
FIG. 7 is a schematic diagram of a system structure for storing and reading files according to an embodiment of the present application. The system includes a host, a computing storage device, and an HDFS, which are specific examples of the data reading device, the data loading device, and the memory, respectively, in the embodiment of FIG. 5. The main difference between the system shown in FIG. 7 and the system shown in FIG. 6 is that the data loading device is specifically a computing storage device, and a high-performance layer for storing the index data of the data blocks of a file is provided in the computing storage device.
It will be appreciated that, in order to implement the functions of the above embodiments, the data loading device includes corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the elements and method steps of the examples described in connection with the embodiments disclosed herein may be implemented as hardware or a combination of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application scenario and design constraints imposed on the solution.
Fig. 8 and fig. 9 are schematic structural diagrams of a possible data loading device according to an embodiment of the present application. These data loading devices may be used to implement the functions of the data loading devices in the above method embodiments, so that the beneficial effects of the above method embodiments may also be implemented. In an embodiment of the present application, the data loading device may be a DPU or a computing storage device, or the like.
The data loading device 800 shown in fig. 8 includes a processing unit 810 and a transceiving unit 820. The data loading device 800 is used to implement the functions of the data loading device in the above-described method embodiment.
A transceiver unit 820, configured to receive a first request message sent by a first data reading apparatus, where the first request message includes file information and a first query condition, where the file information is used to indicate a target file, and the first query condition is used to indicate a feature of a data block to be read in the target file; a processing unit 810, configured to read, from a memory, a plurality of data blocks in the target file that satisfy the first query condition according to the file information; the plurality of data blocks is cached.
In a possible implementation method, the transceiver unit 820 is further configured to receive a second request message sent by the first data reading device, where the second request message is used to read a data block in the target file that meets the first query condition; the processing unit 810 is further configured to convert a first data block of the plurality of data blocks into a second data block that meets a format requirement of the first data reading device; the transceiver unit 820 is further configured to send the second data block to the first data reading device.
In a possible implementation method, the transceiver unit 820 is further configured to receive a third request message sent by the second data reading device, where the third request message is used to read a data block in the target file that meets a second query condition, and the second query condition is used to indicate a feature of the data block to be read in the target file; the processing unit 810 is further configured to convert a third data block of the plurality of data blocks into a fourth data block that meets the format requirement of the second data reading device, where the third data block meets the second query condition; the transceiver unit 820 is further configured to send the fourth data block to the second data reading device.
In a possible implementation method, the processing unit 810 is further configured to read, according to the file information, index data of a data block in the target file from the memory or the data loading device, where the index data is used to indicate a feature of the data block in the target file; the plurality of data blocks satisfying the first query condition are determined according to the first query condition and the index data.
In one possible implementation, the data loading device includes a data processor; the processing unit 810 is specifically configured to read the index data from the high performance layer of the memory according to the file information.
In one possible implementation, the data loading device includes a computing storage device; the processing unit 810 is specifically configured to read the index data from the high performance layer of the computing storage device according to the file information.
The more detailed descriptions of the processing unit 810 and the transceiver unit 820 may be directly obtained by referring to the related descriptions in the above method embodiments, and are not repeated herein.
Fig. 9 is a schematic structural diagram of a data loading device according to an embodiment of the present application. These data loading devices may be used to implement the functions of the data loading device in the method embodiment of fig. 5, so that the beneficial effects of the method embodiment described above may also be implemented. The data loading device 900 shown in fig. 9 includes a processor 910 and an interface circuit 920. The processor 910 and the interface circuit 920 are coupled to each other. It is understood that the interface circuit 920 may be a transceiver or an input-output interface. Optionally, the data loading device 900 may further include a memory 930 for storing instructions executed by the processor 910 or for storing input data required by the processor 910 to execute the instructions or for storing data generated after the processor 910 executes the instructions.
When the data loading device 900 is used to implement the method embodiment of fig. 5, the processor 910 is configured to implement the functions of the reading engine module, the filtering module, the data loading module, and the data conversion module in the data loading device. The memory 930 is used to implement the functions of the cache module in the data loading device described above.
It is to be appreciated that the processor in embodiments of the present application may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field Programmable Gate Array, FPGA) or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof. The general purpose processor may be a microprocessor, but in the alternative, it may be any conventional processor.
The method steps in the embodiments of the present application may be implemented by hardware, or may be implemented by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in random access memory, flash memory, read-only memory, programmable read-only memory, erasable programmable read-only memory, electrically erasable programmable read-only memory, registers, a hard disk, a removable disk, a compact disc read-only memory (CD-ROM), or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. In addition, the ASIC may reside in a base station or terminal device. The processor and the storage medium may also reside as discrete components in a base station or terminal device.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer programs or instructions. A computer program refers to a set of instructions, typically written in a programming language to run on a target architecture, that instructs a computer or another device with information processing capability to perform each step. When the computer program or instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are performed in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, a base station, a terminal device, or another programmable apparatus. The computer program or instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer program or instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless manner. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center that integrates one or more usable media. The usable medium may be a magnetic medium such as a floppy disk, a hard disk, or a magnetic tape; an optical medium such as a digital video disc; or a semiconductor medium such as a solid state disk. The computer-readable storage medium may be a volatile or nonvolatile storage medium, or may include both volatile and nonvolatile types of storage media.
In the various embodiments of the application, if there is no specific description or logical conflict, terms and/or descriptions between the various embodiments are consistent and may reference each other, and features of the various embodiments may be combined to form new embodiments according to their inherent logical relationships.
In the present application, "at least one" means one or more, and "a plurality" means two or more. "and/or", describes an association relationship of an association object, and indicates that there may be three relationships, for example, a and/or B, and may indicate: a alone, a and B together, and B alone, wherein a, B may be singular or plural. In the text description of the present application, the character "/", generally indicates that the associated object is an or relationship; in the formulas of the present application, the character "/" indicates that the front and rear associated objects are a "division" relationship.
It will be appreciated that the various numerical numbers referred to in the embodiments of the present application are merely for ease of description and are not intended to limit the scope of the embodiments of the present application. The sequence number of each process does not mean the sequence of the execution sequence, and the execution sequence of each process should be determined according to the function and the internal logic.

Claims (14)

1. A data reading method, comprising:
the method comprises the steps that a data loading device receives a first request message sent by a first data reading device, wherein the first request message comprises file information and a first query condition, the file information is used for indicating a target file, and the first query condition is used for indicating the characteristics of a data block to be read in the target file;
the data loading device reads a plurality of data blocks meeting the first query condition in the target file from a memory according to the file information;
the data loading device caches the plurality of data blocks.
2. The method of claim 1, wherein the method further comprises:
the data loading device receives a second request message sent by the first data reading device, wherein the second request message is used for reading a data block meeting the first query condition in the target file;
the data loading device converts a first data block in the plurality of data blocks into a second data block conforming to the format requirement of the first data reading device;
the data loading device sends the second data block to the first data reading device.
3. The method of claim 1 or 2, wherein the method further comprises:
the data loading device receives a third request message sent by a second data reading device, wherein the third request message is used for reading a data block meeting a second query condition in the target file, and the second query condition is used for indicating the characteristics of the data block to be read in the target file;
the data loading device converts a third data block in the plurality of data blocks into a fourth data block which meets the format requirement of the second data reading device, wherein the third data block meets the second query condition;
the data loading device sends the fourth data block to the second data reading device.
4. The method according to any one of claims 1 to 3, wherein the data loading device reading, from the memory according to the file information, the plurality of data blocks meeting the first query condition in the target file comprises:
the data loading device reads index data of the data blocks in the target file from the memory or the data loading device according to the file information, wherein the index data is used for indicating the characteristics of the data blocks in the target file;
the data loading device reads the plurality of data blocks meeting the first query condition in the target file from the memory according to the first query condition and the index data.
5. The method of claim 4, wherein the data loading device comprises a data processor;
the data loading device reading the index data of the data blocks in the target file from the memory according to the file information comprises:
the data loading device reads the index data from a high-performance layer of the memory according to the file information.
6. The method of claim 4, wherein the data loading device comprises a computational storage device;
the data loading device reading the index data of the data blocks in the target file from the data loading device according to the file information comprises:
the data loading device reads the index data from a high-performance layer of the computational storage device according to the file information.
7. A communication system, comprising a first data reading device and a data loading device;
the first data reading device is configured to send a first request message to the data loading device, where the first request message includes file information and a first query condition, the file information is used to indicate a target file, and the first query condition is used to indicate a feature of a data block to be read in the target file;
the data loading device is configured to receive the first request message; read a plurality of data blocks meeting the first query condition in the target file from a memory according to the file information; and cache the plurality of data blocks.
8. The system of claim 7, wherein the data loading device is further configured to receive a second request message sent by the first data reading device, where the second request message is used to read a data block in the target file that meets the first query condition; convert a first data block of the plurality of data blocks into a second data block conforming to the format requirements of the first data reading device; and send the second data block to the first data reading device.
9. The system of claim 7 or 8, wherein the system further comprises a second data reading device;
the second data reading device is configured to send a third request message to the data loading device, where the third request message is used to read a data block in the target file that meets a second query condition, and the second query condition is used to indicate a feature of the data block to be read in the target file;
the data loading device is further configured to receive the third request message; convert a third data block of the plurality of data blocks into a fourth data block conforming to the format requirements of the second data reading device, the third data block satisfying the second query condition; and send the fourth data block to the second data reading device.
10. The system according to any one of claims 7 to 9, wherein the data loading device is specifically configured to read index data of the data blocks in the target file from the memory or the data loading device according to the file information, the index data being used to indicate characteristics of the data blocks in the target file; and read the plurality of data blocks meeting the first query condition in the target file from the memory according to the first query condition and the index data.
11. The system of claim 10, wherein the data loading device comprises a data processor;
wherein, in reading the index data of the data blocks in the target file from the memory according to the file information, the data loading device is specifically configured to read the index data from a high-performance layer of the memory.
12. The system of claim 10, wherein the data loading device comprises a computational storage device;
wherein, in reading the index data of the data blocks in the target file from the data loading device according to the file information, the data loading device is specifically configured to read the index data from a high-performance layer of the computational storage device.
13. A data loading device, comprising a processor and an interface circuit, wherein the interface circuit is configured to receive signals from data loading devices other than the data loading device and transmit the signals to the processor, or to send signals from the processor to data loading devices other than the data loading device; and the processor is configured to implement the method according to any one of claims 1 to 6 by using a logic circuit or by executing code instructions.
14. A computer-readable storage medium, characterized in that the storage medium stores a computer program or instructions which, when executed by a data loading device, implement the method according to any one of claims 1 to 6.
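
To make the flow recited in claims 1 and 4 to 6 concrete, the following is a minimal Python sketch of a data loading device that, on receiving a first request message carrying file information and a query condition, reads the index data of the target file, selects the data blocks whose characteristics satisfy the query condition, reads those blocks from memory, and caches them. All names here (FirstRequest, DataLoadingDevice, read_index, read_block, block_id) are illustrative assumptions introduced only for explanation; they are not part of the claims or of any particular product API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class FirstRequest:
    """First request message: identifies the target file and carries the query condition."""
    file_info: str                         # e.g. a file name or ID indicating the target file
    query: Callable[[dict], bool]          # predicate over a block's index entry (the query condition)


@dataclass
class DataLoadingDevice:
    """Hypothetical data loading device (for example, a data processor or computational storage device)."""
    read_index: Callable[[str], List[dict]]    # returns index entries describing each block's characteristics
    read_block: Callable[[str, int], bytes]    # reads one data block of the target file from memory
    cache: Dict[Tuple[str, int], bytes] = field(default_factory=dict)

    def handle_first_request(self, req: FirstRequest) -> int:
        """Read and cache every data block of the target file that meets the query condition."""
        # Read the index data, which indicates the characteristics of the data blocks
        # (claims 5 and 6 place this index in a high-performance layer of the memory or device).
        index_entries = self.read_index(req.file_info)

        # Use the query condition together with the index data to select the matching blocks.
        matching_ids = [e["block_id"] for e in index_entries if req.query(e)]

        # Read the matching blocks from memory and cache them for later read requests.
        for block_id in matching_ids:
            self.cache[(req.file_info, block_id)] = self.read_block(req.file_info, block_id)
        return len(matching_ids)
```

For example, a first data reading device that needs only the blocks of a hypothetical file "logs.orc" whose index entries record a given date could issue FirstRequest("logs.orc", lambda e: e.get("date") == "2022-06-28"); the device would then prefetch exactly those blocks into its cache.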
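A second sketch, under the same illustrative assumptions, shows the read path of claims 2 and 3: when a later request arrives from the first data reading device, or from a second data reading device with its own query condition, the data loading device selects the cached blocks that satisfy that request's query condition, converts each one into the format the requesting device expects, and sends the converted blocks back. The converter is a placeholder; the claims do not fix any particular source or target format.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical per-reader format converter: maps a cached (raw) block to the reader's format.
Converter = Callable[[bytes], bytes]


def serve_read_request(cache: Dict[Tuple[str, int], bytes],
                       file_info: str,
                       query: Callable[[dict], bool],
                       index_entries: List[dict],
                       convert: Converter) -> List[bytes]:
    """Serve a later read request from the device cache, converting each block on the way out."""
    converted: List[bytes] = []
    for entry in index_entries:
        key = (file_info, entry["block_id"])
        # A block is returned only if it satisfies this request's query condition
        # and was already cached by the earlier first-request handling.
        if query(entry) and key in cache:
            raw = cache[key]                 # the cached block (the "first"/"third" data block)
            converted.append(convert(raw))   # its converted form (the "second"/"fourth" data block)
    return converted

# Illustrative use: a second reader that expects, say, JSON lines rather than the raw column
# format could pass its own query condition and its own converter:
#   blocks = serve_read_request(device.cache, "logs.orc",
#                               lambda e: e.get("column") == "latency",
#                               index_entries, raw_to_jsonl)
```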
CN202211165071.8A 2022-06-28 2022-09-23 Data reading method, data loading device and communication system Pending CN117348793A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2023/087573 WO2024001413A1 (en) 2022-06-28 2023-04-11 Data reading method, data loading apparatus, and communication system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210749451X 2022-06-28
CN202210749451 2022-06-28

Publications (1)

Publication Number Publication Date
CN117348793A true CN117348793A (en) 2024-01-05

Family

ID=89354555

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211165071.8A Pending CN117348793A (en) 2022-06-28 2022-09-23 Data reading method, data loading device and communication system

Country Status (2)

Country Link
CN (1) CN117348793A (en)
WO (1) WO2024001413A1 (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7809883B1 (en) * 2007-10-16 2010-10-05 Netapp, Inc. Cached reads for a storage system
CN102111448B (en) * 2011-01-13 2013-04-24 华为技术有限公司 Data prefetching method of DHT memory system and node and system
CN112486858A (en) * 2016-03-17 2021-03-12 华为技术有限公司 Data prefetching method and device
CN110471894A (en) * 2019-07-22 2019-11-19 腾讯科技(深圳)有限公司 A kind of data prefetching method, device, terminal and storage medium
CN113419824A (en) * 2021-01-25 2021-09-21 阿里巴巴集团控股有限公司 Data processing method, device, system and computer storage medium

Also Published As

Publication number Publication date
WO2024001413A1 (en) 2024-01-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination