CN117591040A - Data processing method, device, equipment and readable storage medium - Google Patents

Data processing method, device, equipment and readable storage medium

Info

Publication number
CN117591040A
Authority
CN
China
Prior art keywords
file
remote
storage
data block
local
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410074857.1A
Other languages
Chinese (zh)
Inventor
陈再妮
赵阳
王宏博
吕夫洋
伍鑫
林子彦
潘安群
雷海林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202410074857.1A
Publication of CN117591040A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/0643 Management of files
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data processing method, apparatus, device, and readable storage medium. The method comprises: acquiring a storage data block to be cached; acquiring remote storage information and remote attribute information of the storage data block in a remote storage database; determining, according to the remote storage information and the remote attribute information, the local file name under which the storage data block is cached in the target computing device; and creating, in a local disk of the target computing device, a local file uniquely indicated by the local file name, and caching the storage data block in that local file. The method can be applied to scenarios such as maps, traffic, autonomous driving, in-vehicle systems, cloud technology, artificial intelligence, intelligent transportation, and assisted driving, and improves data query efficiency.

Description

Data processing method, device, equipment and readable storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a data processing method, apparatus, device, and readable storage medium.
Background
With the rapid development of computer technology, the scale of business data keeps growing, and the mismatch between computing capacity and storage capacity increasingly limits data processing efficiency. To improve data processing performance, scalability, and flexibility, many scenarios adopt a storage-compute separation architecture, in which data is typically stored on independent remote storage, such as object storage, and computing tasks are performed by one or more independent computing devices.
However, under a storage-compute separation architecture, every computing device must query the data held in remote storage over the network, and the remote storage must transmit that data to the computing device over the network. This query path is slow, so low data query efficiency still reduces data processing efficiency.
Disclosure of Invention
The embodiments of the present application provide a data processing method, apparatus, device, and readable storage medium, which can improve data query efficiency and data processing efficiency in a storage-compute separation architecture.
In one aspect, an embodiment of the present application provides a data processing method applied to a storage-compute separation architecture that includes a computing device cluster and a remote storage database that are independent of each other. The method is performed by a target computing device, which is any computing device in the computing device cluster, and the method includes:
acquiring a storage data block to be cached; the storage data block is any one of the data blocks stored in the remote storage database and contains target service data, where the target service data is service data whose query heat is higher than a preset threshold;
acquiring remote storage information and remote attribute information of the storage data block in the remote storage database; the remote storage information indicates the storage position of the storage data block in the remote storage database, and the remote attribute information indicates the storage attribute of the storage data block in the remote storage database;
determining, according to the remote storage information and the remote attribute information, the local file name under which the storage data block is cached in the target computing device;
and creating, in a local disk of the target computing device, a local file uniquely indicated by the local file name, and caching the storage data block in the local file uniquely indicated by the local file name.
In one aspect, an embodiment of the present application provides a data processing apparatus applied to a storage-compute separation architecture that includes a computing device cluster and a remote storage database that are independent of each other. The apparatus is applied to a target computing device, which is any computing device in the computing device cluster, and the apparatus comprises:
the block acquisition module is used for acquiring a storage data block to be cached; the storage data block is any one of the data blocks stored in the remote storage database and contains target service data, where the target service data is service data whose query heat is higher than a preset threshold;
the information acquisition module is used for acquiring remote storage information and remote attribute information of the storage data block in the remote storage database; the remote storage information indicates the storage position of the storage data block in the remote storage database, and the remote attribute information indicates the storage attribute of the storage data block in the remote storage database;
the file name determining module is used for determining, according to the remote storage information and the remote attribute information, the local file name under which the storage data block is cached in the target computing device;
and the block caching module is used for creating, in the local disk of the target computing device, a local file uniquely indicated by the local file name, and caching the storage data block in the local file uniquely indicated by the local file name.
In one embodiment, the file name determining module determines, according to the remote storage information and the remote attribute information, the local file name under which the storage data block is cached in the target computing device, by:
acquiring a default file size configured for the local disk; the default file size specifies the size of any one file in the local disk;
determining, according to the default file size and the remote storage information, the local file identifier under which the storage data block is cached in the target computing device;
fusing the local file identifier with the remote attribute information to obtain fused information;
and determining the fused information as the local file name under which the storage data block is cached in the target computing device.
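As an illustrative sketch only, the naming steps above might look as follows in Python. The names `fuse` and `local_file_name`, the underscore separator, and the example attribute string `tbl7_f3` are hypothetical; the text does not say how the local file identifier and the remote attribute information are fused, so simple string concatenation is assumed here, and the identifier is taken to be the integer quotient of the remote offset and the default file size, as a later embodiment describes.

```python
def fuse(remote_attr: str, local_file_id: int) -> str:
    """Fuse remote attribute info with a local file identifier.

    Assumption: fusion is a plain underscore join; the source does not
    specify the fusion format.
    """
    return f"{remote_attr}_{local_file_id}"


def local_file_name(remote_attr: str, remote_offset: int,
                    default_file_size: int) -> str:
    """Derive a local file name for the file that holds `remote_offset`."""
    if default_file_size <= 0:
        raise ValueError("default_file_size must be positive")
    # Local file identifier: integer quotient of the remote offset and the
    # default file size (integer division is an assumption).
    local_file_id = remote_offset // default_file_size
    return fuse(remote_attr, local_file_id)


name = local_file_name("tbl7_f3", remote_offset=100, default_file_size=64)
```

Because the name depends only on the remote attribute information, the remote offset, and the configured file size, the same block always maps to the same local file name, which is what lets a later query locate the file without scanning the cache.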
In one embodiment, the file name determining module determines, according to the default file size and the remote storage information, the local file identifier under which the storage data block is cached in the target computing device, by:
performing initial identifier mapping on the remote storage information and the default file size according to an initial identifier mapping rule, to obtain an initial file identifier;
performing end identifier mapping on the remote storage information and the default file size according to an end identifier mapping rule, to obtain an end file identifier;
and determining, according to the initial file identifier and the end file identifier, the local file identifier under which the storage data block is cached in the target computing device.
In one embodiment, the remote storage information includes the offset of the storage data block within the remote storage file; the remote storage file is the file in the remote storage database that stores the storage data block;
the file name determining module performs initial identifier mapping on the remote storage information and the default file size according to the initial identifier mapping rule, to obtain the initial file identifier, by:
determining the offset of the storage data block within the remote storage file as a remote offset;
determining, according to the initial identifier mapping rule, a first quotient result of the remote offset divided by the default file size;
and determining the first quotient result as the initial file identifier.
In one embodiment, the remote storage information includes the offset of the storage data block within the remote storage file and the storage data amount occupied by the storage data block within the remote storage file; the remote storage file is the file in the remote storage database that stores the storage data block;
the file name determining module performs end identifier mapping on the remote storage information and the default file size according to the end identifier mapping rule, to obtain the end file identifier, by:
determining the offset of the storage data block within the remote storage file as a remote offset, and determining the storage data amount occupied by the storage data block within the remote storage file as a remote data amount;
summing the remote offset and the remote data amount according to the end identifier mapping rule;
determining a second quotient result of the resulting sum divided by the default file size;
and determining the second quotient result as the end file identifier.
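The two mapping rules above can be sketched in a few lines of Python. This is a hedged illustration rather than the claimed implementation: the function names are hypothetical, the "quotient result" is read as integer (floor) division, and treating every identifier between the initial and end identifiers as belonging to the block is an assumption based on the description of blocks spanning consecutive files.

```python
from typing import List


def initial_file_id(remote_offset: int, default_file_size: int) -> int:
    # First quotient result: remote offset divided by the default file size.
    return remote_offset // default_file_size


def end_file_id(remote_offset: int, remote_amount: int,
                default_file_size: int) -> int:
    # Second quotient result: (remote offset + remote data amount) divided
    # by the default file size.
    return (remote_offset + remote_amount) // default_file_size


def local_file_ids(remote_offset: int, remote_amount: int,
                   default_file_size: int) -> List[int]:
    # Assumption: the block occupies every file from the initial identifier
    # through the end identifier, inclusive.
    first = initial_file_id(remote_offset, default_file_size)
    last = end_file_id(remote_offset, remote_amount, default_file_size)
    return list(range(first, last + 1))


ids = local_file_ids(remote_offset=100, remote_amount=50, default_file_size=64)
```

With a default file size of 64, a block at remote offset 100 holding 50 units starts in file 1 (100 // 64) and ends in file 2 (150 // 64), so it spans two consecutive local files.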
In one embodiment, the local file name comprises a first local file name and a second local file name;
the specific implementation manner of creating the local file uniquely indicated by the local file name in the local disk of the target computing device by the block cache module comprises the following steps:
creating a first local file uniquely indicated by a first local file name in a local disk;
creating a second local file uniquely indicated by a second local file name in the local disk;
and determining the first local file and the second local file as the local files uniquely indicated by the local file names.
In one embodiment, the data caching order indicated by the first local file name precedes the data caching order indicated by the second local file name;
the block caching module caches the storage data block in the local file uniquely indicated by the local file name by:
acquiring a default file size configured for the local disk; the default file size specifies the size of any one file in the local disk;
performing offset calculation on the default file size and the remote storage information according to an initial offset determination rule, to obtain an initial file offset of the storage data block in the first local file;
performing offset calculation on the default file size and the remote storage information according to an end offset determination rule, to obtain an end file offset of the storage data block in the second local file;
and determining the span from the initial file offset in the first local file to the end file offset in the second local file as the cache interval position of the storage data block, and writing the storage data block to the cache interval position.
In one embodiment, the remote storage information includes the offset of the storage data block within the remote storage file; the remote storage file is the file in the remote storage database that stores the storage data block;
the block caching module performs offset calculation on the default file size and the remote storage information according to the initial offset determination rule, to obtain the initial file offset of the storage data block in the first local file, by:
determining the offset of the storage data block within the remote storage file as a remote offset;
determining, according to the initial offset determination rule, a first remainder result of the remote offset divided by the default file size;
and determining the first remainder result as the initial file offset of the storage data block in the first local file.
In one embodiment, the remote storage information includes the offset of the storage data block within the remote storage file and the storage data amount occupied by the storage data block within the remote storage file; the remote storage file is the file in the remote storage database that stores the storage data block;
the block caching module performs offset calculation on the default file size and the remote storage information according to the end offset determination rule, to obtain the end file offset of the storage data block in the second local file, by:
determining the offset of the storage data block within the remote storage file as a remote offset, and determining the storage data amount occupied by the storage data block within the remote storage file as a remote data amount;
summing the remote offset and the remote data amount according to the end offset determination rule;
determining a second remainder result of the resulting sum divided by the default file size;
and determining the second remainder result as the end file offset of the storage data block in the second local file.
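To make the offset rules concrete, the following Python sketch computes the cache interval and writes a block across consecutive local files. It is an illustration under stated assumptions: `cache_interval` and `write_block` are hypothetical names, offsets and amounts are taken to be in bytes, local files are named `<prefix>_<file id>`, and the remainder results are computed with the modulo operator.

```python
import os
import tempfile


def cache_interval(remote_offset: int, remote_amount: int,
                   default_file_size: int):
    """Return ((initial file id, initial offset), (end file id, end offset))."""
    end = remote_offset + remote_amount
    # Quotients select the files; remainders select positions inside them.
    return (divmod(remote_offset, default_file_size),
            divmod(end, default_file_size))


def write_block(cache_dir: str, prefix: str, data: bytes,
                remote_offset: int, default_file_size: int) -> None:
    """Write `data` into the local files that cover its remote byte range."""
    pos, written = remote_offset, 0
    while written < len(data):
        file_id, offset = divmod(pos, default_file_size)
        # Never write past the end of the current fixed-size local file.
        n = min(default_file_size - offset, len(data) - written)
        path = os.path.join(cache_dir, f"{prefix}_{file_id}")
        mode = "r+b" if os.path.exists(path) else "w+b"
        with open(path, mode) as f:
            f.seek(offset)
            f.write(data[written:written + n])
        written += n
        pos += n


interval = cache_interval(100, 50, 64)
```

For a 50-byte block at remote offset 100 and a default file size of 64, the interval runs from offset 36 in file 1 to offset 22 in file 2. Note that when a block ends exactly on a file boundary, the end rule yields the next file identifier with an end offset of 0, so no bytes land in that last file.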
In one embodiment, after the block caching module caches the storage data block in the local file uniquely indicated by the local file name, the data processing apparatus further includes:
the index table adding module is used for determining the local file uniquely indicated by the local file name as a target local file;
the index table adding module is also used for acquiring any free node from the free node set as the target file cache node of the target local file; each free node in the free node set is a preconfigured node for recording the file cache information of a file in the local disk;
the index table adding module is also used for adding the remote attribute information to the target file cache node to obtain a filled file cache node;
the index table adding module is also used for acquiring the target data table in which the target service data is located in the remote storage database, and constructing an index relationship between the target data table and the filled file cache node;
the index table adding module is also used for adding the index relationship between the target data table and the filled file cache node to the shared cache index table of the target computing device; when the target computing device queries the target service data, the remote attribute information in the shared cache index table is used, together with the remote storage information, to determine the local file name under which the storage data block is cached in the target computing device, so that the target service data can be read from the local file uniquely indicated by the local file name.
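A minimal sketch of this bookkeeping, assuming a simple in-process data structure, is given below. The class names `FileCacheNode` and `SharedCacheIndex`, the fixed `capacity`, and the use of a Python dictionary keyed by table name are all hypothetical; the source does not describe the node layout or how the index table is shared.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FileCacheNode:
    # Remote attribute information; None while the node is still free.
    remote_attr: Optional[str] = None


class SharedCacheIndex:
    """Maps a data table name to the filled file cache nodes of its blocks."""

    def __init__(self, capacity: int) -> None:
        # Preconfigured free nodes, one per local file that can be tracked.
        self.free_nodes: List[FileCacheNode] = [
            FileCacheNode() for _ in range(capacity)
        ]
        self.table: Dict[str, List[FileCacheNode]] = {}

    def add(self, table_name: str, remote_attr: str) -> FileCacheNode:
        if not self.free_nodes:
            raise RuntimeError("no free file cache node available")
        node = self.free_nodes.pop()          # take any free node
        node.remote_attr = remote_attr        # fill it with remote attributes
        # Record the index relationship: data table -> filled node.
        self.table.setdefault(table_name, []).append(node)
        return node


index = SharedCacheIndex(capacity=2)
node = index.add("orders", "tbl7_f3")
```

Drawing nodes from a preallocated pool bounds the memory the index can use, which is one plausible reason for the free node set; the source does not state the motivation.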
In one embodiment, after the index table adding module adds the index relationship between the target data table and the filled file cache node to the shared cache index table of the target computing device, the data processing apparatus further includes:
the data query module is used for receiving a data query request sent by a query object; the data query request is used for requesting to query the target service data from the target data table;
the data query module is also used for determining, based on the data query request, the storage data block in which the target service data is located in the remote storage database and the remote storage information of the storage data block in the remote storage database;
the data query module is also used for traversing the shared cache index table;
the data query module is also used for, if the target data table exists in the shared cache index table, acquiring from the shared cache index table the filled file cache node that has an index relationship with the target data table, and determining, according to the remote attribute information recorded in the filled file cache node and the remote storage information, the local file name under which the storage data block is cached in the target computing device;
and the data query module is also used for reading the target service data from the local file uniquely indicated by the local file name in the local disk, and returning the target service data to the query object.
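The query path described above might be sketched as follows. This is a simplified, hypothetical illustration: the shared cache index table is reduced to a dictionary from table name to remote attribute string, local files are assumed to be named `<remote attribute>_<file id>`, offsets are in bytes, and returning `None` on a miss (so the caller can fall back to the remote storage database) is an assumption not stated in the text.

```python
import os
import tempfile
from typing import Dict, Optional


def read_cached(index: Dict[str, str], cache_dir: str, table: str,
                remote_offset: int, amount: int,
                default_file_size: int) -> Optional[bytes]:
    """Read a cached block from local files, or return None on a cache miss."""
    remote_attr = index.get(table)
    if remote_attr is None:
        return None  # table not in the index: caller may query remotely
    out = bytearray()
    pos = remote_offset
    while len(out) < amount:
        # Recompute file id and in-file offset from the remote position.
        file_id, offset = divmod(pos, default_file_size)
        n = min(default_file_size - offset, amount - len(out))
        path = os.path.join(cache_dir, f"{remote_attr}_{file_id}")
        with open(path, "rb") as f:
            f.seek(offset)
            chunk = f.read(n)
        if len(chunk) < n:
            raise IOError(f"cached file {path} is shorter than expected")
        out += chunk
        pos += n
    return bytes(out)
```

The lookup never scans cached data: the file names and offsets are recomputed from the remote storage information and the recorded remote attribute information, mirroring the arithmetic used when the block was cached.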
In one aspect, a computer device is provided, including: a processor and a memory;
the memory stores a computer program that, when executed by the processor, causes the processor to perform the methods of embodiments of the present application.
In one aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program, where the computer program includes program instructions that, when executed by a processor, perform a method in an embodiment of the present application.
In one aspect of the present application, a computer program product is provided that includes a computer program stored in a computer readable storage medium. A processor of a computer device reads the computer program from a computer-readable storage medium, and the processor executes the computer program to cause the computer device to perform the method provided in an aspect of the embodiments of the present application.
In the embodiments of the present application, service data with high query heat that is stored in the remote storage database of a storage-compute separation architecture can be cached in advance in the local disk of a computing device. During a data query, the computing device can then read the relevant service data directly from the local disk instead of querying the remote storage database over the network, which reduces the transmission delay caused by remote network transfer and improves data query efficiency and data processing efficiency. When caching service data to the local disk, the application caches the storage data block in which the service data is located in the remote storage database. Specifically, a local file name in the local disk is determined jointly from the remote storage information and the remote attribute information of the storage data block in the remote storage database; the computing device then creates a local file with that name in the local disk and caches the service data in it. Because the file name is derived from the remote storage information and the remote attribute information, service data belonging to the same storage data block in the remote storage database is stored in the same local file in the local disk. During a data query, there is therefore no need to traverse all the service data belonging to the same storage data block: the corresponding local file can be located quickly from the remote storage information and the remote attribute information, and the corresponding service data can be read out quickly. In summary, the present application can improve data query efficiency and data processing efficiency in a storage-compute separation architecture.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic architecture diagram of a solution system provided in an exemplary embodiment of the present application;
FIG. 2 is a flow chart of a method for processing data according to an exemplary embodiment of the present application;
FIG. 3 is a schematic diagram of a data caching scenario provided in an embodiment of the present application;
FIG. 4 is a schematic structural design diagram of index information according to an embodiment of the present application;
FIG. 5 is a flowchart of adding file cache index information of a local file storing a data block to a shared cache index table according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a logic architecture for caching data by a computing device according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
The embodiments of the present application relate to technologies such as artificial intelligence and storage-compute separation. For ease of understanding, the related technical terms and concepts are briefly explained first.
1. Artificial intelligence (Artificial Intelligence, AI)
Artificial intelligence is the theory, methods, technology, and application systems that use a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. Artificial intelligence is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Further, the embodiment of the application mainly relates to a big data processing technology in the artificial intelligence technology, which is used for caching hot data in a remote storage database.
2. Storage-compute separation architecture
In a storage-compute separation architecture, data that must be persisted is stored in network attached storage (Network Attached Storage, NAS), an object storage system, or a distributed storage system, while computing tasks are performed by one or more independent computing devices. This separation lets computing and storage resources scale independently on demand, so the system adapts more easily to changing workloads and gains performance, scalability, and flexibility.
3. OLAP database
An online analytical processing (On-Line Analytical Processing, OLAP) database is a type of database designed specifically to support online analytical processing. In contrast to conventional relational databases used for online transaction processing (On-Line Transaction Processing, OLTP), OLAP databases are primarily used to store and manage large amounts of historical data so that users can run fast queries and multidimensional analysis. OLAP databases are commonly used with data warehouse and business intelligence (BI) applications to support decision-making and data-driven business processes.
4. Data manipulation language
The data manipulation language (Data Manipulation Language, DML) is a subset of the structured query language (Structured Query Language, SQL) used to manage and manipulate data in a relational database. DML includes the statements and commands for inserting, updating, deleting, and retrieving data in database tables. The main DML statements are: SELECT, retrieve data from one or more tables; INSERT, add a new row (record) to a table; UPDATE, modify existing rows in a table; DELETE, delete rows from a table. These statements allow the user to interact with the data stored in the database and to perform various operations to maintain and analyze the data.
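For illustration, the four DML statements can be exercised with Python's standard `sqlite3` module. The table `orders` and its columns are hypothetical, and SQLite is used only because it ships with Python; it is not the database the application describes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A table is needed before DML can run (CREATE itself is DDL, not DML).
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")

# INSERT: add new rows.
conn.execute("INSERT INTO orders (id, amount) VALUES (?, ?)", (1, 9.5))
conn.execute("INSERT INTO orders (id, amount) VALUES (?, ?)", (2, 20.0))
# UPDATE: modify an existing row.
conn.execute("UPDATE orders SET amount = ? WHERE id = ?", (12.0, 1))
# DELETE: remove a row.
conn.execute("DELETE FROM orders WHERE id = ?", (2,))
# SELECT: retrieve the remaining data.
rows = conn.execute("SELECT id, amount FROM orders").fetchall()
conn.close()
```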
5. Data definition language
The data definition language (Data Definition Language, DDL) is a subset of SQL used to define and manage data structures in relational databases. DDL consists mainly of operations that create, modify, and delete database objects. The main DDL operations are: CREATE, create database objects such as tables, indexes, views, and triggers; ALTER, modify the structure of existing database objects, such as adding or deleting table columns, modifying column data types, or renaming tables; DROP, delete existing database objects such as tables, indexes, and views; TRUNCATE, delete all data in a table while keeping the table structure and definition; RENAME, rename database objects, for example a table. Through these operations, a user can define and manage the data structures in the database to meet different data storage and analysis requirements.
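A short sketch of these DDL operations, again using Python's standard `sqlite3` module purely for illustration, follows. The object names are hypothetical. Dialects differ: SQLite expresses renaming as `ALTER TABLE ... RENAME TO` and has no `TRUNCATE` statement, so TRUNCATE is omitted here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# CREATE: define a table and an index on it.
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY)")
conn.execute("CREATE INDEX idx_t_id ON t (id)")
# ALTER: add a column to the existing table.
conn.execute("ALTER TABLE t ADD COLUMN name TEXT")
# RENAME (spelled ALTER TABLE ... RENAME TO in SQLite).
conn.execute("ALTER TABLE t RENAME TO t2")
# Inspect the resulting structure: column names of the renamed table.
columns = [row[1] for row in conn.execute("PRAGMA table_info(t2)")]
# DROP: delete the table (its index is removed with it).
conn.execute("DROP TABLE t2")
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
conn.close()
```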
In practical applications, a database built on a storage-compute separation architecture (for example, an OLAP database) may include a computing device cluster (containing one or more computing devices) and a remote storage database. The remote storage database may record different service data in different data tables and then combine the service data in a data table into data blocks, each of which may be stored in a file in the remote storage database. When a user wants to query certain service data in a data table, the user can read it from the data table of the remote storage database through a SELECT statement. However, in the conventional technology, all communication between a computing device and the remote storage database goes over the network, and this remote transmission introduces latency. The computing device therefore incurs a data processing delay, which seriously affects data processing efficiency.
In order to reduce the data query time delay and improve the data query efficiency and the data processing efficiency, the application provides a data caching mechanism, which can cache service data (which can be called hot data) with higher query heat in a remote storage database into a local disk of computing equipment in advance, so that when a data processing instruction is received, the computing equipment can acquire the service data required by data processing from the local disk, thereby reducing network access between the computing equipment and the remote storage database and improving the data query efficiency; moreover, since the service data is cached in the local disk of the computing device, no other cache devices need to be introduced, and therefore no additional maintenance cost is generated.
Specifically, the data caching scheme provided in the embodiment of the present application may generally include the following. First, for any computing device, service data with higher query heat may be obtained. Here, the service data is data that has been queried by the computing device and whose query heat is higher than a preset threshold. The preset threshold may be a frequency threshold for the query frequency, so a query heat higher than the preset threshold means a query frequency higher than the frequency threshold; the preset threshold may be configured according to specific service requirements, for example as 1, 2, or 10. The data block in which the service data is located in the remote storage database may then be obtained and used as a storage data block to be cached to a local disk of the computing device. It should be appreciated that the remote storage database typically stores service data as follows: part of the service data belonging to the same data table forms a data block, and the data block is stored in a file (for ease of distinction, a file in the remote storage database may be called a remote file). When data is cached, each data block in the remote storage database is cached, one by one, into a single file (for ease of distinction, a file in a local disk of the computing device may be called a local file) or into several consecutive files. Therefore, any data block that is stored in the remote storage database and contains hot data (that is, service data whose query heat is higher than the preset threshold) can be used as a storage data block to be cached.
Further, after the storage data block to be cached is acquired, remote storage information and remote attribute information of the storage data block in the remote storage database can be acquired. The remote storage information is information that reflects the storage position of the storage data block in the remote storage database, that is, which remote file the block is stored in and at what offset within that remote file. The remote attribute information is information that reflects the storage attributes of the storage data block in the remote storage database, such as the physical file name (which identifies the data table to which the storage data block belongs in the remote storage database), the partition identifier, and the sequence identifier of the storage data block. According to the remote storage information and the remote attribute information of the storage data block, a local file name can be determined; the computing device may then create a local file with that local file name in the local disk and cache the storage data block in the local file.
Notably, because the local disk of the computing device caches the service data with higher query heat, and that service data is queried frequently, the data query requests received by the computing device are very likely to target the hot data. The computing device can then quickly read the corresponding service data from the local disk and return it to the query object (such as the user who initiated the data query request) without remote network access to the remote storage database, which saves query time and improves data query efficiency. In addition, the data blocks in the remote storage database are cached block by block in local files in the local disk, so the service data of one data block in the remote storage database is cached in the same file in the local disk. During a data query, the name of the local file that caches a storage data block can be located quickly from the remote storage information and the remote attribute information of the storage data block, and the service data to be queried can be read from that local file without traversing all the files in the local disk. The service data can therefore be read efficiently, which further improves data query efficiency.
The scheme provided by the embodiment of the application can be applied to any application scenario in which a storage-compute separation architecture is required to store service data, including but not limited to short video push scenarios, game scenarios, etc.
A short video push scene may refer to a scene in which video data is continuously pushed to a user, who may request update display of next video data by performing an operation of pulling video data (e.g., an operation of sliding a video display interface of a terminal device). In a short video push scenario, the user may continually refresh through the different video data by continually performing the operation of pulling the video data.
A game scenario may refer to a scenario in which a user uses a terminal device to play an immersive game.
In summary, according to the scheme of the data caching provided by the embodiment of the application, the data query efficiency can be improved, the data processing efficiency is improved, and the service coverage (such as expanding the applicable scene) is effectively improved to a certain extent.
It should be noted that, the above-mentioned several application scenarios are only examples, and do not limit the application scenarios applicable to the scheme provided in the embodiments of the present application.
Further, the solution provided by the embodiments of the present application may be executed by a computer device, where the computer device may refer to any computing device in a storage-compute separation architecture (a computing device performing the solution may be referred to as a target computing device). The computing device may be a terminal or a server, or may include both a terminal and a server. To facilitate understanding of the solution provided in the embodiments of the present application, an application scenario related to the embodiments of the present application is described below in conjunction with the system schematic diagram shown in fig. 1; fig. 1 is a schematic architecture diagram of a system provided in an exemplary embodiment of the present application, where, as shown in fig. 1, the system includes a terminal 101 and a server 102; wherein:
1) The terminal 101 may comprise a terminal device used by a user. Of course, the terminal that provides the scheme of the embodiment of the application differs according to the application scenario and the field to which the scheme is applied. The terminal device may include, but is not limited to: smartphones (e.g., smartphones running Android or iOS), tablet computers, portable personal computers, mobile internet devices (Mobile Internet Devices, MIDs), vehicle devices, head-mounted devices, smart home devices, and smart voice interaction devices. The embodiment of the application does not limit the type of terminal device.
For example, in a short video push scenario, the terminal device may be a smart phone; that is, in this implementation, the solution provided in the embodiment of the present application may be deployed on the smart phone. When a user uses a short video pushing application on the smart phone, a storage device (such as a server) corresponding to a remote storage database can acquire related service data generated by the user in the short video pushing application (such as viewed historical video data, liked historical video data, and collected historical video data) and store it in the remote storage database. The smart phone can identify service data with higher query heat on the device (such as historical video data that the user has liked) and acquire the data blocks in which that service data is located from the remote storage database, where any such data block can be used as a storage data block to be cached. For each storage data block, the smart phone can determine the local file name corresponding to the storage data block in the local disk according to the block's remote storage information and remote attribute information in the remote storage database; based on the local file name, the smart phone can create a local file with that name and cache the storage data block in the local file. Thus, after receiving a data query request for the service data, the smart phone can quickly read the service data from the local disk and return it to the user. For another example, in an intelligent vehicle-mounted scenario, the application program in which the scheme provided by the embodiment of the application is deployed is a vehicle-mounted application program; the types of in-vehicle applications may include, but are not limited to: music, video, or games, etc.
Wherein an application may refer to a computer program that performs some particular task or tasks; the application programs are classified according to different dimensions (such as the running mode, the function and the like of the application programs), and the types of the same application program under different dimensions can be obtained. For example: the applications may include, but are not limited to, by way of their manner of operation: a client installed in a terminal, an applet that can be used without downloading an installation (as a subroutine of the client), a World Wide Web (Web) application opened through a browser, and the like. And the following steps: applications may include, but are not limited to, by functional type of application: instant messaging (Instant Messaging, IM) applications, content interaction applications, audio applications or video applications, and so forth. Wherein, the instant messaging application program refers to an application program of instant messaging and social interaction based on internet, and the instant messaging application program can include but is not limited to: an application program containing a communication function, a map application program containing an interactive function, a game application program, and the like. The content interaction application is an application capable of realizing content interaction, and may be, for example, an application such as a sharing platform, personal space, news, and the like. An audio application refers to an application that implements audio functions based on the internet, and may include, but is not limited to: music applications with music playing and editing capabilities, radio applications with radio playing capabilities, live broadcast applications with live broadcast capabilities, etc. 
A video application refers to an application capable of playing video, and may include, but is not limited to: applications with short videos (where the video length is often short, e.g., seconds or minutes), applications with long videos (where the video is typically longer, such as movies or television shows), etc.
Of course, the solution provided in the embodiment of the present application may be directly deployed in a device (such as a smart phone) or deployed outside an application program as described above, and may also be deployed in a device or an application program in a plug-in form. The embodiment of the application does not limit the carrier of the deployment scheme.
2) The server 102 may be a server corresponding to the terminal, used for data interaction with the terminal to provide computing and application service support for the terminal. Specifically, the server is a background server corresponding to an application deployed in the terminal, and is configured to interact with the terminal to provide computing and application services for the application. The server 102 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (Content Delivery Network, CDN), and big data and artificial intelligence platforms.
The terminal 101 and the server 102 may be directly or indirectly connected through wired or wireless communication, which is not limited herein. In addition, the number of terminals and servers is not limited in the embodiment of the present application; the number of terminals 101 and servers 102 in fig. 1 is merely an example, and a practical application may include a plurality of servers deployed in a distributed manner, which is not limited herein.
The general flow of the data caching scheme in the application scenario is described below with reference to the system shown in fig. 1. In the specific implementation, firstly, a storage data block to be cached can be obtained, wherein the storage data block is defined based on each data block in a remote storage database; further, remote storage information and remote attribute information of the storage data block to be cached in the remote storage database can be obtained; according to the remote storage information and the remote attribute information, a local file name corresponding to the local cache can be determined; a local file named the local file name may be created in the local disk, and the stored data block is cached in the local file. The computer device may then efficiently provide data query functionality to the user based on the locally cached stored data block.
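The flow above (obtain the block, derive local file names from its remote position and attributes, create the local files, cache the block) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the functions `cache_block` and `read_block`, the attribute prefix `"16384.1.0.0"`, and the 64-byte `FILE_SIZE` are all hypothetical, and the on-disk layout (local file k mirrors bytes `[k * file_size, (k + 1) * file_size)` of the remote file) is an assumption the description does not spell out. File identifiers are taken as the integer quotient of the byte position by the default file size.

```python
import os
import tempfile

FILE_SIZE = 64  # hypothetical default local file size in bytes (the text uses e.g. 64M)


def cache_block(cache_dir, attrs, offset, data, file_size=FILE_SIZE):
    """Cache one storage data block; return the local file names used.

    Assumed layout: local file k mirrors bytes [k*file_size, (k+1)*file_size)
    of the remote file, so the block keeps its relative position.
    """
    end = offset + len(data)
    start_id = offset // file_size  # start file identifier
    end_id = end // file_size       # end file identifier
    names = []
    for file_id in range(start_id, end_id + 1):
        name = f"{attrs}.{file_id}"  # attributes joined with the file identifier
        lo = max(offset, file_id * file_size)
        hi = min(end, (file_id + 1) * file_size)
        path = os.path.join(cache_dir, name)
        # Update the file in place if it exists, otherwise create it.
        mode = "r+b" if os.path.exists(path) else "w+b"
        with open(path, mode) as f:
            f.seek(lo - file_id * file_size)
            f.write(data[lo - offset:hi - offset])
        names.append(name)
    return names


def read_block(cache_dir, attrs, offset, size, file_size=FILE_SIZE):
    """Read a cached block back by computing its file names directly."""
    end = offset + size
    parts = []
    for file_id in range(offset // file_size, end // file_size + 1):
        lo = max(offset, file_id * file_size)
        hi = min(end, (file_id + 1) * file_size)
        with open(os.path.join(cache_dir, f"{attrs}.{file_id}"), "rb") as f:
            f.seek(lo - file_id * file_size)
            parts.append(f.read(hi - lo))
    return b"".join(parts)


with tempfile.TemporaryDirectory() as cache_dir:
    block = bytes(range(68))  # a 68-byte block at remote offset 70
    names = cache_block(cache_dir, "16384.1.0.0", offset=70, data=block)
    restored = read_block(cache_dir, "16384.1.0.0", offset=70, size=len(block))
```

Because the file names are computed from the block's remote position and attributes, the read path opens exactly the files it needs and never lists the cache directory.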
Based on the above described scheme and system architecture, the following points should be described:
(1) the system shown in fig. 1 mentioned above in the embodiment of the present application is for more clearly describing the technical solution of the embodiment of the present application, and does not constitute a limitation on the technical solution provided by the embodiment of the present application. As can be appreciated by those skilled in the art, with the evolution of the system architecture and the appearance of new service scenarios, the technical solution provided in the embodiments of the present application is also applicable to similar technical problems. For example, the foregoing describes an application scenario of the present application by taking an example in which the execution body "computer device" in the embodiment of the present application includes a terminal and a server, that is, the solution provided in the embodiment of the present application is executed by the terminal and the server together; it should be understood that, in practical applications, the computer device may also be a terminal or a server, that is, support the solution provided by the embodiments of the present application executed by the terminal or the server alone.
(2) In the embodiment of the application, relevant data collection and processing should strictly follow the requirements of relevant laws and regulations: obtaining personal information requires informing the personal information subject and obtaining their consent (or having another legal basis for acquiring the information), and subsequent data use and processing must stay within the scope authorized by laws and regulations and by the personal information subject. For example, when the embodiment of the present application is applied to a specific product or technology, such as when feature data of a user is acquired, permission or consent of the user needs to be obtained, and the collection (such as of the above-mentioned historical video data that the user has liked), use, and processing (such as video data recommendation for the user based on that historical video data) of relevant data need to comply with the relevant laws, regulations, and standards of the relevant regions.
Based on the above-described scheme, the embodiment of the present application proposes a more detailed data processing method, and the data processing method proposed by the embodiment of the present application will be described in detail below with reference to the accompanying drawings.
Referring to fig. 2, fig. 2 is a schematic flow chart of a data processing method according to an exemplary embodiment of the present application, where the flow may refer to the flow of the data caching scheme according to an embodiment of the present application. The data processing method may be performed by a computer device in the aforementioned system, such as a terminal and/or a server. In practical applications, the computer device refers to a target computing device in a storage-compute separation architecture, where the architecture includes a computing device cluster and a remote storage database; the computing device cluster is responsible for computing, the remote storage database is responsible for storage, and the target computing device may refer to any computing device in the computing device cluster. The data processing method may at least include the following steps S101 to S104:
Step S101, obtaining a storage data block to be cached; the storage data block refers to any one of the data blocks stored in the remote storage database, and the storage data block contains target service data, wherein the target service data refers to service data with the query heat higher than a preset threshold.
In this application, service data generated in any service application or service scenario may be stored in a remote storage database in a storage-compute separation architecture. The remote storage database may include, but is not limited to: cloud storage database, distributed object storage, global storage area network. In the remote storage database, part of the service data from the same data table is combined to form a data block (the data block is the minimum unit of storage access in the application and can be called a silo), and the data block is then stored in a remote file of the remote storage database. Based on the above, the data blocks in the remote storage database can be cached locally on the computer device one by one.
Because the space of the local disk of the computer device is limited, the application caches only hot data with higher query heat into the local disk of the computing device. When data is cached, the service data that counts as hot data with higher query heat for the target computing device can be identified first. For example, if some service data is queried one or more times by the computing device, the service data may be determined to be hot data with high query heat, which may be cached from the remote storage database to the local disk. That is, the basis for determining whether service data is hot data with higher query heat in the present application may be: configure a query frequency threshold as the preset threshold of query heat, compare the actual query frequency of the service data with the preset threshold, and if the actual query frequency reaches or exceeds the preset threshold, consider the service data to be hot data with higher query heat and cache it to the local disk. The query frequency threshold may be set based on actual service requirements, for example, may be set to 0, 1, 10, etc.
It should be appreciated that hot data that needs to be cached locally may be referred to as target service data, and the present application may cache the target service data locally based on how it is stored in the remote storage database. Specifically, the data block in which the target service data is located in the remote storage database can be obtained first and used as a storage data block to be cached locally; the position in the local disk at which the storage data block should be cached can then be determined based on the related information of the storage data block in the remote storage database.
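The threshold comparison described above can be sketched as follows. This is a minimal illustration, assuming query heat is measured as a simple count of queries per data key; the function name `hot_keys`, the key strings, and the threshold value are hypothetical.

```python
from collections import Counter


def hot_keys(query_log, threshold):
    """Return the keys whose query count reaches or exceeds the threshold."""
    counts = Counter(query_log)
    return {key for key, n in counts.items() if n >= threshold}


# Hypothetical query log: each entry names the service data that was queried.
query_log = ["video_7", "video_3", "video_7", "video_9", "video_7", "video_3"]
hot = hot_keys(query_log, threshold=2)  # video_9 was queried only once
```

A production system would likely bound the counting window or decay old counts so that stale data stops qualifying as hot; that policy is outside what the description specifies.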
Step S102, obtaining remote storage information and remote attribute information of a storage data block to be cached in a remote storage database; the remote storage information is used for indicating the storage position of the storage data block in the remote database, and the remote attribute information is used for indicating the storage attribute of the storage data block in the remote database.
In the application, because the related information (such as the storage position and the storage attribute) of the storage data block in the remote storage database is fixed, based on the characteristic, the application can determine the cache position in the local disk of the target computing device for the storage data block by utilizing the remote storage information and the remote attribute information of the storage data block in the remote storage database.
That is, the remote storage information in the present application may refer to information capable of reflecting a storage location of a storage data block in a remote database, and the remote storage information may specifically include, but is not limited to: the identification of the storage data block in the remote storage database (which may be used to uniquely characterize the storage data block, which may be a number, an ID, a name, etc.), the remote file in which the storage data block is located in the remote storage database (i.e., the remote file in which the storage data block is stored, which may be referred to as a remote storage file), the offset of the storage data block in the remote storage file, the space occupied by the storage data block in the remote storage file (which may be understood as the amount or size of data occupied by the storage data block). The remote storage information may be obtained from an object metadata table corresponding to the remote storage database, where the object metadata table mainly records related description information of each data block, and the related description information includes a correspondence between an identifier of each data block and the corresponding description information. For example, the object metadata table may be as shown in Table 1:
TABLE 1
The object metadata table shown in table 1 may be used to describe each data block in a certain remote storage file. For example, the description information of the data block identified as 1 may include at least an offset in the remote storage file, an occupied size (i.e., the size of the data block), a partition identifier (shardid, used to characterize which partition in the remote storage database the stored data block is in), and an additional sequence identifier (extraseqID). Through the object metadata table, remote storage information of any data block can be obtained.
Remote attribute information may refer to information capable of reflecting the storage attributes of a stored data block in a remote storage database, and may include, but is not limited to: a physical file name (relfilenode) of the stored data block in the remote storage database, an attribute value (attrnum, which is used primarily to identify attributes in the remote storage file), a partition identifier (shardid, which is used to characterize which partition in the remote storage database the stored data block is in), and an additional sequence identifier (extraseqID). The remote attribute information can be obtained from the object metadata table, in which all metadata (description information) corresponding to each data block is recorded.
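One way to picture an entry of the object metadata table is as a record holding both kinds of information. The sketch below is illustrative only: the class name `BlockMetadata`, the method names, and all field values are hypothetical, and the field set simply follows the items named in the text (identifier, remote file, offset, size, relfilenode, attrnum, shardid, extraseqID).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockMetadata:
    """One row of a hypothetical object metadata table."""
    block_id: int      # identifier of the data block
    remote_file: str   # remote storage file that holds the block
    offset: int        # byte offset of the block inside the remote file
    size: int          # bytes the block occupies in the remote file
    relfilenode: int   # physical file name (identifies the data table)
    attrnum: int       # attribute value
    shardid: int       # partition identifier
    extraseqid: int    # additional sequence identifier

    def storage_info(self):
        """Remote storage information: where the block is stored."""
        return (self.remote_file, self.offset, self.size)

    def attribute_info(self):
        """Remote attribute information: the block's storage attributes."""
        return (self.relfilenode, self.attrnum, self.shardid, self.extraseqid)


# Hypothetical table contents, keyed by block identifier.
object_metadata = {
    m.block_id: m
    for m in [
        BlockMetadata(1, "remote_file_a", 0, 70, 16384, 1, 0, 0),
        BlockMetadata(2, "remote_file_a", 70, 68, 16384, 1, 0, 1),
    ]
}
block = object_metadata[2]
```

Looking up a block by its identifier then yields both the remote storage information and the remote attribute information in one step.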
Step S103, according to the remote storage information and the remote attribute information, determining the local file name of the corresponding cache of the storage data block in the target computing device.
In the application, the local file name under which the storage data block is cached in the target computing device can be determined according to the remote storage information and the remote attribute information. A specific implementation may include, but is not limited to, the following. First, a default file size configured for the local disk can be obtained. The default file size is a preconfigured, fixed amount of data that one local file stores; it specifies the size of any file in the local disk, that is, how much data any one file can cache. The default file size can be 64M, 128M, 70M, and the like, which is not limited herein. Then, according to the default file size and the remote storage information, a local file identifier (a file identifier in the application may refer to a file ID) under which the storage data block is cached in the target computing device can be determined. Specifically, a method for determining the cache file identifier can be configured; applying this method to the default file size and the remote storage information yields an operation result, which can be used as the local file identifier of the storage data block in the target computing device.
It should be understood that, because the amount of data one local file can cache is fixed in the present application (that is, the default file size is fixed), one storage data block may not fit entirely in a single local file and may need to be cached across two or more local files. The present application may therefore configure a start identifier mapping rule and an end identifier mapping rule. Applying the start identifier mapping rule to the default file size and the remote storage information yields the identifier of the first local file (the start cache file) of the storage data block; similarly, applying the end identifier mapping rule to the default file size and the remote storage information yields the identifier of the last local file (the end cache file) of the storage data block.
That is, in a specific implementation, the specific process of determining the local file identifier of the corresponding cache of the storage data block in the target computing device according to the default file size and the remote storage information may include, but is not limited to: firstly, according to a starting identifier mapping rule, carrying out starting identifier mapping processing on the remote storage information and a default file size, thereby obtaining a starting file identifier; likewise, according to the end identifier mapping rule, the remote storage information and the default file size can be subjected to end identifier mapping processing, so that an end file identifier can be obtained; further, a local file identifier of the corresponding cache of the stored data block in the target computing device can be determined according to the initial file identifier and the end file identifier. It should be noted that, according to the identification mapping rule in the present application, the start file identification may be smaller than the end file identification, and the present application may determine the start file identification, the intermediate file identification between the start file identification and the end file identification, and the end file identification as the local file identification corresponding to the cache of the storage data block in the target computing device. For example, the start file identifier is 1, the end file identifier is 5, and then the intermediate file identifiers are 2, 3, and 4, and 1, 2, 3, 4, and 5 can be determined as the local file identifiers of the corresponding caches of the storage data blocks.
For ease of understanding, please refer to formula (1), formula (1) may be used to characterize the start identifier mapping rule configured in the present application, as shown in formula (1):
start_fileid = offset / file_size    formula (1)
In formula (1), start_fileid characterizes the start file identifier; offset characterizes the offset of the storage data block in the remote storage file, taken from the remote storage information; file_size characterizes the default file size configured for local files, e.g., 64M.
Under the start identifier mapping rule shown in formula (1), computing the start file identifier requires the offset of the storage data block in the remote storage file (i.e., the file that stores the storage data block in the remote storage database), which is included in the remote storage information. That is, in a specific implementation, the process of applying the start identifier mapping rule to the remote storage information and the default file size to obtain the start file identifier may include, but is not limited to, the following: first, for ease of distinction, the offset of the storage data block within the remote storage file is referred to as the remote offset; then, according to the start identifier mapping rule, the quotient of the remote offset divided by the default file size is obtained (called the first quotient result, to distinguish it from the one that follows); the first quotient result is used as the start file identifier.
Accordingly, for ease of understanding, please refer to formula (2), formula (2) may be used to characterize the end identifier mapping rule configured in the present application, as shown in formula (2):
end_fileid = (offset + size) / file_size    formula (2)
In formula (2), end_fileid characterizes the end file identifier; offset characterizes the offset of the storage data block in the remote storage file, taken from the remote storage information; size characterizes the amount of data that the storage data block occupies in the remote storage file (i.e., the size of the storage data block); file_size characterizes the default file size configured for local files, e.g., 64M.
Under the end identifier mapping rule shown in formula (2), computing the end file identifier requires both the offset of the storage data block in the remote storage file (that is, the file that stores the storage data block in the remote storage database) and the amount of data the storage data block occupies in the remote storage file (referred to as the storage data amount), both of which are included in the remote storage information. That is, in a specific implementation, the process of applying the end identifier mapping rule to the remote storage information and the default file size to obtain the end file identifier may include, but is not limited to, the following: first, for ease of distinction, the offset of the storage data block in the remote storage file is referred to as the remote offset, and the storage data amount occupied by the storage data block in the remote storage file is referred to as the remote data amount; then, according to the end identifier mapping rule, the remote offset and the remote data amount are summed; next, the quotient of the summation result divided by the default file size is determined (the second quotient result); the second quotient result is determined as the end file identifier.
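The two mapping rules can be written side by side as follows. This restatement assumes that the "quotient" in both rules is the integer (floor) quotient, which is consistent with the description's worked values (an offset of 70M with a 64M default file size gives identifier 1); the symbol N_files is introduced here only for illustration.

```latex
\begin{aligned}
\text{start\_fileid} &= \left\lfloor \frac{\text{offset}}{\text{file\_size}} \right\rfloor, \\
\text{end\_fileid}   &= \left\lfloor \frac{\text{offset} + \text{size}}{\text{file\_size}} \right\rfloor, \\
N_{\text{files}}     &= \text{end\_fileid} - \text{start\_fileid} + 1 .
\end{aligned}
```

One boundary case may be worth checking in an implementation: when offset + size is an exact multiple of file_size, formula (2) as written yields an identifier one past the last local file that actually holds bytes of the block.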
Further, after the start file identifier and the end file identifier are determined, every intermediate file identifier lying between them can be obtained, and the start file identifier, the intermediate file identifiers, and the end file identifier are together determined as the local file identifiers under which the stored data block is cached in the target computing device. That is, each start, intermediate, and end file identifier may be referred to as a local file identifier. After the local file identifiers are determined, the following processing may be performed for each local file identifier according to the file name generation rule configured in the application: the local file identifier and the remote attribute information are fused to obtain fusion information, where fusion means splicing (concatenation). Since the remote attribute information comprises a physical file name (relfilenode), an attribute value (attrnum), a partition identifier (shardid), and an additional sequence identifier (extraseqID), splicing these with the local file identifier yields fusion information that uniquely corresponds to the local file identifier; the fusion information can therefore be determined as the local file name under which the stored data block is cached in the target computing device.
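For ease of understanding, the determination of the local file identifiers described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the implementation of the application: the function names are hypothetical, formula (1) is assumed to be the integer quotient of the remote offset by the default file size, and all sizes are expressed in bytes.

```python
MB = 1024 * 1024  # sizes are expressed in bytes (an assumption of this sketch)


def start_file_id(offset, file_size):
    # Assumed form of formula (1): integer quotient of the remote offset
    # by the default file size.
    return offset // file_size


def end_file_id(offset, size, file_size):
    # Formula (2): sum the remote offset and the remote data amount, then
    # take the integer quotient by the default file size.
    return (offset + size) // file_size


def local_file_ids(offset, size, file_size):
    # Start, intermediate, and end file identifiers in ascending order.
    first = start_file_id(offset, file_size)
    last = end_file_id(offset, size, file_size)
    return list(range(first, last + 1))


# Hypothetical usage: a block at remote offset 70M occupying 68M, 64M files.
ids = local_file_ids(70 * MB, 68 * MB, 64 * MB)
```

Note that, taken literally, formula (2) yields an end file identifier one past the last occupied file when the sum of offset and size is an exact multiple of the default file size; the sketch follows the formula as stated and does not special-case this.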
For example, assume that a stored data block is silo2 in a remote storage file, the default file size in the local disk is 64M, the remote offset of the stored data block is 70M, and the storage data amount it occupies is 68M. Based on formula (1), the start file identifier of silo2 is determined to be 1, and based on formula (2), the end file identifier of silo2 is determined to be 2 (the quotient of 70M + 68M = 138M by 64M). The start file identifier 1 and the end file identifier 2 can each be used as a local file identifier. Splicing the start file identifier 1 with the relfilenode, attrnum, shardid, and extraseqID of silo2 yields the start local file name (such as relfilenode.attrnum.shardid.extraseqID.1); splicing the end file identifier 2 with the relfilenode, attrnum, shardid, and extraseqID of silo2 yields the end local file name (such as relfilenode.attrnum.shardid.extraseqID.2). Both the start local file name and the end local file name are local file names. Notably, since the remote attribute information of each silo in the remote storage database differs, and the remote storage information of different silos within a remote storage file also differs, the determined local file names differ as well; with a unique local file name, the cache location of each silo in the local disk (for example, the name of the local file that caches the silo) can be looked up efficiently and accurately.
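The splicing of the remote attribute information with a local file identifier can be illustrated as follows. This is only a sketch: the function name, the "." separator, and the concrete attribute values are assumptions chosen for illustration and are not specified by the application.

```python
def local_file_name(relfilenode, attrnum, shardid, extraseqid, file_id):
    # Splice the four remote attributes and the local file identifier,
    # using "." as the separator (the separator is an assumption).
    parts = (relfilenode, attrnum, shardid, extraseqid, file_id)
    return ".".join(str(p) for p in parts)


# Hypothetical attribute values for a block whose local file identifiers
# are 1 (start) and 2 (end).
names = [local_file_name(16384, 3, 7, 0, fid) for fid in (1, 2)]
```

Because the name is derived purely from the attributes and the identifier, the same inputs always reproduce the same name, which is what allows the name to be restored later at query time.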
Step S104, a local file uniquely indicated by the local file name is created in the local disk of the target computing device, and the stored data block is cached to the local file uniquely indicated by the local file name.
In the application, after determining the local file name of the storage data block corresponding to the cache in the target computing device, a local file named as the local file name can be created in a local disk of the target computing device to serve as a local file uniquely indicated by the local file name, and then the storage data block can be cached in the local file uniquely indicated by the local file name.
It should be appreciated that since the default file size for each local file within the target computing device is limited and the size of the individual stored data blocks is non-fixed, there may be situations where one local file cannot completely cache a stored data block, which may require two or more local files to cache the stored data block. In other words, the number of the local file names is one or more, and one local file name may uniquely indicate one local file, so the number of local files storing the data block is also one or more. Taking at least two local file names as an example, if the local file names include a first local file name and a second local file name, when creating a local file in a local disk, one local file (which may be referred to as a first local file) uniquely indicated by the first local file name needs to be created in the local disk, and one local file (which may be referred to as a second local file) uniquely indicated by the second local file name needs to be created in the local disk.
Further, the stored data block may be cached in the first local file and the second local file. It should be noted that, when caching data to local files in the local disk, since a default file size is set for each local file, when caching data to a different local file, the cache space of the previous local file should be utilized, and then the next local file should be considered to be continuously filled with data. For example, when the storage data block to be cached is 80M and the default file size in the local disk is 64M, if the local file corresponding to the storage data block to be cached includes local file 1 and local file 2, and both local file 1 and local file 2 are empty, the storage data block needs to be split so as to cache a portion of data with the data size of 64M in the storage data block into local file 1, and then cache the remaining data (with the data size of 16M) in the storage data block into local file 2. Therefore, the buffer space of the local file 1 is fully utilized, and the buffer space remains in the local file 2, and if the initial local file with another stored data block corresponding to the buffer is calculated to be the local file 2, the data of the stored data block can be buffered from the local file 2, so that the buffer space of the local file 2 is utilized. 
That is, after the corresponding local files are calculated for each stored data block, the application configures the data caching order of the local files in ascending order of their file identifiers. For example, since the start file identifier is smaller than the intermediate file identifier, and the intermediate file identifier is smaller than the end file identifier, the stored data block first caches part of its data into the local file uniquely indicated by the start file identifier; when that local file has no remaining cache space, the rest of the data is cached, in turn, into the local file uniquely indicated by the intermediate file identifier and then into the local file uniquely indicated by the end file identifier. Compared with caching into files at random, the caching mode configured herein makes fuller use of the cache space of each local file and reduces the waste caused by idle cache space.
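The ascending-order filling described above can be sketched as a function that splits one stored data block into per-file segments. The function name and the tuple layout (file identifier, offset within the file, length) are hypothetical, and sizes are assumed to be in bytes.

```python
MB = 1024 * 1024


def split_block(offset, size, file_size):
    # Walk the byte range [offset, offset + size) and cut it at every
    # multiple of file_size, so each local file is filled to its end
    # before the next (larger-identifier) file receives any data.
    segments = []
    pos, end = offset, offset + size
    while pos < end:
        file_id, seek = divmod(pos, file_size)
        length = min(file_size - seek, end - pos)
        segments.append((file_id, seek, length))
        pos += length
    return segments


# The 80M example from the text, assuming the block starts at the
# beginning of local file 1 (remote offset 64M).
segments = split_block(64 * MB, 80 * MB, 64 * MB)
```

In this example the first segment fills local file 1 completely (64M) and the second places the remaining 16M at the start of local file 2.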
Based on the above, after obtaining the local files indicated by the local file names, the data buffering order between the local files can be determined according to the numerical value between the local file identifiers. Taking the first local file and the second local file as an example, assuming that the data buffering sequence indicated by the first local file name precedes the data buffering sequence indicated by the second local file name, it can be seen that the stored data block needs to be preferentially buffered in the first local file, and the specific process for buffering the stored data block in the local file uniquely indicated by the local file name may include, but is not limited to: acquiring a default file size configured for a local disk; the default file size is used for designating the size of any one file in the local disk; then, a starting offset determination rule may be obtained, and according to the starting offset determination rule, offset calculation may be performed on the default file size and remote storage information to obtain a starting file offset of a storage data block in the first local file, where it should be understood that, based on the foregoing, since the default file size is limited, when a certain storage data block is cached, it may be necessary to split the storage data block to cache it into different local files, and then, for any storage data block, when the storage data block is cached, a part of data of another storage data block may be already cached in the first local file (the one indicated by the starting file identifier) of the storage data block, and then, within the starting local file, the storage data block cannot be cached from scratch. 
Based on this, for any one storage data block, the default file size of the local disk and the remote storage information of the storage data block need to be calculated according to the start offset determining rule, so as to calculate the start file offset of the storage data block in the start local file. For ease of understanding, please refer to equation (3), equation (3) may be used to characterize the start offset determination rule configured herein, as shown in equation (3):
start_file_seek = offset % file_size        formula (3)
Here, start_file_seek in formula (3) characterizes the start file offset of the stored data block within the start local file (e.g., the first local file), i.e., the position in the start local file at which caching begins; offset characterizes the offset of the stored data block in the remote storage file, taken from the remote storage information; file_size characterizes the default file size, e.g., 64M; and % denotes the remainder (modulo) operation.
Based on the start offset determination rule shown in formula (3), the remote storage information includes the offset of the stored data block in the remote storage file (i.e., the file in the remote storage database that stores the stored data block), and both the start file identifier and the start file offset of the stored data block in the start local file are calculated from this remote offset. Accordingly, in a specific implementation, the process of performing offset calculation on the default file size and the remote storage information according to the start offset determination rule, to obtain the start file offset of the stored data block in the first local file, may include, but is not limited to, the following: first, for convenience of distinction, the offset of the stored data block in the remote storage file is determined as the remote offset; then, a first remainder result of the remote offset with respect to the default file size is determined (i.e., dividing the remote offset by the default file size yields a quotient and a remainder; the quotient can be used as the start file identifier of the stored data block, and the remainder is the first remainder result); finally, the first remainder result is determined as the start file offset of the stored data block in the first local file.
Similarly, according to the end offset determination rule, offset calculation may be performed on the default file size and the remote storage information to obtain the end file offset of the stored data block in the second local file, that is, the position in the end local file (e.g., the second local file) at which caching of the stored data block ends. For ease of understanding, please refer to formula (4), which may be used to characterize the end offset determination rule configured herein:
end_file_seek = (offset + size) % file_size        formula (4)
Here, end_file_seek in formula (4) characterizes the end file offset of the stored data block within the end local file (e.g., the second local file), i.e., the position in the end local file at which caching ends; offset characterizes the offset of the stored data block in the remote storage file, taken from the remote storage information; size characterizes the storage data amount occupied by the stored data block in the remote storage file; file_size characterizes the default file size, e.g., 64M; and % denotes the remainder (modulo) operation.
Based on the end offset determination rule shown in formula (4), the remote storage information includes the offset of the stored data block in the remote storage file (that is, the file in the remote storage database that stores the stored data block) and also the storage data amount occupied by the stored data block in the remote storage file, and both the end file identifier and the end file offset of the stored data block in the end local file are calculated from the remote offset and the storage data amount. Accordingly, in a specific implementation, the process of performing offset calculation on the default file size and the remote storage information according to the end offset determination rule, to obtain the end file offset of the stored data block in the second local file, may include, but is not limited to, the following: first, for convenience of distinction, the offset of the stored data block in the remote storage file is determined as the remote offset, and the storage data amount occupied by the stored data block in the remote storage file is determined as the remote data amount; then, the remote offset and the remote data amount are summed; further, a second remainder result of the summation result with respect to the default file size is determined (i.e., dividing the summation result by the default file size yields a quotient and a remainder; the quotient can be used as the end file identifier of the stored data block, and the remainder is the second remainder result); finally, the second remainder result is determined as the end file offset of the stored data block in the second local file.
It should be understood that after determining the start file offset in the first local file and the end file offset in the second local file, the start file offset in the first local file to the end file offset in the second local file (i.e., the interval between the start file cache location and the end file cache location) may be determined as the cache interval location of the stored data block, and the stored data block may be written to the cache interval location.
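Writing a stored data block to its cache interval can be sketched as follows. This is an illustrative sketch, not the implementation of the application: `write_block`, `cache_dir`, and `name_prefix` (standing in for the spliced remote attribute information) are hypothetical names, and a very small file size is used in the usage example so that the result is easy to inspect.

```python
import os
import tempfile


def write_block(cache_dir, name_prefix, offset, data, file_size):
    # Write `data` starting at the start file offset of the start local
    # file and continue into following files until all data is written.
    pos, end, paths = offset, offset + len(data), []
    while pos < end:
        file_id, seek = divmod(pos, file_size)
        length = min(file_size - seek, end - pos)
        path = os.path.join(cache_dir, f"{name_prefix}.{file_id}")
        # Keep any data another block has already cached in this file.
        mode = "r+b" if os.path.exists(path) else "w+b"
        with open(path, mode) as f:
            f.seek(seek)
            f.write(data[pos - offset:pos - offset + length])
        paths.append(path)
        pos += length
    return paths


# Hypothetical usage with 8-byte local files and a 10-byte block stored
# at remote offset 10: bytes go to file 1 (offset 2) and file 2 (offset 0).
with tempfile.TemporaryDirectory() as demo_dir:
    demo_paths = write_block(demo_dir, "r.a.s.e", 10, b"ABCDEFGHIJ", 8)
    demo_names = [os.path.basename(p) for p in demo_paths]
```

Opening an existing file in "r+b" mode rather than truncating it is what lets two stored data blocks share one local file, as in the case where one block ends and another begins inside the same file.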
For ease of understanding, please refer to fig. 3, which is a schematic diagram of a data caching scenario according to an embodiment of the present application. Assume that the stored data block is silo2 in a certain remote storage file, the default file size in the local disk is 64M, the remote offset of the stored data block is 70M, and the storage data amount it occupies is 68M. Based on formula (1), the start file identifier of silo2 in the target computing device is determined to be 1, and based on formula (3), the start file offset of silo2 in the start local file is determined to be 6M (70M % 64M); based on formula (2), the end file identifier of silo2 in the target computing device is determined to be 2, and based on formula (4), the end file offset of silo2 in the end local file is determined to be 10M ((70M + 68M) % 64M). Based on the start file identifier 1, the end file identifier 2, and the remote attribute information of silo2, the start local file name of silo2 in the target computing device is relfilenode.attrnum.shardid.extraseqID.1, and the end local file name is relfilenode.attrnum.shardid.extraseqID.2. If the local file indicated by the start local file name is referred to as file 1 and the local file indicated by the end local file name is referred to as file 2, then, as shown in fig. 3, the cache interval of silo2 in the local disk runs from the 6M position of file 1 to the 10M position of file 2 (i.e., silo2 occupies only the portion from offset 6M of file 1 to offset 10M of file 2).
Similarly, assume that another stored data block is silo3 in the remote storage file, the default file size in the local disk is 64M, the remote offset of the stored data block is 138M, and the storage data amount it occupies is 30M. Based on formula (1), the start file identifier of silo3 in the target computing device is determined to be 2, and based on formula (3), the start file offset of silo3 in the start local file is determined to be 10M (138M % 64M); based on formula (2), the end file identifier of silo3 in the target computing device is determined to be 2, and based on formula (4), the end file offset of silo3 in the end local file is determined to be 40M ((138M + 30M) % 64M). Based on the start file identifier 2, the end file identifier 2, and the remote attribute information of silo3, the start local file name of silo3 in the target computing device is relfilenode.attrnum.shardid.extraseqID.2, and the end local file name is likewise relfilenode.attrnum.shardid.extraseqID.2. Then, as shown in fig. 3, the cache interval of silo3 in the local disk runs from the 10M position of file 2 to the 40M position of file 2 (i.e., silo3 occupies only the portion from offset 10M of file 2 to offset 40M of file 2). Caching according to a fixed data block size (namely, caching according to the default file size) can effectively solve the problem of too many small cache files.
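The two worked examples above can be checked with a short sketch that applies formulas (1) through (4) together. The names `CacheInterval` and `cache_interval` are hypothetical, formula (1) is assumed to be the integer quotient of the remote offset by the default file size, and sizes are in bytes.

```python
from typing import NamedTuple

MB = 1024 * 1024


class CacheInterval(NamedTuple):
    start_file_id: int
    start_file_seek: int
    end_file_id: int
    end_file_seek: int


def cache_interval(offset, size, file_size):
    # divmod returns the quotient (file identifier) and the remainder
    # (offset within that file) in one step.
    start_id, start_seek = divmod(offset, file_size)      # formulas (1), (3)
    end_id, end_seek = divmod(offset + size, file_size)   # formulas (2), (4)
    return CacheInterval(start_id, start_seek, end_id, end_seek)


silo2 = cache_interval(70 * MB, 68 * MB, 64 * MB)
silo3 = cache_interval(138 * MB, 30 * MB, 64 * MB)
```

Here silo2 ends at offset 10M of file 2 and silo3 begins at offset 10M of file 2, so the two blocks sit back to back in the same local file without a gap.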
In the embodiment of the application, for the service data with higher query heat stored in the remote storage database in the separate storage architecture, the service data can be cached in the local disk of the computing device in advance, so that the computing device can directly read the related service data from the local disk during data query, the service data is not required to be queried from the remote storage database through a network, the data transmission delay caused by network remote transmission can be reduced, and the data query efficiency and the data processing efficiency are improved. In the process of caching service data to a local disk, the application caches the service data through a storage data block where the service data is located in a remote storage database, specifically, a local file name in the local disk is determined together based on remote storage information and remote attribute information of the storage data block in the remote storage database, and then the computing device can create a local file named as the local file name in the local disk and cache the service data in the local file. By utilizing the mode of remote storage information and remote attribute information of the storage data blocks in the remote storage database, service data belonging to the same storage data block in the remote storage database can be correspondingly stored in the same local file in the local disk, so that in the data query process, all service data belonging to the same storage data block do not need to be traversed, the corresponding local file can be quickly positioned based on the remote storage information and the remote attribute information, and further, the corresponding service data can be quickly read out.
It should be understood that, since each data block in the remote storage database is generally formed by service data in the same data table, the service data in the same data table is cached in the same location by means of caching a certain data block, so that the service data in one data table can be conveniently searched. In order to further improve the data query efficiency in the local disk, the application can use a cache space in the local disk as a shared memory, and the shared memory can record the relevant index information of each local file cache in the local disk. For ease of understanding, please refer to fig. 4, fig. 4 is a schematic diagram of an index information structure according to an embodiment of the present application. As shown in fig. 4, the index information structure may include at least initialization data, a list of spare nodes, a list of used nodes, a hash list, a hash index, and a file cache entry. Details will be set forth below:
initializing data: which may be used to characterize a series of data generated when a shared memory is initialized within a local disk.
List of spare nodes: the spare (free) node list contains a plurality of spare nodes; each spare node can be understood as a free memory block that contains no data, and each spare node has an identifier (such as a number or an ID) that uniquely characterizes that spare node.
List of used nodes: the used node list contains the nodes in which information about a local file has been recorded; if a spare node is used to record information about a local file (such as its local file name), that node may be added to the used node list as a used node.
Hash list: the hash list contains a node record space for each data table, and the node record space records the identifiers of the used nodes corresponding to that data table. For example, if two local file identifiers are calculated for a certain stored data block, the stored data block needs two local files for caching; in the index information structure, two spare nodes then need to be applied for, so that the file information of each local file is recorded one by one (one spare node records the file information of one local file). The two applied spare nodes become used nodes and are connected to the head of the used node list in the order of their local file identifiers (of the two used nodes, the one whose local file identifier is smaller is placed nearer the front of the used node list). Then the data table in which the stored data block is located is obtained, and the identifiers of the two used nodes are added, in order, to the node record space corresponding to that data table in the hash list. For example, suppose the start file identifier calculated for the stored data block is 1 and the end file identifier is 2, the spare node applied for start file identifier 1 is node 30, and the spare node applied for end file identifier 2 is node 40. Then 30-40 may be added to the hash list (node identifier 30 precedes node identifier 40). From the node identifiers 30-40, the nodes recording the local file information of the data table are determined to be node 30 and node 40, and from the connection order 30-40 it is known that node 30 records the information of the local file indicated by the start file identifier, and node 40 records the information of the local file indicated by the end file identifier.
Hash index: the hash index is used for recording the index relation between a data table and the corresponding starting node. For example, for data table a, in the hash list, the node identification information recorded in the node record space is "30-50-78-98", then it can be seen that node 30 records a start local file, node 50 records a first intermediate local file, node 78 records a second intermediate local file, and node 98 records an end local file. An index relationship between data table a and the node 30 may be constructed and added to the hash index.
File cache item: the file cache item comprises a spare file cache block corresponding to each spare node, if a certain spare node is used for recording information of a certain local file, the spare node can be used as a used node, the spare file cache block corresponding to the used node can be obtained in the file cache item, and the information of the local file (such as the calculated local file name, remote attribute information of a storage data block and file name generation rules) is recorded in the spare cache block.
It should be understood that, through the index information of the shared memory, after receiving a data query request for a certain service data in a certain data table, the target computing device may first traverse the hash index in the index information to determine whether the data table is cached in the local disk, if the hash index includes the data table, the target computing device may determine node identifiers of relevant local file information recorded with the data table based on the hash index and the hash list together, through the node identifiers, may determine file cache blocks in which specific information of each local file is recorded in a file cache item, may restore local file names through the specific information of files recorded in the file cache blocks, and then may find corresponding local files through the local file names, so as to read corresponding service data from the local files. If the hash index does not contain the data table, the target computing device may read the requested service data from the remote storage data block, and accordingly, the target computing device may cache the data block where the service data is located into the local disk in the manner described above, and add the relevant file cache index information into the index information.
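The lookup path described above (hash index, then hash list, then file cache item, then restored local file name) can be modeled with ordinary dictionaries. This is a simplified sketch: the class and method names are hypothetical, real shared memory is replaced by in-process dictionaries, and the "." splicing rule is an assumption.

```python
class SharedCacheIndex:
    def __init__(self):
        self.hash_index = {}        # data table -> identifier of start node
        self.hash_list = {}         # data table -> ordered node identifiers
        self.file_cache_items = {}  # node identifier -> (remote attrs, file id)

    def record(self, table, node_id, remote_attrs, file_id):
        # Append the node to the table's node record space; the first node
        # recorded for a table becomes its entry in the hash index.
        self.hash_list.setdefault(table, []).append(node_id)
        self.hash_index.setdefault(table, node_id)
        self.file_cache_items[node_id] = (tuple(remote_attrs), file_id)

    def lookup(self, table):
        # Return the restored local file names, or None on a cache miss.
        if table not in self.hash_index:
            return None
        names = []
        for node_id in self.hash_list[table]:
            attrs, file_id = self.file_cache_items[node_id]
            names.append(".".join(str(p) for p in (*attrs, file_id)))
        return names


# Hypothetical usage mirroring the node 30 / node 40 example.
index = SharedCacheIndex()
index.record("table_a", 30, (16384, 3, 7, 0), 1)
index.record("table_a", 40, (16384, 3, 7, 0), 2)
cached_names = index.lookup("table_a")
```

A return value of None corresponds to the case in which the hash index does not contain the data table, so the data must be read from the remote storage database instead.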
It should be understood that, because the shared memory hash list is used as the cache index, all concurrent query processes on the same computing device can share the shared memory hash list and the local cache files of that computing device, which saves storage space. The maximum size of the shared memory can be determined as shown in formula (5):
formula (5)
Wherein the quantities in formula (5) characterize, respectively: the maximum size of the shared memory; the data size occupied by the initialization data in the index information; the maximum number of files that can be cached in the local disk, MAX_COS_FILE_CACHE_NUM (taking a local disk cache space of 1T as an example, with a default file size of 64M there are 1T/64M = 16384 files, so MAX_COS_FILE_CACHE_NUM is 16384); the size occupied by the file cache item; and the size occupied by the hash index.
Based on the index information structure shown in fig. 4, after the storage data block is cached to the local file indicated by the local file name, information of the relevant data table of the storage data block and file information of the local file may also be added to the index information. Referring to fig. 5, fig. 5 is a schematic flow chart of adding file cache index information of a local file storing a data block to a shared cache index table according to an embodiment of the present application. As shown in fig. 5, the flow may include at least the following steps S501 to S505:
In step S501, the local file uniquely indicated by the local file name is determined as the target local file.
In a specific implementation, for convenience of distinction, the local file uniquely indicated by the local file name may be first determined as the target local file (i.e., referred to as the target local file).
Step S502, any free node is obtained from the free node set and used as a target file cache node of a target local file; any free node in the free node set is a preconfigured node for recording file cache information of any file in the local disk.
In a specific implementation, after the start file identifier and the end file identifier are calculated, the number of local files used to cache the stored data block can be determined, and an equal number of free nodes can then be applied for from the free node set (i.e., the spare node list in the embodiment corresponding to fig. 4) to record the local file information. For example, assuming that the start file identifier is 1 and the end file identifier is 5, 5 local files are required to cache the stored data block; each local file can be used as a target local file, and 5 free nodes need to be applied for so that the file information of each local file is recorded in one-to-one correspondence. If the 5 applied free nodes are the free nodes with node identifiers 12, 13, 50, 49, and 43, the 5 free nodes and the 5 local files can be placed in one-to-one correspondence, so that one free node corresponds to one local file, and the free node corresponding to a local file serves as the target file cache node of that local file.
It should be noted that, based on the embodiment corresponding to fig. 4, after the target file cache node of the target local file is applied for, the target file cache node is connected to the head of the used node list, and the node identifier of the target file cache node is added to the node record space, in the hash list, that corresponds to the data table in which the stored data block is located. Based on the hash list, it can then be determined which nodes record information for that data table.
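The node application step can be sketched as follows, using the node identifiers from the example above. The function name and the use of `collections.deque` for the free and used node lists are assumptions of this sketch; the application does not prescribe a particular container.

```python
from collections import deque


def allocate_nodes(free_nodes, used_nodes, hash_list, table, file_ids):
    # Apply for one free node per local file identifier.
    ordered = sorted(file_ids)
    if len(free_nodes) < len(ordered):
        raise RuntimeError("not enough free nodes")
    assignment = {fid: free_nodes.popleft() for fid in ordered}
    # Connect the nodes to the head of the used node list so that smaller
    # file identifiers end up nearer the front.
    for fid in reversed(ordered):
        used_nodes.appendleft(assignment[fid])
    # Record the node identifiers, in order, in the table's record space.
    hash_list.setdefault(table, []).extend(assignment[f] for f in ordered)
    return assignment


# Hypothetical usage: file identifiers 1..5, free nodes 12, 13, 50, 49, 43.
free = deque([12, 13, 50, 49, 43, 7])
used = deque([99])
hash_list = {}
assigned = allocate_nodes(free, used, hash_list, "table_a", [1, 2, 3, 4, 5])
```

After the call, node 99 (already in use before the call) sits behind the five newly connected nodes, and node 7 remains available in the free node list.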
Step S503, adding the remote attribute information into the target file cache node to obtain a filling file cache node.
In a specific implementation, adding the remote attribute information to the target file cache node in fact means adding the remote attribute information of the stored data block, the local file identifier of the target local file, and the file generation rule (such as the rule that the remote attribute information and the local file identifier are spliced) together to the target file cache block that corresponds to the target file cache node in the file cache item. Because the file cache block corresponding to each free node in the embodiment corresponding to fig. 4 actually belongs to part of the memory space of that free node, this is regarded as adding the remote attribute information to the target file cache node. The target file cache node (target file cache block) to which the remote attribute information, the local file identifier, and the file generation rule have been added may be referred to as a filling file cache node (or filling file cache block).
Step S504, a target data table of the target business data in the remote storage database is obtained, and an index relation between the target data table and the filling file cache node is constructed.
In a specific implementation, the data table in which the target service data is located in the remote storage database may be obtained and referred to as the target data table, and an index relationship between the target data table and the filling file cache node may be constructed. In essence, based on the embodiment corresponding to fig. 4, the node identifier of the target file cache node is added to the hash list, and adding the node identifier can be regarded as constructing the index relationship.
Step S505, adding the index relation between the target data table and the filling file cache node to a shared cache index table of the target computing device; and the remote attribute information in the shared cache index table is used for determining the local file name of the corresponding cache of the storage data block in the target computing device based on the remote attribute information and the remote storage information in the shared cache index table when the target computing device queries the target service data, and reading the target service data from the local file uniquely indicated by the local file name.
Specifically, the shared cache index table may correspond to the index information structure in the embodiment corresponding to fig. 4. Through the index relationship between the target data table and the filling file cache node, when the target computing device queries the target service data, it can determine, based on the remote attribute information in the shared cache index table and the remote storage information, the local file name under which the stored data block is cached in the target computing device, and read the target service data from the local file uniquely indicated by that local file name.
For ease of understanding, take the case where the target computing device receives a data query request sent by a query object. After receiving the request, if the target computing device parses it as a request to query the target service data from the target data table, the target computing device can determine, based on the request, the storage data block in which the target service data is located in the remote storage database, and then acquire the remote storage information and remote attribute information of that storage data block. Next, the target computing device may traverse the shared cache index table to check whether it contains the target data table. If it does, the relevant service data of the data table can be considered to be stored in the local disk; the filled file cache node that has an index relationship with the target data table can then be obtained from the shared cache index table, and the file generation rule can be read from it (the remote attribute information may also be cached in the filled file cache node). Based on the remote storage information and the default file size, the target computing device can calculate the local file identifier, and then splice the local file identifier and the remote attribute information according to the file generation rule to restore the local file name. Finally, the target computing device may read the target service data from the local file indicated by the local file name and return it to the query object.
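The query path described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the `RemoteAttr` class, the function names, the 64 MiB default file size, and the use of a plain dictionary in place of the shared cache index table are all hypothetical; the name layout follows the `relfilenode_attrnum.shard.extraseqid.fileid` pattern mentioned elsewhere in the description.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

DEFAULT_FILE_SIZE = 64 * 1024 * 1024  # assumed default local file size, in bytes


@dataclass(frozen=True)
class RemoteAttr:
    """Hypothetical container for the remote attribute information."""
    relfilenode: int
    attrnum: int
    shard: int
    extraseqid: int


def local_file_name(attr: RemoteAttr, file_id: int) -> str:
    """Splice remote attributes and a file identifier (assumed generation rule)."""
    return f"{attr.relfilenode}_{attr.attrnum}.{attr.shard}.{attr.extraseqid}.{file_id}"


def resolve_local_files(index_table: Dict[str, RemoteAttr], table_name: str,
                        remote_offset: int, remote_length: int,
                        file_size: int = DEFAULT_FILE_SIZE) -> Optional[List[str]]:
    """Return the local file names that cache a block, or None on a cache miss."""
    attr = index_table.get(table_name)
    if attr is None:
        return None  # table not indexed: the caller would read from remote storage
    # Start and end file identifiers are integer quotients by the file size.
    start_id = remote_offset // file_size
    end_id = (remote_offset + remote_length) // file_size
    return [local_file_name(attr, fid) for fid in range(start_id, end_id + 1)]
```

A real index would go through the hash index, used node list, and file cache blocks rather than a single dictionary lookup; the dictionary only stands in for "is this data table indexed, and with which attributes".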
In the embodiment of the application, a shared-memory structure, namely the shared cache index table, is created. When data needs to be read, the computing device can read the local shared cache index table and use it to find the specific file in the local disk where the data is stored, which improves the efficiency of reading local data. If the data cannot be found in the shared cache index table, it can be read from the remote storage database and cached locally, so that the next query can read it directly from the local cache.
Further, for ease of understanding, please refer to fig. 6, which is a schematic diagram of a logic architecture of a computing device for caching data according to an embodiment of the present application. The logic architecture uses the shared cache index table as the index information structure in the embodiment corresponding to fig. 4. Based on this logic architecture, the flow of caching data locally may at least include:
1) First, the data reading component in the computing device receives a data query request sent by a query object (a request to read certain service data from a certain data table), and determines, based on the request, which data block (silo) the requested service data belongs to;
2) Then, the data reading component acquires the remote storage information and the remote attribute information of the data block, and calculates, based on formula (1) to formula (4), the local file identifier under which the data block is cached in the local disk and the file offset;
3) The data reading component traverses the hash index in the shared cache index table to check whether it contains the data table to be queried. If the data table is not present, the computing device needs to query the service data remotely from the remote storage database, and can cache the service data locally so that the next query reads it directly from the local cache. To do so, the node application component applies for a corresponding number of free nodes from the free node list, based on the local file identifiers calculated by the data reading component;
4) The node connection component then connects the applied free nodes to the head of the used node list;
5) The node adding component adds the identifiers of the applied free nodes to the hash list;
6) Next, the information filling component obtains the file cache blocks corresponding to the applied free nodes from the file cache item, and adds the relevant file information (such as the file generation rule) to the file cache blocks;
7) Finally, the data caching component finds the corresponding local file according to the local file name, and caches the data block at the corresponding cache interval position according to the calculated file offset.
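Steps 3) to 6) of the caching flow above amount to bookkeeping over four structures: the free node list, the used node list, the hash list, and the file cache blocks. The sketch below models them with ordinary Python containers. It is illustrative only; the class name `CacheIndex`, its method, and the use of in-process lists and dictionaries instead of shared memory are assumptions.

```python
from typing import Dict, List


class CacheIndex:
    """Toy model of the index information structure (not shared memory)."""

    def __init__(self, node_count: int) -> None:
        self.free_nodes: List[int] = list(range(node_count))   # free node list
        self.used_nodes: List[int] = []                        # head is index 0
        self.hash_index: Dict[str, List[int]] = {}             # table -> node ids
        self.file_cache_blocks: Dict[int, str] = {}            # node id -> file info

    def register(self, table_name: str, file_infos: List[str]) -> List[int]:
        """Apply for one free node per local file and record its file info."""
        if len(file_infos) > len(self.free_nodes):
            raise RuntimeError("not enough free nodes; eviction would be needed")
        node_ids = []
        for info in file_infos:
            node_id = self.free_nodes.pop()            # step 3: apply for a free node
            self.used_nodes.insert(0, node_id)         # step 4: connect to list head
            self.hash_index.setdefault(table_name, []).append(node_id)  # step 5
            self.file_cache_blocks[node_id] = info     # step 6: fill file info
            node_ids.append(node_id)
        return node_ids
```

For a block that spans two local files, `register("table_b", [info_for_file_8, info_for_file_9])` would consume two free nodes and leave both at the head of the used node list.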
Further, after data has been cached locally in the computing device through the caching process in the embodiment corresponding to fig. 6, when another data query request asks to read that data again, the computing device hits the local cache, reads the corresponding service data locally, and returns it to the query object. For example, assuming the relevant service data of data table B has already been cached locally, the process of querying data of table B from the local cache may at least include:
1) First, the computing device receives a data query request that asks to query part of the service data from data table B; based on the request, the data blocks to which the service data belongs are obtained, and the subsequent processing is executed for each of these data blocks;
2) For any data block (assumed to be silo 7), the local file identifiers under which it is cached (assumed to be 8 for the start file identifier and 9 for the end file identifier) and the file offsets in the local files are calculated based on the remote storage information of the data block and the default file size;
3) The hash index is traversed to confirm whether the service data of data table B is cached locally; after confirming that it is, the hash list is traversed to determine each node identifier recorded for data table B (assumed to be node identifiers 2 and 3);
4) In the used node list, the used nodes with node identifiers 2 and 3 are hit, and these used nodes are moved to the head of the LRU list;
5) The file cache blocks corresponding to node identifiers 2 and 3 are read to acquire the file information related to the local file name of the start local file and the file information related to the local file name of the end local file;
6) The local file names are spliced from the information recorded in the file cache blocks;
7) The specific cache interval position is located in the local files indicated by the local file names according to the file offsets, and the corresponding service data is read from the cache interval position.
It should be noted that, for the service data cached in the local disk, a cache replacement algorithm may be used to remove cold data with low query heat (such as service data that has not been queried within a certain period of time) from the local cache, so as to save local cache space. For example, the least recently used (LRU) algorithm may be used to replace service data in the local cache. Specifically, each time a data block is cached, the free node applied for that data block is connected to the head of the used node list, which marks the data block corresponding to that used node as hot data that has just been cached. Likewise, in each query, once some service data is hit, the corresponding used node is moved to the head of the used node list, which marks that service data as hot data that has just been queried. Because the used nodes corresponding to frequently queried hot data are continually moved to the head of the used node list, the nodes at the tail of the list can be regarded as cold data with low query heat, and all cached data associated with the tail can be removed.
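The head-insert, move-on-hit, evict-from-tail behaviour described above can be illustrated with Python's `collections.OrderedDict`. This is a hedged sketch: the class name `UsedNodeList`, the fixed `capacity` parameter, and the mapping from node identifier to local file name are assumptions made for the example, and the description does not state when eviction is triggered.

```python
from collections import OrderedDict
from typing import List


class UsedNodeList:
    """LRU-ordered used node list; the first entry is the hottest node."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity  # assumed limit on the number of used nodes
        self._nodes: "OrderedDict[int, str]" = OrderedDict()  # node id -> file name

    def add(self, node_id: int, file_name: str) -> List[str]:
        """Connect a newly cached node to the head; return evicted file names."""
        self._nodes[node_id] = file_name
        self._nodes.move_to_end(node_id, last=False)  # move to the head
        evicted = []
        while len(self._nodes) > self.capacity:
            _, cold_file = self._nodes.popitem(last=True)  # tail holds cold data
            evicted.append(cold_file)
        return evicted

    def hit(self, node_id: int) -> str:
        """Move a queried node back to the head and return its file name."""
        self._nodes.move_to_end(node_id, last=False)
        return self._nodes[node_id]

    def order(self) -> List[int]:
        """Node identifiers from head (hot) to tail (cold)."""
        return list(self._nodes)
```

The caller would delete the local files named in the list returned by `add` and return the corresponding nodes to the free node list.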
It should be noted that the data caching scheme provided in the present application does not lose cached data after a system restart. For example, after the system is restarted, the computing device may read all local files under the cache directory, skipping files with the tmp suffix (these are damaged files or files whose caching was not completed), and restore the file cache item according to the local file names (relfilenode_attrnum.shard.extraseqid.fileid). The data in the shared cache index table can then be rebuilt following the normal cache flow, so that the cache that existed before the last system exit remains available. In addition, DML does not affect the data caching in the application. A typical remote storage database uses an append-only columnar storage format, so a delete operation only modifies object metadata and leaves the user data unchanged. The application caches the user's service data; deleted data is simply no longer accessed, and its cache entries are gradually replaced out. Data added by an insert operation goes through the cache logic as newly written data. An update operation is equivalent to delete plus insert, so the cache is not affected either. After a DDL operation changes the data table, a new hash index is created according to accesses through the new relfilenode. Cache files with the tmp suffix are skipped by automatic recovery when the computing node restarts. Because the old relfilenode is no longer used, its cache entries are gradually replaced out without any special handling.
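The restart recovery described above can be sketched as a directory scan that skips temporary files and parses the remaining names. The sketch assumes that the temporary suffix is literally `.tmp` and that every field of `relfilenode_attrnum.shard.extraseqid.fileid` is a decimal integer; neither detail is stated in the description, and the function name is hypothetical.

```python
import os
import re
from typing import Dict, List

# Assumed pattern for "relfilenode_attrnum.shard.extraseqid.fileid".
NAME_RE = re.compile(r"^(\d+)_(\d+)\.(\d+)\.(\d+)\.(\d+)$")


def recover_cache_entries(cache_dir: str) -> List[Dict[str, int]]:
    """Rebuild file cache entries from the file names found in cache_dir."""
    entries = []
    for name in sorted(os.listdir(cache_dir)):
        if name.endswith(".tmp"):
            continue  # damaged or incompletely cached file: skip it
        match = NAME_RE.match(name)
        if match is None:
            continue  # not a cache file under the assumed naming rule
        relfilenode, attrnum, shard, extraseqid, fileid = map(int, match.groups())
        entries.append({
            "relfilenode": relfilenode,
            "attrnum": attrnum,
            "shard": shard,
            "extraseqid": extraseqid,
            "fileid": fileid,
        })
    return entries
```

Each recovered entry would then be fed through the normal caching flow to repopulate the shared cache index table.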
The data caching scheme provided by the present application can be applied to TDSQL-A, a distributed OLAP database with separated storage and compute. On a cache hit, it provides the processing performance of a local cache, and the cost of the local cache is much lower than that of an integrated storage-compute architecture, so the storage cost of the integrated architecture can be largely saved. No third-party cache system needs to be introduced for the local cache, which also saves third-party maintenance cost. In terms of caching performance, a TPC-H test on a 100G data volume shows that, compared with the traditional storage-compute separated system, the storage-compute separated system based on this data caching scheme has almost no difference in performance, while saving storage cost and improving data query efficiency.
Further, referring to fig. 7, fig. 7 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. The data processing apparatus may be a computer program (including program code) running in a computer device; for example, the data processing apparatus is application software. The data processing apparatus may be used to perform the method shown in fig. 3 (that is, the apparatus is applied to a storage-compute separation architecture). As shown in fig. 7, the data processing apparatus 1 may include: a block acquisition module 11, an information acquisition module 12, a file name determination module 13, and a block cache module 14.
A block obtaining module 11, configured to obtain a stored data block to be cached; the storage data block refers to any one of the data blocks stored in the remote storage database, and the storage data block contains target service data, wherein the target service data refers to service data with the query heat higher than a preset threshold value;
the information acquisition module 12 is configured to acquire remote storage information and remote attribute information of a storage data block to be cached in a remote storage database; the remote storage information is used for indicating the storage position of the storage data block in the remote storage database, and the remote attribute information is used for indicating the storage attribute of the storage data block in the remote storage database;
a file name determining module 13, configured to determine, according to the remote storage information and the remote attribute information, a local file name of the storage data block corresponding to the cache in the target computing device;
the block cache module 14 is configured to create a local file uniquely indicated by a local file name in a local disk of the target computing device, and cache the stored data block to the local file uniquely indicated by the local file name.
For the specific implementations of the block obtaining module 11, the information obtaining module 12, the file name determining module 13, and the block caching module 14, reference may be made to the descriptions of step S101 to step S104 in the embodiment corresponding to fig. 3, which are not repeated here.
In one embodiment, a specific implementation in which the file name determining module 13 determines, according to the remote storage information and the remote attribute information, the local file name under which the storage data block is cached in the target computing device includes:
acquiring a default file size configured for a local disk; the default file size is used for designating the size of any one file in the local disk;
determining a local file identifier of a corresponding cache of the stored data block in the target computing device according to the default file size and the remote storage information;
fusing the local file identification with the remote attribute information to obtain fused information;
and determining the fusion information as the local file name of the corresponding cache of the storage data block in the target computing device.
In one embodiment, a specific implementation in which the file name determining module 13 determines, according to the default file size and the remote storage information, the local file identifier under which the storage data block is cached in the target computing device includes:
according to the initial identifier mapping rule, performing initial identifier mapping processing on the remote storage information and the default file size to obtain an initial file identifier;
according to the end identifier mapping rule, performing end identifier mapping processing on the remote storage information and the default file size to obtain an end file identifier;
and determining the local file identifier under which the storage data block is cached in the target computing device according to the initial file identifier and the end file identifier.
In one embodiment, the remote storage information includes an offset of the storage data block within the remote storage file; the remote storage file refers to a file storing the storage data block in a remote storage database;
a specific implementation in which the file name determining module 13 performs initial identifier mapping processing on the remote storage information and the default file size according to the initial identifier mapping rule to obtain the initial file identifier includes:
determining the offset of the storage data block in the remote storage file as a remote offset;
determining a first quotient result of the remote offset with respect to a default file size according to the initial identity mapping rule;
and determining the first quotient result as the initial file identification.
In one embodiment, the remote storage information includes an offset of the storage data block within the remote storage file and an amount of storage data occupied by the storage data block within the remote storage file; the remote storage file refers to a file storing the storage data block in a remote storage database;
a specific implementation in which the file name determining module 13 performs end identifier mapping processing on the remote storage information and the default file size according to the end identifier mapping rule to obtain the end file identifier includes:
determining the offset of the storage data block in the remote storage file as a remote offset, and determining the storage data amount occupied by the storage data block in the remote storage file as a remote data amount;
summing the remote offset and the remote data amount according to the end identifier mapping rule;
determining a second quotient result of the summation result obtained by the summation processing with respect to the default file size;
and determining the second quotient result as the end file identifier.
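The two mapping rules above reduce to integer division. The following sketch shows them with a small worked example; the function names and the toy default file size of 1024 bytes are assumptions chosen so that the numbers are easy to check by hand.

```python
def start_file_id(remote_offset: int, default_file_size: int) -> int:
    """First quotient result: remote offset divided by the default file size."""
    if default_file_size <= 0:
        raise ValueError("default_file_size must be positive")
    return remote_offset // default_file_size


def end_file_id(remote_offset: int, remote_data_amount: int,
                default_file_size: int) -> int:
    """Second quotient result: (offset + data amount) divided by the file size."""
    if default_file_size <= 0:
        raise ValueError("default_file_size must be positive")
    return (remote_offset + remote_data_amount) // default_file_size


# Worked example with a toy file size: a 1000-byte block at remote offset 8500.
# 8500 // 1024 = 8 and (8500 + 1000) // 1024 = 9, so the block spans local
# files 8 and 9.
example_start = start_file_id(8500, 1024)
example_end = end_file_id(8500, 1000, 1024)
```

When the start and end identifiers are equal, the block fits in a single local file; otherwise it crosses a file boundary and the first and second local files differ.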
In one embodiment, the local file name comprises a first local file name and a second local file name;
a specific implementation in which the block cache module 14 creates, in the local disk of the target computing device, the local file uniquely indicated by the local file name includes:
creating a first local file uniquely indicated by a first local file name in a local disk;
creating a second local file uniquely indicated by a second local file name in the local disk;
and determining the first local file and the second local file as the local files uniquely indicated by the local file names.
In one embodiment, the data caching order indicated by the first local file name precedes the data caching order indicated by the second local file name;
a specific implementation in which the block cache module 14 caches the storage data block to the local file uniquely indicated by the local file name includes:
acquiring a default file size configured for a local disk; the default file size is used for designating the size of any one file in the local disk;
performing offset calculation processing on the default file size and the remote storage information according to an initial offset determination rule to obtain an initial file offset of the storage data block in the first local file;
performing offset calculation processing on the default file size and the remote storage information according to an end offset determination rule to obtain an end file offset of the storage data block in the second local file;
and determining the range from the initial file offset in the first local file to the end file offset in the second local file as the cache interval position of the storage data block, and writing the storage data block into the cache interval position.
In one embodiment, the remote storage information includes an offset of the storage data block within the remote storage file; the remote storage file refers to a file storing the storage data block in a remote storage database;
a specific implementation in which the block cache module 14 performs offset calculation processing on the default file size and the remote storage information according to the initial offset determination rule to obtain the initial file offset of the storage data block in the first local file includes:
determining the offset of the storage data block in the remote storage file as a remote offset;
determining a first remainder result of the remote offset with respect to the default file size according to the initial offset determination rule;
and determining the first remainder result as the initial file offset of the storage data block in the first local file.
In one embodiment, the remote storage information includes an offset of the storage data block within the remote storage file and an amount of storage data occupied by the storage data block within the remote storage file; the remote storage file refers to a file storing the storage data block in a remote storage database;
a specific implementation in which the block cache module 14 performs offset calculation processing on the default file size and the remote storage information according to the end offset determination rule to obtain the end file offset of the storage data block in the second local file includes:
determining the offset of the storage data block in the remote storage file as a remote offset, and determining the storage data amount occupied by the storage data block in the remote storage file as a remote data amount;
summing the remote offset and the remote data amount according to the end offset determination rule;
determining a second remainder result of the summation result obtained by the summation processing with respect to the default file size;
and determining the second remainder result as the end file offset of the storage data block in the second local file.
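The offset rules above are the remainder counterparts of the quotient rules for the file identifiers, so both can be obtained together with `divmod`. The sketch below also shows one possible way to cut a block into per-file pieces for writing. The function names are hypothetical, and the splitting strategy is an assumption: the description only states that the block is written to the interval between the initial offset in the first file and the end offset in the second file.

```python
from typing import List, Tuple


def cache_interval(remote_offset: int, remote_data_amount: int,
                   default_file_size: int) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Return ((start_file_id, start_offset), (end_file_id, end_offset))."""
    start = divmod(remote_offset, default_file_size)
    end = divmod(remote_offset + remote_data_amount, default_file_size)
    return start, end


def split_block(data: bytes, remote_offset: int,
                default_file_size: int) -> List[Tuple[int, int, bytes]]:
    """Cut a block into (file_id, offset_in_file, chunk) pieces at file boundaries."""
    pieces = []
    position = remote_offset
    remaining = data
    while remaining:
        file_id, offset_in_file = divmod(position, default_file_size)
        room = default_file_size - offset_in_file  # bytes left in this local file
        chunk = remaining[:room]
        pieces.append((file_id, offset_in_file, chunk))
        remaining = remaining[len(chunk):]
        position += len(chunk)
    return pieces
```

With a toy file size of 1024 bytes, a 1000-byte block at remote offset 8500 yields the interval ((8, 308), (9, 284)): 716 bytes go to local file 8 starting at offset 308, and the remaining 284 bytes go to local file 9 starting at offset 0.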
In one embodiment, after the block buffer module 14 buffers the stored data block to the local file uniquely indicated by the local file name, the data processing apparatus 1 further includes: the index table adding module 15.
An index table adding module 15, configured to determine a local file uniquely indicated by a local file name as a target local file;
the index table adding module 15 is further configured to obtain any spare node from the spare node set as a target file cache node of the target local file; any free node in the free node set is a preconfigured node for recording file cache information of any file in the local disk;
the index table adding module 15 is further configured to add remote attribute information to the target file cache node, so as to obtain a filled file cache node;
the index table adding module 15 is further configured to obtain a target data table where the target service data is located in the remote storage database, and construct an index relationship between the target data table and the filling file cache node;
the index table adding module 15 is further configured to add the index relationship between the target data table and the filled file cache node to a shared cache index table of the target computing device; the remote attribute information in the shared cache index table is used so that, when the target computing device queries the target service data, the local file name under which the storage data block is cached in the target computing device is determined based on the remote storage information and the remote attribute information in the shared cache index table, and the target service data is read from the local file uniquely indicated by the local file name.
For a specific implementation manner of the index table adding module 15, reference may be made to the description of step S501 to step S505 in the embodiment corresponding to fig. 5, which will not be repeated here.
In one embodiment, after the index table adding module 15 adds the index relationship between the target data table and the filling file cache node to the shared cache index table of the target computing device, the data processing apparatus further includes: a data query module 16.
A data query module 16, configured to receive a data query request sent by a query object; the data query request is used for requesting to query the target service data from the target data table;
The data query module 16 is further configured to determine, based on the data query request, a storage data block in which the target service data is located in the remote storage database, and remote storage information of the storage data block in the remote storage database;
the data query module 16 is further configured to traverse the shared cache index table;
the data query module 16 is further configured to, if the target data table exists in the shared cache index table, obtain the filled file cache node having an index relationship with the target data table from the shared cache index table, and determine the local file name under which the storage data block is cached in the target computing device according to the remote storage information and the remote attribute information recorded in the filled file cache node;
the data query module 16 is further configured to read the target service data from the local file uniquely indicated by the local file name in the local disk, and return the target service data to the query object.
For a specific implementation of the data query module 16, refer to the description of step S505 in the embodiment corresponding to fig. 5, which will not be described herein.
In the embodiment of the application, service data with high query heat that is stored in the remote storage database of a storage-compute separation architecture can be cached in the local disk of the computing device in advance. During a data query, the computing device can then read the relevant service data directly from the local disk instead of querying the remote storage database over the network, which reduces the data transmission delay caused by remote network transmission and improves data query efficiency and data processing efficiency. When caching service data to the local disk, the application caches it by way of the storage data block in which the service data is located in the remote storage database. Specifically, a local file name in the local disk is determined jointly from the remote storage information and the remote attribute information of the storage data block in the remote storage database; the computing device can then create a local file with that name in the local disk and cache the service data in it. Because the file name is derived from the remote storage information and remote attribute information of the storage data block, service data belonging to the same storage data block in the remote storage database is stored in the same local file in the local disk. As a result, a data query does not need to traverse all service data belonging to the same storage data block; the corresponding local file can be located quickly from the remote storage information and the remote attribute information, and the corresponding service data can be read out quickly.
Further, referring to fig. 8, fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 8, the computer device 8000 may include a processor 8001, a network interface 8004, and a memory 8005; in addition, the computer device 8000 further includes a user interface 8003 and at least one communication bus 8002. The communication bus 8002 is used to enable connection and communication between these components. The user interface 8003 may include a display and a keyboard, and optionally may also include standard wired and wireless interfaces. The network interface 8004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 8005 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory. The memory 8005 may optionally also be at least one storage device located remotely from the aforementioned processor 8001. As shown in fig. 8, the memory 8005, which is one type of computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the computer device 8000 shown in fig. 8, the network interface 8004 may provide a network communication function; the user interface 8003 is mainly used to provide an input interface for the user; and the processor 8001 may be used to invoke the device control application program stored in the memory 8005 to implement:
acquiring a storage data block to be cached; the storage data block refers to any one of the data blocks stored in the remote storage database, and the storage data block contains target service data, wherein the target service data refers to service data with the query heat higher than a preset threshold value;
acquiring remote storage information and remote attribute information of a storage data block to be cached in a remote storage database; the remote storage information is used for indicating the storage position of the storage data block in the remote storage database, and the remote attribute information is used for indicating the storage attribute of the storage data block in the remote storage database;
determining a local file name of a corresponding cache of the storage data block in the target computing device according to the remote storage information and the remote attribute information;
and creating a local file uniquely indicated by the local file name in a local disk of the target computing device, and caching the stored data block to the local file uniquely indicated by the local file name.
It should be understood that the computer device 8000 described in the embodiment of the present application may perform the data processing method described in the embodiments corresponding to fig. 3 to 6, and may also implement the data processing apparatus 1 described in the embodiment corresponding to fig. 7, which is not repeated here. The beneficial effects of the same method are likewise not repeated.
Furthermore, it should be noted that the embodiments of the present application further provide a computer readable storage medium, which stores the computer program executed by the aforementioned computer device 8000 for data processing. The computer program includes program instructions, and when a processor executes the program instructions, the data processing method described in the embodiments corresponding to fig. 3 to 6 can be performed, which is therefore not repeated here. The beneficial effects of the same method are likewise not repeated. For technical details not disclosed in the embodiments of the computer readable storage medium according to the present application, please refer to the description of the method embodiments of the present application.
The computer readable storage medium may be an internal storage unit of the data processing apparatus provided in any one of the foregoing embodiments or of the computer device, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the computer device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of the computer device. The computer readable storage medium is used to store the computer program and other programs and data required by the computer device. The computer readable storage medium may also be used to temporarily store data that has been output or is to be output.
In one aspect of the present application, a computer program product is provided that includes a computer program stored in a computer readable storage medium. A processor of a computer device reads the computer program from a computer-readable storage medium, and the processor executes the computer program to cause the computer device to perform the method provided in an aspect of the embodiments of the present application.
The terms first, second and the like in the description and in the claims and drawings of the embodiments of the present application are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the term "include" and any variations thereof is intended to cover a non-exclusive inclusion. For example, a process, method, apparatus, article, or device that comprises a list of steps or elements is not limited to the list of steps or modules but may, in the alternative, include other steps or modules not listed or inherent to such process, method, apparatus, article, or device.
In the present embodiment, the term "module" or "unit" refers to a computer program or a part of a computer program having a predetermined function, and works together with other relevant parts to achieve a predetermined object, and may be implemented in whole or in part by using software, hardware (such as a processing circuit or a memory), or a combination thereof. Also, a processor (or multiple processors or memories) may be used to implement one or more modules or units. Furthermore, each module or unit may be part of an overall module or unit that incorporates the functionality of the module or unit.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The methods and related devices provided in the embodiments of the present application are described with reference to the method flowcharts and/or structure diagrams provided in the embodiments of the present application. Each flow and/or block of the method flowcharts and/or structure diagrams, and combinations of flows and/or blocks in the flowcharts and/or structure diagrams, may be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or structure diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or structure diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or structure diagram block or blocks.
The foregoing disclosure is merely illustrative of the preferred embodiments of the present application and is not intended to limit the scope of the claims of the present application; equivalent changes made in accordance with the claims of the present application therefore still fall within the scope of the present application.

Claims (15)

1. A data processing method, wherein the method is applied to a storage-compute separated architecture, the storage-compute separated architecture comprising a cluster of computing devices and a remote storage database that are independent of each other, the method being performed by a target computing device, the target computing device being any computing device in the cluster of computing devices; the method comprises the following steps:
acquiring a storage data block to be cached; the storage data block refers to any one of the data blocks stored in the remote storage database, and the storage data block contains target service data, wherein the target service data refers to service data with query heat higher than a preset threshold value;
acquiring remote storage information and remote attribute information of the storage data block in the remote storage database; the remote storage information is used for indicating the storage position of the storage data block in the remote storage database, and the remote attribute information is used for indicating the storage attribute of the storage data block in the remote storage database;
determining, according to the remote storage information and the remote attribute information, a local file name for the corresponding cache of the storage data block in the target computing device;
and creating a local file uniquely indicated by the local file name in a local disk of the target computing device, and caching the stored data block to the local file uniquely indicated by the local file name.
2. The method of claim 1, wherein the determining a local file name for the corresponding cache of the stored data block in the target computing device based on the remote storage information and the remote attribute information comprises:
acquiring a default file size configured for the local disk; the default file size is used for specifying the size of any one file in the local disk;
determining a local file identifier of the storage data block corresponding to the cache in the target computing device according to the default file size and the remote storage information;
fusing the local file identification with the remote attribute information to obtain fusion information;
and determining the fusion information as a local file name of the storage data block corresponding to the cache in the target computing device.
3. The method of claim 2, wherein the determining, based on the default file size and the remote storage information, a local file identification of the stored data block corresponding to the cache in the target computing device comprises:
performing initial identification mapping processing on the remote storage information and the default file size according to an initial identification mapping rule to obtain an initial file identification;
according to the ending identification mapping rule, carrying out ending identification mapping processing on the remote storage information and the default file size to obtain ending file identification;
and determining a local file identifier of the corresponding cache of the storage data block in the target computing device according to the initial file identifier and the end file identifier.
4. The method of claim 3, wherein the remote storage information includes an offset of the stored data block within a remote storage file; the remote storage file refers to a file storing the storage data block in the remote storage database;
the performing initial identifier mapping processing on the remote storage information and the default file size according to an initial identifier mapping rule to obtain an initial file identifier comprises:
determining the offset of the storage data block in the remote storage file as a remote offset;
determining a first quotient result of the remote offset with respect to the default file size according to the initial identifier mapping rule;
and determining the first quotient result as the initial file identification.
5. The method of claim 3, wherein the remote storage information includes an offset of the stored data block within a remote storage file and an amount of stored data occupied by the stored data block within the remote storage file; the remote storage file refers to a file storing the storage data block in the remote storage database;
the performing end identifier mapping processing on the remote storage information and the default file size according to an end identifier mapping rule to obtain an end file identifier comprises:
determining the offset of the storage data block in the remote storage file as a remote offset, and determining the storage data volume occupied by the storage data block in the remote storage file as a remote data volume;
summing the remote offset and the remote data volume according to the end identifier mapping rule;
determining a second quotient result of the summation result obtained by the summation process with respect to the default file size;
and determining the second quotient result as the end file identification.
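The identifier mapping of claims 2 to 5 reduces to two integer divisions, which the following minimal Python sketch illustrates. It is not the applicant's implementation: the names `RemoteBlock`, `file_id_range`, and `local_file_names` are hypothetical, the underscore-joined name format merely stands in for the "fusing" step (whose concrete format the claims do not specify), and treating every identifier from the initial one to the end one as a separate local file is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RemoteBlock:
    """Remote storage information of one storage data block (hypothetical type)."""
    remote_offset: int  # offset of the block within the remote storage file
    remote_length: int  # data volume the block occupies in that file


def file_id_range(block: RemoteBlock, default_file_size: int) -> Tuple[int, int]:
    """Return (initial_id, end_id): the first and second quotient results."""
    if default_file_size <= 0:
        raise ValueError("default_file_size must be positive")
    # Claim 4: quotient of the remote offset with respect to the default file size.
    initial_id = block.remote_offset // default_file_size
    # Claim 5: quotient of (remote offset + remote data volume).
    end_id = (block.remote_offset + block.remote_length) // default_file_size
    return initial_id, end_id


def local_file_names(block: RemoteBlock, default_file_size: int,
                     remote_attr: str) -> List[str]:
    """Fuse each file identifier with the remote attribute information.

    The "<attr>_<id>" format is an assumption made for illustration only.
    """
    initial_id, end_id = file_id_range(block, default_file_size)
    return [f"{remote_attr}_{i}" for i in range(initial_id, end_id + 1)]
```

For a block at remote offset 100 with data volume 50 and a default file size of 64, the sketch yields identifiers 1 and 2. Because the end identifier is computed from the sum exactly as recited, a block that ends precisely on a file boundary maps its end identifier to the following file.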
6. The method of claim 1, wherein the local file name comprises a first local file name and a second local file name;
the creating the local file uniquely indicated by the local file name in the local disk of the target computing device includes:
creating a first local file uniquely indicated by the first local file name in the local disk;
creating a second local file uniquely indicated by the second local file name in the local disk;
and determining the first local file and the second local file as the local files uniquely indicated by the local file names.
7. The method of claim 6, wherein the data caching order indicated by the first local file name precedes the data caching order indicated by the second local file name;
the caching the stored data block to the local file uniquely indicated by the local file name comprises the following steps:
acquiring a default file size configured for the local disk; the default file size is used for specifying the size of any one file in the local disk;
performing offset calculation processing on the default file size and the remote storage information according to an initial offset determination rule to obtain an initial file offset of the storage data block in the first local file;
performing offset calculation processing on the default file size and the remote storage information according to an end offset determination rule to obtain an end file offset of the storage data block in the second local file;
and determining the interval from the initial file offset in the first local file to the end file offset in the second local file as the cache interval position of the storage data block, and writing the storage data block into the cache interval position.
8. The method of claim 7, wherein the remote storage information includes an offset of the stored data block within a remote storage file; the remote storage file refers to a file storing the storage data block in the remote storage database;
the performing offset calculation processing on the default file size and the remote storage information according to an initial offset determination rule to obtain an initial file offset of the storage data block in the first local file comprises:
determining the offset of the storage data block in the remote storage file as a remote offset;
determining a first remainder result of the remote offset with respect to the default file size according to the initial offset determination rule;
and determining the first remainder result as the initial file offset of the storage data block in the first local file.
9. The method of claim 7, wherein the remote storage information includes an offset of the stored data block within a remote storage file and an amount of stored data occupied by the stored data block within the remote storage file; the remote storage file refers to a file storing the storage data block in the remote storage database;
the performing offset calculation processing on the default file size and the remote storage information according to an end offset determination rule to obtain an end file offset of the storage data block in the second local file comprises:
determining the offset of the storage data block in the remote storage file as a remote offset, and determining the storage data volume occupied by the storage data block in the remote storage file as a remote data volume;
summing the remote offset and the remote data volume according to the end offset determination rule;
determining a second remainder result of the summation result obtained by the summation process with respect to the default file size;
and determining the second remainder result as the end file offset of the storage data block in the second local file.
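The offset rules of claims 7 to 9 are the remainder counterparts of the quotient rules in claims 4 and 5, so a single `divmod` yields both the file identifier and the in-file offset. The sketch below is illustrative only: `cache_interval` and `split_for_cache` are hypothetical names, and splitting the block at file boundaries is one assumed way of writing it into the cache interval, which the claims do not spell out.

```python
from typing import Iterator, Tuple


def cache_interval(remote_offset: int, remote_length: int,
                   default_file_size: int) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Return ((first_file_id, initial_offset), (second_file_id, end_offset))."""
    if default_file_size <= 0:
        raise ValueError("default_file_size must be positive")
    # Claim 8: the initial file offset is the remainder of the remote offset.
    first_id, initial_off = divmod(remote_offset, default_file_size)
    # Claim 9: the end file offset is the remainder of (offset + data volume).
    second_id, end_off = divmod(remote_offset + remote_length, default_file_size)
    return (first_id, initial_off), (second_id, end_off)


def split_for_cache(data: bytes, remote_offset: int,
                    default_file_size: int) -> Iterator[Tuple[int, int, bytes]]:
    """Yield (file_id, offset_in_file, chunk) pieces covering the cache interval.

    Assumption: each local file holds at most default_file_size bytes, so a
    block crossing a boundary is cut there and continues at offset 0 of the
    next file.
    """
    if default_file_size <= 0:
        raise ValueError("default_file_size must be positive")
    pos = 0
    while pos < len(data):
        file_id, off = divmod(remote_offset + pos, default_file_size)
        take = min(default_file_size - off, len(data) - pos)
        yield file_id, off, data[pos:pos + take]
        pos += take
```

With a remote offset of 100, a data volume of 50, and a default file size of 64, the interval runs from offset 36 in file 1 to offset 22 in file 2, and the block is written as a 28-byte piece followed by a 22-byte piece.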
10. The method of claim 1, wherein after said caching the stored data block to the local file uniquely indicated by the local file name, the method further comprises:
determining the local file uniquely indicated by the local file name as a target local file;
obtaining any free node from a free node set as a target file cache node of the target local file; any free node in the free node set is a preconfigured node for recording file cache information of any file in the local disk;
adding the remote attribute information into the target file cache node to obtain a filling file cache node;
acquiring a target data table of the target service data in the remote storage database, and constructing an index relation between the target data table and the filling file cache node;
adding the index relation between the target data table and the filling file cache node to a shared cache index table of the target computing device; wherein, when the target computing device queries the target service data, the remote attribute information in the shared cache index table is used, together with the remote storage information, to determine the local file name of the corresponding cache of the storage data block in the target computing device, and the target service data is read from the local file uniquely indicated by the local file name.
11. The method of claim 10, wherein after adding the index relationship between the target data table and the fill file cache node to the shared cache index table of the target computing device, the method further comprises:
receiving a data query request sent by a query object; the data query request is used for requesting to query the target service data from the target data table;
determining, based on the data query request, the storage data block in which the target service data is located in the remote storage database and the remote storage information of the storage data block in the remote storage database;
traversing the shared cache index table;
if the target data table exists in the shared cache index table, acquiring the filling file cache node with an index relation with the target data table from the shared cache index table, and determining a local file name of the corresponding cache of the storage data block in the target computing device according to the remote attribute information and the remote storage information recorded in the filling file cache node;
and reading the target service data from the local file uniquely indicated by the local file name in the local disk, and returning the target service data to the query object.
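The bookkeeping of claims 10 and 11 can be pictured as a pool of preconfigured free nodes plus a table-keyed index. The following is a hedged sketch, not the claimed structure itself: `FileCacheNode`, `SharedCacheIndex`, `register`, and `lookup_file_names` are hypothetical names, a dictionary lookup replaces the recited traversal of the index table, the "<attr>_<id>" name format is assumed, and the remote storage information is passed in by the caller as derived from the query.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FileCacheNode:
    """Preconfigured node recording file cache information (hypothetical)."""
    remote_attr: Optional[str] = None  # filled in once the node is used


class SharedCacheIndex:
    def __init__(self, free_node_count: int) -> None:
        # Free node set: nodes are configured in advance, then filled on demand.
        self._free: List[FileCacheNode] = [FileCacheNode() for _ in range(free_node_count)]
        # Index relation: target data table -> filled file cache node.
        self._index: Dict[str, FileCacheNode] = {}

    def register(self, table: str, remote_attr: str) -> FileCacheNode:
        """Take a free node, fill it with remote attribute info, and index it."""
        if not self._free:
            raise RuntimeError("no free file cache node available")
        node = self._free.pop()
        node.remote_attr = remote_attr
        self._index[table] = node
        return node

    def lookup_file_names(self, table: str, remote_offset: int, remote_length: int,
                          default_file_size: int) -> Optional[List[str]]:
        """Return the local file names for a query, or None if the table is not indexed."""
        node = self._index.get(table)  # dict lookup instead of a full traversal
        if node is None:
            return None  # assumption: the caller then reads from remote storage
        start_id = remote_offset // default_file_size
        end_id = (remote_offset + remote_length) // default_file_size
        return [f"{node.remote_attr}_{i}" for i in range(start_id, end_id + 1)]
```

A query against an indexed table thus recomputes the same file names that were used when the block was cached, so no per-block name needs to be stored in the index.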
12. A data processing apparatus applied to a storage-compute separated architecture comprising a cluster of computing devices and a remote storage database that are independent of each other, the apparatus being applied to a target computing device that is any computing device in the cluster of computing devices; the apparatus comprises:
the block acquisition module is used for acquiring a storage data block to be cached; the storage data block refers to any one of the data blocks stored in the remote storage database, and the storage data block contains target service data, wherein the target service data refers to service data with query heat higher than a preset threshold value;
the information acquisition module is used for acquiring remote storage information and remote attribute information of the storage data block to be cached in the remote storage database; the remote storage information is used for indicating the storage position of the storage data block in the remote storage database, and the remote attribute information is used for indicating the storage attribute of the storage data block in the remote storage database;
the file name determining module is used for determining a local file name of the storage data block corresponding to the cache in the target computing device according to the remote storage information and the remote attribute information;
and the block caching module is used for creating the local file uniquely indicated by the local file name in the local disk of the target computing device and caching the stored data block to the local file uniquely indicated by the local file name.
13. A computer device, comprising: a processor, a memory, and a network interface;
the processor is connected to the memory and the network interface, wherein the network interface is configured to provide a network communication function, the memory is configured to store a computer program, and the processor is configured to invoke the computer program to cause the computer device to perform the method of any of claims 1-11.
14. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program adapted to be loaded by a processor and to perform the method of any of claims 1-11.
15. A computer program product, characterized in that the computer program product comprises a computer program stored in a computer readable storage medium, the computer program being adapted to be read and executed by a processor to cause a computer device having the processor to perform the method of any of claims 1-11.
CN202410074857.1A 2024-01-18 2024-01-18 Data processing method, device, equipment and readable storage medium Pending CN117591040A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410074857.1A CN117591040A (en) 2024-01-18 2024-01-18 Data processing method, device, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410074857.1A CN117591040A (en) 2024-01-18 2024-01-18 Data processing method, device, equipment and readable storage medium

Publications (1)

Publication Number | Publication Date
CN117591040A (en) | 2024-02-23

Family

ID=89918751

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202410074857.1A Pending CN117591040A (en) | Data processing method, device, equipment and readable storage medium | 2024-01-18 | 2024-01-18

Country Status (1)

Country Link
CN (1) CN117591040A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
EP1831781A2 (en) * | 2004-07-26 | 2007-09-12 | M-Systems Flash Disk Pioneers Ltd. | Method of managing local and remote data storage as a single logical volume
US20200183602A1 (en) * | 2018-12-10 | 2020-06-11 | Veritas Technologies Llc | Systems and methods for storing information within hybrid storage with local and cloud-based storage devices
CN116467275A (en) * | 2023-04-23 | 2023-07-21 | 北京沃东天骏信息技术有限公司 | Shared remote storage method, apparatus, system, electronic device and storage medium

Similar Documents

Publication Publication Date Title
CN106886375B (en) The method and apparatus of storing data
CN104679898A (en) Big data access method
CN102117338B (en) Data base caching method
CN104778270A (en) Storage method for multiple files
CN104731516A (en) Method and device for accessing files and distributed storage system
CN108614837B (en) File storage and retrieval method and device
CN106970958B (en) A kind of inquiry of stream file and storage method and device
CN102780603B (en) Web traffic control method and device
CN113010476B (en) Metadata searching method, device, equipment and computer readable storage medium
CN109766318B (en) File reading method and device
CN113468199A (en) Index updating method and system
CN113590027B (en) Data storage method, data acquisition method, system, device and medium
CN117149777B (en) Data query method, device, equipment and storage medium
CN109299352B (en) Method and device for updating website data in search engine and search engine
CN102724301B (en) Cloud database system and method and equipment for reading and writing cloud data
CN104021137A (en) Method and system for opening and closing file locally through client side based on catalogue authorization
CN109213950B (en) Data processing method and device for browser application of IPTV (Internet protocol television) intelligent set top box
US9164922B2 (en) Technique for passive cache compaction using a least recently used cache algorithm
CN104391947A (en) Real-time processing method and system of mass GIS (geographic information system) data
JP2023531751A (en) Vehicle data storage method and system
CN117591040A (en) Data processing method, device, equipment and readable storage medium
CN114756509B (en) File system operation method, system, device and storage medium
CN107291875B (en) Metadata organization management method and system based on metadata graph
CN115878625A (en) Data processing method and device and electronic equipment
CN111552740B (en) Data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination