CN115221186A - Data query method, system and device and electronic equipment - Google Patents

Data query method, system and device and electronic equipment

Info

Publication number
CN115221186A
CN115221186A (application CN202210646892.7A)
Authority
CN
China
Prior art keywords
data
query request
cache database
database
data query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210646892.7A
Other languages
Chinese (zh)
Inventor
肖文浩
於圣楠
张宇昂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN202210646892.7A
Publication of CN115221186A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20: Information retrieval of structured data, e.g. relational data
    • G06F 16/24: Querying
    • G06F 16/242: Query formulation
    • G06F 16/2433: Query languages
    • G06F 16/245: Query processing
    • G06F 16/2453: Query optimisation
    • G06F 16/24534: Query rewriting; Transformation
    • G06F 16/24539: Query rewriting using cached or materialised query results
    • G06F 16/2455: Query execution
    • G06F 16/24552: Database cache management
    • G06F 16/2458: Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F 16/2471: Distributed queries
    • G06F 16/90: Details of database functions independent of the retrieved data types
    • G06F 16/95: Retrieval from the web
    • G06F 16/953: Querying, e.g. by the use of web search engines

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data query method, system, apparatus, electronic device, and computer-readable storage medium. The method comprises: in response to a cache database receiving a first data query request sent by a computing engine, querying the cache database for result data corresponding to the first data query request; in response to the result data being found in the cache database, sending the result data from the cache database to the computing engine; and in response to the result data not being found in the cache database, querying, by the cache database, a storage database for the result data and sending the result data obtained from the storage database to the computing engine. By deploying the cache database between the computing engine and the storage database, the method reduces the access pressure on the storage database, shortens the data transmission distance, and mitigates the prior-art problems of slow data query, slow data transmission, and low data processing efficiency.

Description

Data query method, system and device and electronic equipment
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data query method, system, apparatus, electronic device, and computer-readable storage medium.
Background
With the development of the digital economy, enterprise data has grown rapidly in scale, and the mismatch between computation volume and storage volume increasingly affects data processing efficiency. A storage-compute separation architecture is currently adopted to address the inflexible configuration and expansion of storage and computing resources. Under such an architecture, however, the computing system must query data remotely from the storage system over a network, and the storage system must transmit data remotely back to the computing system over the network, so slow data query, slow data transmission, and low data processing efficiency remain problems.
Disclosure of Invention
The application provides a data query method, system, apparatus, electronic device, and computer-readable storage medium to address the slow data query, slow data transmission, and low data processing efficiency of existing data processing methods.
An embodiment of the application provides a data query method applied to data query under a storage-compute separation architecture. A cache database is deployed between the computing engine and the storage database of the architecture; the cache database stores part of the data held in the storage database and provides short-circuit data transmission for the computing engine. The method comprises the following steps:
in response to the cache database receiving a first data query request sent by the computing engine, querying the cache database for result data corresponding to the first data query request;
in response to the result data corresponding to the first data query request being found in the cache database, sending, by the cache database, that result data to the computing engine;
in response to the result data corresponding to the first data query request not being found in the cache database, querying, by the cache database, the storage database for that result data, and sending the result data obtained from the storage database to the computing engine.
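The three steps above amount to a read-through cache. Below is a minimal Python sketch of that flow; the class and method names (`StorageDatabase`, `CacheDatabase`, `query`) are invented for illustration, since the patent does not prescribe an implementation.

```python
class StorageDatabase:
    """Stand-in for the remote persistent store."""
    def __init__(self, records):
        self._records = dict(records)

    def query(self, key):
        # Simulates a remote query plus remote network transfer.
        return self._records.get(key)


class CacheDatabase:
    """Stand-in for the cache deployed between engine and storage."""
    def __init__(self, storage):
        self._storage = storage
        self._cache = {}

    def query(self, key):
        # First, look for the result data in the cache itself.
        if key in self._cache:
            return self._cache[key]          # cache hit: short-circuit read
        # On a miss, the cache database queries the storage database
        # and forwards the result to the computing engine.
        result = self._storage.query(key)
        if result is not None:
            self._cache[key] = result        # optional write-back
        return result
```

On a second identical request the result is served from the cache, so no remote query reaches the storage database.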
Optionally, after sending the result data corresponding to the first data query request obtained from the storage database to the computing engine, the method further includes: and storing result data corresponding to the first data query request acquired from the storage database into the cache database.
Optionally, after the result data corresponding to the first data query request obtained from the storage database is stored in the cache database, the method further includes setting a first cache duration threshold for that result data, specifically:
if the storage duration of the result data corresponding to the first data query request in the cache database exceeds the first cache duration threshold, deleting that result data from the cache database.
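The first-cache-duration-threshold rule is, in effect, a time-to-live eviction. A hypothetical Python sketch (the `TtlCache` name and layout are not from the patent):

```python
import time

class TtlCache:
    """Evicts an entry once its storage duration exceeds a threshold."""
    def __init__(self, ttl_seconds):
        self._ttl = ttl_seconds
        self._store = {}                     # key -> (value, stored_at)

    def put(self, key, value):
        self._store[key] = (value, time.monotonic())

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if time.monotonic() - stored_at > self._ttl:
            del self._store[key]             # duration exceeded: delete
            return None
        return value
```

Here eviction happens lazily on the next read; a production cache would typically also sweep expired entries in the background.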
Optionally, the method further includes: in response to a second data query request, sending result data corresponding to the second data query request that is stored in the cache database to the computing engine;
the result data was stored in the cache database when, in response to the first data query request, it was obtained from the storage database; the query time of the first data query request is earlier than that of the second data query request, and the two requests have the same query content.
Optionally, the method further includes: deploying the cache database and the computing engine in the same cluster.
Optionally, the cache database further includes a data sending unit, where the data sending unit is configured to send result data corresponding to the first data query request to the computing engine;
the computing engine further comprises a data receiving unit, and the data receiving unit is used for receiving result data corresponding to the first data query request sent by the cache database;
the method further comprises the following steps: and the data sending unit and the data receiving unit are deployed in a common node mode.
Optionally, the method further includes: storing hot data into the cache database at a preset time frequency, where hot data specifically refers to data whose query count exceeds a preset query count threshold within a preset time period.
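Hot data as defined here can be identified from a query log by counting per-key queries inside the time window. A small illustrative Python helper; the function name and the log format are assumptions, not part of the patent:

```python
from collections import Counter

def identify_hot_data(query_log, window_start, window_end, count_threshold):
    """query_log: iterable of (timestamp, key) pairs.
    Returns the keys queried more than count_threshold times inside
    [window_start, window_end], i.e. the hot data to preload into the cache."""
    counts = Counter(
        key for ts, key in query_log if window_start <= ts <= window_end
    )
    return {key for key, n in counts.items() if n > count_threshold}
```

A scheduler could run this at the preset time frequency and push the returned keys' data into the cache database.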
Optionally, the method further includes setting a second cache duration threshold for the hot data stored in the cache database, specifically: if the storage duration of the hot data in the cache database exceeds the second cache duration threshold, deleting the hot data from the cache database.
Optionally, after the hot data is queried, its storage duration in the cache database is counted anew.
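Restarting the storage-duration count whenever hot data is queried is a sliding expiration: a successful read resets the entry's clock, so frequently queried hot data stays cached while idle data ages out. A hedged Python sketch (names invented for illustration):

```python
import time

class SlidingTtlCache:
    """A TTL cache where a successful read resets the entry's clock."""
    def __init__(self, ttl_seconds):
        self._ttl = ttl_seconds
        self._store = {}                     # key -> (value, stored_at)

    def put(self, key, value):
        self._store[key] = (value, time.monotonic())

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if time.monotonic() - stored_at > self._ttl:
            del self._store[key]
            return None
        # The hot data was queried: count its storage duration anew.
        self._store[key] = (value, time.monotonic())
        return value
```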
Optionally, in response to the cache database receiving the first data query request sent by the computing engine, querying the cache database for result data corresponding to the first data query request specifically comprises:
generating, in the computing engine, an execution plan corresponding to the first data query request, where the execution plan includes: data demand information corresponding to the first data query request, and distribution information of the corresponding result data within the cache database;
sending, by the computing engine, the data demand information and the distribution information to the cache database;
querying, by the cache database, its own store for the result data corresponding to the first data query request according to the data demand information and the distribution information.
Optionally, the execution plan further includes computation information by which the computing engine computes on the result data corresponding to the first data query request; the method further comprises:
in response to the computing engine receiving the result data corresponding to the first data query request sent by the cache database, computing, by the computing engine, on that result data according to the computation information.
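The execution plan described in these optional refinements carries three kinds of information: what data is needed, where it sits in the cache database, and how the engine computes on it. A speculative Python sketch of how the pieces might fit together; the field names and the `sum` computation are invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class ExecutionPlan:
    demand_info: list        # which result data the query needs
    distribution_info: dict  # key -> cache node holding that data
    compute_info: str        # how the engine computes on the results

def execute(plan, cache_nodes):
    # The engine ships demand and distribution info to the cache database,
    # which fetches each item from the node the plan points at.
    results = [cache_nodes[plan.distribution_info[k]][k]
               for k in plan.demand_info]
    # Back on the engine, compute_info drives the final computation.
    if plan.compute_info == "sum":
        return sum(results)
    return results
```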
An embodiment of the present application further provides a data query system, comprising a computing engine, a cache database, and a storage database; the cache database is deployed between the computing engine and the storage database, stores part of the data held in the storage database, and provides short-circuit data transmission for the computing engine;
the calculation engine, comprising: an information transmitting unit; a data receiving unit;
the information sending unit is used for sending a data query request to the cache database;
the data receiving unit is used for receiving result data corresponding to the data query request sent by the cache database;
the cache database comprises: the device comprises a first data query unit, a second data query unit, a data sending unit and a data caching unit;
the first data query unit is used for querying result data corresponding to the data query request from the cache database;
the second data query unit is used for querying result data corresponding to the data query request from the storage database;
the data sending unit is used for sending result data corresponding to the data query request to the computing engine;
the data caching unit is used for caching result data corresponding to the data query request;
the storage database comprises: a data storage unit;
and the data storage unit is used for persistently storing the result data corresponding to the query request.
Optionally, the cache database and the computing engine are deployed in the same cluster.
Optionally, the data sending unit in the cache database and the data receiving unit in the computing engine are deployed on the same node.
An embodiment of the application further provides a data query apparatus applied to data query under the storage-compute separation architecture. A cache database is deployed between the computing engine and the storage database of the architecture; the cache database stores part of the data held in the storage database and provides short-circuit data transmission for the computing engine. The apparatus comprises: a first data query module, a second data query module, and a data sending module;
the first data query module is used for responding to a first data query request sent by a computing engine and querying result data corresponding to the first data query request from the cache database;
the second data query module is used for responding to the result data corresponding to the first data query request which is not queried from the cache database, and querying the result data corresponding to the first data query request from a storage database;
the data sending module is used for, in response to result data corresponding to the first data query request being found in the cache database, sending that result data to the computing engine; and is further used for sending result data corresponding to the first data query request obtained from the storage database to the computing engine.
An embodiment of the present application further provides an electronic device, including: a memory, a processor;
the memory to store one or more computer instructions;
the processor is configured to execute the one or more computer instructions to implement the above-described method.
Embodiments of the present application further provide a computer-readable storage medium having one or more computer instructions stored thereon which, when executed by a processor, implement the method described above.
Compared with the prior art, the data query method provided by the application is applied to data query under a storage-compute separation architecture. A cache database is deployed between the computing engine and the storage database of the architecture; the cache database stores part of the data held in the storage database and provides short-circuit data transmission for the computing engine. The method comprises: in response to the cache database receiving a first data query request sent by the computing engine, querying the cache database for the corresponding result data; in response to the result data being found in the cache database, sending it from the cache database to the computing engine; and in response to the result data not being found in the cache database, querying, by the cache database, the storage database for it, and sending the result data obtained from the storage database to the computing engine.
With the cache database deployed between the computing engine and the storage database, the cache database serves as the first database to be queried and the storage database as the second. When a data query request is received, the cache database is searched for the corresponding result data; if it is found, the result data is sent directly from the cache database to the computing engine; if it is not found, the cache database in turn queries the storage database and sends the result data obtained there to the computing engine. Because a cache hit makes a remote query to the storage database unnecessary, this method reduces the access pressure on the storage database, shortens the data transmission distance, and mitigates the prior-art problems of slow data query, slow data transmission, and low data processing efficiency.
Drawings
FIG. 1 is a diagram of an application system of a data query method provided by an embodiment of the present application;
FIG. 2 is a diagram of an application system of another data query method provided by an embodiment of the present application;
FIG. 3 is a flowchart of a data query method according to a first embodiment of the present application;
FIG. 4 is a signaling flow chart of a data query method according to a second embodiment of the present application;
FIG. 5 is a schematic structural diagram of a data query system according to a third embodiment of the present application;
FIG. 6 is a schematic structural diagram of a data query apparatus according to a fourth embodiment of the present application;
FIG. 7 is a schematic structural diagram of an electronic device according to a fifth embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The application can, however, be implemented in many ways other than those described herein, and those skilled in the art can make similar generalizations without departing from its spirit; the application is therefore not limited to the specific implementations disclosed below.
The terms referred to in the embodiments of the present application are explained below for convenience of understanding.
A storage-compute separation architecture refers to a data processing system in which data storage is separated from computation; under this architecture, data that must be persisted is stored in remote Network Attached Storage (NAS), an object storage system, or a distributed storage system. Before the big-data era, data was processed within a database whose computation and storage were integrated and collectively called a database engine; as the volume of computed data grew, the compute engine was separated from the storage engine, giving greater flexibility to the reasonable configuration and optimized expansion of computing and storage resources.
A computing engine is the computing system under a storage-compute separation architecture: a highly abstract aggregate of computing rules, i.e., a program dedicated to processing data. A user writes interface code in a specified manner, and the computing engine executes it to produce the required result. Commonly used computing engines include MapReduce for batch processing (historical files), Spark for stream processing (real-time data), and Flink for unified stream-batch processing.
Spark is a fast, general-purpose computing engine designed for large-scale data processing. Its computation can be understood as methods on the Spark RDD (Resilient Distributed Dataset), which act on each partition of the RDD. Each partition is processed by one compute task, and partitions can be processed in parallel. Because each transformation of an RDD produces a new RDD, a lineage of dependencies forms between RDDs; when data in a partition is corrupted or lost, Spark can recompute just that partition from the surrounding lineage, without recomputing and repairing all RDDs. Spark RDDs are lazily evaluated: computation starts only when a specific trigger fires, otherwise no result is produced. Spark operators fall broadly into transformation operators, such as map, flatMap, and glom, which do not trigger job submission but only record intermediate processing, and action operators, such as foreach and collect, which trigger execution and submit jobs.
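The lazy-evaluation behavior described above can be mimicked in a few lines of plain Python. The `LazyDataset` class below is an illustrative toy, not the RDD API and requiring no Spark installation: transformations only record lineage, and the action materializes the result.

```python
class LazyDataset:
    """Toy model of an RDD: transformations build lineage, actions compute."""
    def __init__(self, data, lineage=None):
        self._data = data
        self._lineage = lineage or []        # recorded transformations

    def map(self, fn):
        # Transformation: returns a new dataset, computes nothing yet.
        return LazyDataset(self._data, self._lineage + [fn])

    def collect(self):
        # Action: triggers the actual computation over the recorded lineage.
        out = list(self._data)
        for fn in self._lineage:
            out = [fn(x) for x in out]
        return out
```

As with RDDs, chaining `map` calls merely lengthens the lineage; nothing runs until `collect` is called.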
The storage database refers to the storage system under a storage-compute separation architecture, used to persistently store data remote from the computing engine; it may be Network Attached Storage (NAS), an object storage system, or a distributed storage system.
HDFS (Hadoop Distributed File System) is an easily scalable distributed file system that stores very large files with streaming data access and runs on clusters of commodity hardware.
A cache database is a temporary storage system for caching data. Data from the storage database can be held temporarily in the cache database, and when an identical data request arrives, the cached data can be returned directly without querying the storage database, improving data-retrieval efficiency. The cache database can periodically evict data that has been stored too long or is rarely used, and can take in frequently used data.
A cluster is a large server system composed of multiple mutually independent servers connected by a high-speed communication network; each server is a node. The servers communicate with one another and cooperate to serve users. When a user sends a service request, the cluster appears to the user as a single server, though in reality a group of servers provides the service.
Remote reading refers to the process by which a computing engine obtains data from a distant database (for example, from the databases of other clusters); the data must travel over a remote network to reach the engine.
Short-circuit reading refers to the process by which a computing engine obtains data from a nearby database (for example, from a database in its own cluster); the data travels over a short network path, or even needs no network transmission at all (for example, when the data receiving unit in the computing engine and the data sending unit in the database are deployed on the same node of the same cluster), which is equivalent to a local read.
Hot data is online data that compute nodes must access frequently; it places high demands on storage performance and has a degree of timeliness, generally cooling into warm data and then cold data as time passes.
At present, with the rapid development of the digital economy, big data processing has become an essential technical means of supporting it. A storage-compute integrated architecture (e.g., a database engine) that couples data storage with data computation can no longer meet the demands of big data processing. The storage-compute separation architecture, in which the storage system and the computing system are deployed separately (for example, on different servers or server clusters), has gradually become mainstream, since it brings greater flexibility to the reasonable configuration and expansion of storage and computing resources.
Under the integrated architecture, if the storage system cannot meet the storage demand, it must be expanded; because the storage system and the computing system are coupled, expanding storage necessarily expands computation as well, wasting computing resources. Alternatively, to avoid that waste, storage expansion is forgone, and the system runs out of space to store data.
Under the separation architecture, if the storage system cannot meet the storage demand, it must likewise be expanded; but because the storage system and the computing system are decoupled, the storage system can be deployed as an independent system, so expanding it does not expand the computing system. In other words, under the separation architecture, whichever system needs expansion can be expanded on its own, without any effect on the other.
Although the storage-compute separation architecture solves the problem that the computing and storage resources of the original integrated architecture cannot be configured and expanded independently on demand, the separation means the computing system must fetch the data to be computed from a remote storage system, and that fetch travels over the network; when the data volume is very large, remote data transmission takes a long time and degrades overall data processing efficiency.
The flow of a data processing method based on the storage-compute separation architecture comprises the following steps:
firstly, a user side sends a data processing request;
secondly, the computing system, in response to the data processing request sent by the user side, parses the request and obtains an execution plan (which specifies what data the computing system must process, from which node of the storage system that data is to be obtained, how the data is to be computed, and so on);
thirdly, the computing system performs remote data query in the storage system according to the execution plan;
fourthly, the storage system transmits the data to be processed to the computing system through remote network transmission;
and fifthly, the computing system computes the data to be processed according to the execution plan.
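The five steps above can be condensed into a short Python sketch of this baseline (cache-free) pipeline; every class, method, and field name here is invented for illustration:

```python
class ComputingSystem:
    def parse(self, request):
        # Step 2: analyze the request into an execution plan.
        return {"demand": request["table"], "compute": request["op"]}

    def compute(self, op, rows):
        # Step 5: compute on the data to be processed.
        return sum(rows) if op == "sum" else rows


class StorageSystem:
    def __init__(self, tables):
        self._tables = tables

    def query(self, table):
        # Steps 3 and 4: remote query and remote network transfer (simulated).
        return self._tables[table]


def handle_request(request, engine, storage):
    plan = engine.parse(request)                      # step 2
    data = storage.query(plan["demand"])              # steps 3 and 4
    return engine.compute(plan["compute"], data)      # step 5
```

Steps 3 and 4 are exactly where the cache database of this application would be interposed, so that a hit avoids the remote round trip.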
The problems of the existing data processing method based on the storage-compute separation architecture concentrate in the third and fourth steps, i.e., remote data query and remote data transmission. Both incur network transmission costs; when the volume of data to be processed is large, the storage system is heavily accessed, or the storage system must transmit large amounts of data, they seriously reduce data processing efficiency. Improving remote data query and remote data transmission under this architecture is therefore the main focus of optimizing big-data processing based on the storage-compute separation architecture, and an important step in advancing the big-data economy. On that basis, this application provides a data query method that optimizes and improves the storage-compute separation architecture so that data to be processed can be queried and transmitted over a short-circuit network path, or even read locally without the network, greatly relieving network transmission pressure and mitigating the slow data query, slow data transmission, and low data processing efficiency caused by remote network transmission.
The data query method, system, apparatus, electronic device, and computer-readable storage medium described in this application are further described in detail below with reference to specific embodiments and accompanying drawings.
FIG. 1 is a diagram of an application system of a data query method provided in an embodiment of the present application. As shown in FIG. 1, the system includes a user side 101 and a server side 102, communicatively connected through a network. The user side 101 may be an enterprise terminal or a third-party big-data analysis server, and sends data processing requests to the server side 102 from a computer terminal such as a laptop or desktop computer. The server side 102 includes a storage system and a computing system under a storage-compute separation architecture and comprises multiple servers or server clusters on which those systems are deployed. The data query method provided by the present application is deployed on the server side 102. In response to a data processing request sent by the user side 101, the server side 102 derives the corresponding data query plan, data computation plan, and so on, queries and obtains the data to be processed according to the data query method provided by the application, then computes on that data and sends the computation result back to the user side 101.
Fig. 2 is a diagram of an application system of another data query method provided in an embodiment of the present application. As shown in fig. 2, the system includes a plurality of user sides 201 and a server side 202, which are communicatively connected through a network. The plurality of user sides 201 may include a plurality of enterprise sides and may also include a third-party big data analysis service side, and may send a plurality of data processing requests to the server side 202 through computer terminals. The server side 202 includes a storage system and a computing system under a storage-compute separation architecture, and comprises a plurality of servers or server clusters on which the storage system and the computing system are deployed; these may be cloud servers or cloud server clusters. The server side 202 is deployed with the data query method provided by the present application. In response to the plurality of data processing requests sent by the plurality of user sides 201, the server side 202 performs data query, data acquisition, and data calculation for each request, and correspondingly sends a plurality of calculation results to the plurality of user sides 201.
The first embodiment of the application provides a data query method. The data query method provided by this embodiment is based on a storage-compute separation architecture and involves a computing engine, a storage database, and a cache database, where the computing engine and the storage database correspond to the computing system and the storage system under the storage-compute separation architecture; the two are remotely and separately deployed in different server clusters. A cache database is arranged between the computing engine and the storage database in the storage-compute separation architecture and plays the role of an intermediate bridge between them. The cache database may be deployed in the same cluster as the computing engine; if not, it should be deployed in a server or server cluster near the server cluster where the computing engine is located, so as to provide short-circuit data transmission for the computing engine.
The cache database stores a part of the data stored in the storage database. This part of the data may include result data queried from the storage database in response to a data query request, and may further include frequently queried data.
Fig. 3 is a flowchart of a data query method provided in this embodiment.
As shown in fig. 3, the data query method provided in this embodiment includes the following steps:
step S301, in response to the cache database receiving a first data query request sent by the computing engine, querying result data corresponding to the first data query request from the cache database.
The first data query request may be a data query request sent by a user side, or may be a data query request acquired in response to a data processing request sent by the user side.
In response to the first data query request, the method provided in this embodiment takes the cache database as the database queried first in order, and queries the cache database for result data corresponding to the first data query request. The specific method is as follows: in response to the first data query request, the computing engine sends the first data query request to the cache database, and the cache database checks whether result data corresponding to the first data query request is cached in the cache database.
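The cache-first query order described above can be sketched as follows; `TieredQueryService`, `cache_db`, and `storage_db` are hypothetical names introduced for illustration only, not components defined in this application.

```python
class TieredQueryService:
    """Illustrative sketch: the cache database is queried first; on a
    miss, the cache database itself queries the remote storage database."""

    def __init__(self, cache_db, storage_db):
        self.cache_db = cache_db      # co-located with the computing engine
        self.storage_db = storage_db  # remote persistent storage

    def query(self, request_key):
        # First in order: look in the cache database.
        result = self.cache_db.get(request_key)
        if result is not None:
            return result
        # Cache miss: fall back to the storage database.
        return self.storage_db.get(request_key)
```

In this sketch plain dictionaries stand in for the two databases; the point is only the fixed query order, not the storage mechanism.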
The embodiment provides a more specific implementation manner for querying result data corresponding to the first data query request from the cache database, and the implementation manner includes the following steps:
step S301-1, generating an execution plan corresponding to the first data query request in the computing engine, where the execution plan includes: the data demand information corresponding to the first data query request, and the distribution information of the result data corresponding to the first data query request in the cache database.
When a user side sends a data query request, the data query request is obtained by a computing engine and is analyzed into an execution plan corresponding to the data query request.
The execution plan may be the content that the computing engine and the cache database need to execute for the data query request, for example: the calculation method to be executed by the computing engine, the data content to be acquired from the cache database, and the like.
The execution plan described in this embodiment at least includes: data demand information corresponding to the first data query request, and distribution information of result data corresponding to the first data query request in the cache database.
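As a rough illustration, the two parts of the execution plan could be modeled as a small record; the class and field names below are assumptions made for the sketch, not terms from this application.

```python
from dataclasses import dataclass

@dataclass
class ExecutionPlan:
    # What result data the first data query request needs
    # (content, features, quantity, ...).
    demand_info: dict
    # Which cache-database partition the result data is expected
    # to be distributed in.
    distribution_info: str
    # How the computing engine should process the result data.
    compute_info: str = ""

# Hypothetical plan parsed from a data query request.
plan = ExecutionPlan(
    demand_info={"table": "events", "columns": ["uid", "ts"], "limit": 1000},
    distribution_info="game_data_partition",
)
```

The demand and distribution fields are what the engine forwards to the cache database in step S301-2, while the compute field stays with the engine.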
The data requirement information may be related information of the result data to be queried corresponding to the first data query request, such as: content information, feature information, quantity information, etc.
The distribution information may be the approximate distribution of the result data corresponding to the first data query request in the cache database. The data storage unit of the cache database may be divided into several partitions of different data types, each partition storing the corresponding type of data, such as video data partitions, image data partitions, and game data partitions. The partition to which the result data corresponding to the first data query request belongs is determined from the first data query request, and that partition is the distribution information of the result data in the cache database. In the subsequent result data query process, only the partition indicated by the distribution information needs to be checked for the cached result data, rather than scanning the entire data storage unit of the cache database, which shortens the time needed to query data in the cache database.
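A minimal sketch of this partition-scoped lookup, assuming a type-keyed partition map (all names hypothetical): the distribution information narrows the search to one partition instead of a full scan.

```python
# Hypothetical partitioned cache: each partition stores one data type.
CACHE_PARTITIONS = {
    "video": {"v001": "video bytes"},
    "image": {"i001": "image bytes"},
    "game":  {"g001": "game bytes"},
}

def lookup(distribution_info, key):
    """Search only the partition named by the distribution information,
    never the whole data storage unit."""
    partition = CACHE_PARTITIONS.get(distribution_info, {})
    return partition.get(key)
```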
Specifically, in this step, a data parsing unit in the computing engine parses the first data query request, and generates an execution plan corresponding to the first data query request.
Step S301-2, the calculation engine sends the data demand information and the distribution information to the cache database.
The computing engine sends to the cache database the portion of the parsed execution plan that the cache database needs to execute, namely the data demand information corresponding to the first data query request and the distribution information of the result data corresponding to the first data query request in the cache database, so as to instruct the cache database to perform the next query step.
Specifically, in this step, an information sending unit in the computing engine sends the data demand information and the distribution information to the cache database.
Step S301-3, the cache database queries result data corresponding to the first data query request in the cache database according to the data demand information and the distribution information.
The cache database acquires the detailed information of the data to be queried through the received data demand information, and acquires the approximate position of the data to be queried in the cache database, namely the storage partition of the data to be queried, through the received distribution information.
Therefore, the cache database can query the result data corresponding to the first data query request in the cache database according to the data demand information and the distribution information.
Specifically, a first data query unit in the cache database queries, according to the data demand information and the distribution information, result data corresponding to the first data query request in a data cache unit in the cache database.
Step S302, in response to the result data corresponding to the first data query request being queried from the cache database, the cache database sends the result data corresponding to the first data query request to the computing engine.
If the result data corresponding to the first data query request is queried in the cache database, the cache database can directly send the result data to the calculation engine. Specifically, the data sending unit in the cache database sends the result data corresponding to the first data query request to the calculation engine.
Since the cache database is deployed in a server or server cluster near the server cluster where the computing engine is located, or even in the same cluster as the computing engine, the sending of the result data by the cache database to the computing engine constitutes short-circuit network transmission, or even local transmission. Obtaining the result data corresponding to the first data query request directly from the cache database greatly shortens the data transmission distance and reduces the pressure of the computing engine remotely accessing the storage database.
Step S303, in response to that the result data corresponding to the first data query request is not queried from the cache database, the cache database queries the result data corresponding to the first data query request from a storage database, and sends the result data corresponding to the first data query request obtained from the storage database to the computing engine.
If the result data corresponding to the first data query request is not queried in the cache database, the storage database is taken as the database queried second in order, and the result data corresponding to the first data query request is queried in the storage database. The specific method is as follows: in response to the first data query request, the computing engine sends the first data query request to the cache database, and the cache database checks whether the corresponding result data is cached in the cache database; if not, the cache database continues to query the result data corresponding to the first data query request in the storage database and sends the queried result data to the computing engine. Specifically, the second data query unit in the cache database queries the result data corresponding to the first data query request from the storage database, and the data sending unit in the cache database sends the result data obtained from the storage database to the computing engine.
Querying the result data in the storage database is also performed by the cache database. That is to say, in the data query method provided in this embodiment, the cache database not only plays the role of data caching but also the role of data query. The data query includes not only queries for result data in the cache database but also queries for result data in the storage database. In addition, the cache database also plays the role of sending the result data to the computing engine.
The execution plan further includes: calculation information used by the computing engine to calculate the result data corresponding to the first data query request.
Therefore, the method of this embodiment further includes at least: and in response to the calculation engine receiving the result data corresponding to the first data query request sent by the cache database, the calculation engine calculates the result data corresponding to the first data query request according to the calculation information. Specifically, after the data receiving unit in the computing engine receives the result data corresponding to the first data query request, the computing unit in the computing engine computes the result data corresponding to the first data query request according to the computing information.
The data query method provided by the present embodiment further includes: and storing result data corresponding to the first data query request acquired from the storage database into the cache database.
One optional implementation is: and asynchronously storing the result data corresponding to the first data query request acquired from the storage database into the cache database.
Asynchronous storage means that the cache database first sends the result data corresponding to the first data query request acquired from the storage database to the computing engine, and only then caches the result data in the cache database. Synchronous storage is the opposite: the cache database first caches the result data acquired from the storage database and then sends it to the computing engine. Asynchronous storage therefore lets the computing engine acquire the result data earlier and faster, which improves the efficiency with which the storage-compute separation framework returns the result data, or the calculation results derived from it, to the user side.
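The asynchronous-storage order (return to the engine first, cache afterwards) might be sketched with a background worker; the class, queue, and method names here are illustrative assumptions, not the patented implementation.

```python
import queue
import threading

class AsyncWriteBackCache:
    """Sketch: result data fetched from the storage database is returned
    immediately, and a background worker caches it afterwards."""

    def __init__(self, storage_db):
        self.storage_db = storage_db
        self.cache = {}
        self._pending = queue.Queue()
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def _drain(self):
        while True:
            key, value = self._pending.get()
            self.cache[key] = value       # cached only after the send
            self._pending.task_done()

    def fetch(self, key):
        if key in self.cache:
            return self.cache[key]
        value = self.storage_db[key]      # remote fetch on a cache miss
        self._pending.put((key, value))   # enqueue the asynchronous write-back
        return value                      # returned to the engine first

cache = AsyncWriteBackCache({"q1": "rows"})
result = cache.fetch("q1")   # available before the write-back completes
cache._pending.join()        # wait until the background caching is done
```

Swapping the `put` and `return` order in `fetch` would give the synchronous variant described above.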
After storing the result data corresponding to the first data query request acquired from the storage database into the cache database, the data query method provided in this embodiment further includes: and setting a first cache duration threshold value for result data corresponding to the first data query request stored in the cache database.
The first cache duration threshold refers to a manually set limit on how long the result data corresponding to a first data query request can be stored in the cache database. Specifically, if the storage duration of the result data corresponding to the first data query request in the cache database exceeds the first cache duration threshold, the result data is deleted from the cache database. In this way, the data in the cache database is prevented from expanding without limit to the point where result data corresponding to new data query requests can no longer be cached. The data query method provided by this embodiment further includes: in response to a second data query request, sending the result data corresponding to the second data query request stored in the cache database to the computing engine; this result data was stored in the cache database after being acquired from the storage database in response to the first data query request; the query time of the first data query request is earlier than that of the second data query request, and the query content of the two requests is the same.
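The first cache duration threshold behaves like a time-to-live rule, sketched below; the class is an illustrative assumption, with timestamps passed explicitly so the behavior is deterministic.

```python
class TTLCache:
    """Entries stored longer than `ttl` are deleted on access,
    mirroring the first cache duration threshold described above."""

    def __init__(self, ttl):
        self.ttl = ttl
        self._store = {}  # key -> (result_data, stored_at)

    def put(self, key, value, now):
        self._store[key] = (value, now)

    def get(self, key, now):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if now - stored_at > self.ttl:
            del self._store[key]   # storage duration exceeded the threshold
            return None
        return value
```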
Specifically, in response to a first data query request and when result data corresponding to the first data query request is not queried in the cache database, the result data corresponding to the first data query request is queried in the storage database, and the result data corresponding to the first data query request acquired from the storage database is cached in the cache database, so that the result data corresponding to the first data query request is cached in the cache database.
In response to the second data query request, since the query content of the second data query request is the same as the query content of the first data query request, the result data corresponding to the second data query request is the same as the result data corresponding to the first data query request, and then the result data can be directly obtained from the cache database.
Such as: the first data query request and the second data query request are respectively training data query requests sent by the first user terminal and the second user terminal aiming at building the first neural network model and the second neural network model, the training data needed by the first neural network model and the second neural network model are the same, and the time for sending the first data query request by the first user terminal is earlier than the time for sending the second data query request by the second user terminal. If there is no previous data query request with the same query content as the first data query request, the corresponding training data cannot be queried in the cache database in response to the first data query request, and the training data needs to be acquired from the storage database. After the cache database obtains the training data from the storage database, the training data is cached into the cache database. Therefore, in response to the second data query request, the training data can be queried in the cache database and can be directly obtained from the cache database.
Thus, the data cached in the cache database may be result data obtained from the storage database and cached in response to a prior data query request. Of course, the data cached in the cache database may also be hot data loaded into the cache database.
Based on this, the data query method provided in this embodiment further includes: storing the hot data into the cache database according to a preset time frequency.
Hot data may refer to online data frequently accessed by the computing nodes within a period of time. In this embodiment, hot data is specifically data whose query count exceeds a preset query count threshold within a preset time period. For example: data queried more than 10 times in one day may be set as hot data, or data queried more than 30 times in a week. The definition of hot data needs to be weighed against various factors such as the storage capacity of the cache database and the frequency of data query requests. For example: a large cache database storage capacity allows the bar for hot data to be lowered, so that more data is classified as hot and stored in the cache database. When data query requests are infrequent, say 100 requests in total in a week, data queried 20 times in that week may be classified as hot data; when data query requests are frequent, say 1000 requests in a week, data queried 20 times is not hot data. There are, of course, additional reference factors that will not be described in detail here.
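The counting rule above can be sketched in a few lines; `select_hot_data` and its parameters are hypothetical names for the sketch.

```python
from collections import Counter

def select_hot_data(query_log, threshold):
    """Return the keys whose query count in the window exceeds the
    preset query count threshold. `query_log` is the sequence of data
    keys queried during the window."""
    counts = Counter(query_log)
    return {key for key, n in counts.items() if n > threshold}
```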
The preset time frequency refers to the manually set frequency at which hot data is refreshed into the cache database, representing the number of times hot data is loaded into the cache database within a period of time. The setting of this frequency needs to be determined by the actual application: for a big data scenario where data updates quickly, the frequency can be set higher; for a scenario where data updates slowly, it can be set lower.
The hot data is loaded into the cache database regularly, which is equivalent to caching the result data with high possibility of being inquired into the cache database in advance, so that when the data inquiry request is responded, the possibility that the result data corresponding to the data inquiry request is inquired into the cache database is increased, and the short-circuit transmission of the result data is realized with the maximum possibility.
Hot data has a certain timeliness and generally degrades into warm data and then cold data over time.
Therefore, this embodiment further provides an optional implementation manner, that is, setting a second cache duration threshold for the hot data stored in the cache database, specifically: and if the storage time of the hot data in the cache database is longer than the second cache duration threshold, deleting the hot data from the cache database.
The second cache duration threshold may be a manually set limit on how long hot data can be stored in the cache database. When the storage time of hot data in the cache database exceeds the second cache duration threshold, the data can be considered to have become warm or cold and may be removed from the cache database. In this way, the data in the cache database is prevented from expanding without limit to the point where new hot data can no longer be cached, and it is guaranteed that the hot data of the current time period is always stored in the cache database.
Of course, different pieces of hot data inevitably take different amounts of time to turn into warm or cold data: some stay hot longer, some for a shorter time. To ensure that hot data with a longer hot period can be stored in the cache database for a longer time, this embodiment provides an optional implementation: after a piece of hot data is queried, the calculation of its storage duration in the cache database restarts. Data is hot precisely because it is accessed by the computing nodes a large number of times, so restarting the storage-duration clock on each query distinguishes hot data with a longer hot period from hot data with a shorter one, allowing the former to remain in the cache database longer. On this basis, when a data query request is served, the probability that the corresponding result data is found directly in the cache database increases, realizing short-circuit transmission of the result data with the greatest possible likelihood.
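Restarting the storage-duration clock on each query is a sliding-expiration rule, sketched below; as before, the class is an illustrative assumption with explicit timestamps for determinism.

```python
class SlidingTTLCache:
    """Sketch of the variant above: a hot entry's storage time restarts
    whenever it is queried, so data that stays hot stays cached."""

    def __init__(self, ttl):
        self.ttl = ttl
        self._store = {}  # key -> (value, last_touched)

    def put(self, key, value, now):
        self._store[key] = (value, now)

    def get(self, key, now):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, last_touched = entry
        if now - last_touched > self.ttl:
            del self._store[key]          # hot period has lapsed
            return None
        self._store[key] = (value, now)   # restart the storage-time clock
        return value
```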
In order to provide a better data query mode, the data query method provided by this embodiment further includes: deploying the cache database and the computing engine in the same cluster.
A cluster may refer to a server cluster in which the computing engine, the cache database, or the storage database is deployed. Each server cluster includes a plurality of servers; the number of servers is not fixed and needs to be determined by the actual service conditions. If the amount of service data is small, a cluster may need only a few servers; if the amount of service data is very large, a cluster may include tens or even hundreds of servers.
Deploying the cache database and the computing engine in one cluster shortens the distance between them, so that the query speed and the data transmission speed are higher and short-circuit transmission of data is realized.
Further, the cache database further comprises a data sending unit, and the data sending unit is used for sending the result data corresponding to the first data query request to the computing engine; the calculation engine further comprises a data receiving unit, and the data receiving unit is used for receiving result data corresponding to the first data query request sent by the cache database. On the premise that the cache database and the computing engine are deployed in a cluster, in order to realize closer-range transmission and even local transmission of data, the data sending unit in the cache database and the data receiving unit in the computing engine can be deployed in a node-sharing manner.
The node may refer to a single server in a cluster, and each server in the cluster is a node of the cluster. And all nodes in the cluster are connected through a network.
The cache database, the storage database and the calculation engine all comprise a plurality of units, and each unit can be independently deployed on one or more nodes in the cluster. If the data sending unit in the cache database and the data receiving unit in the computing engine are deployed in a node sharing manner, local transmission of data can be realized, and the sending of the data query request and the transmission of result data do not need to pass through a network, which is equivalent to local reading.
The first embodiment describes the data query method provided by the present application in detail in an optional implementation manner, and the data query method described in the present application includes, but is not limited to, the method described in the first embodiment.
The second embodiment of the application provides a data query method. In this embodiment, the data query method described in this application will be described in more detail by way of a more specific example.
In this embodiment, the calculation engine is a Spark operator, the cache database is an Alluxio storage system, and the storage database is an HDFS persistent storage system.
The Spark operator is a computing engine for real-time data processing, a fast general-purpose engine specially designed for large-scale data processing, and can be understood as a calculation method over Spark RDDs (Resilient Distributed Datasets).
The Alluxio storage system is a virtual distributed storage system, is a framework with a memory as a center, unifies the data access speed by the memory speed, accelerates the data access speed, and brings remarkable performance improvement for big data processing.
The HDFS persistent storage system is a distributed file system easy to expand, stores oversized files in a streaming data access mode, and runs on a commercial hardware cluster.
This embodiment provides an optimized system deployment method: the Spark operator and the Alluxio storage system are deployed in the same cluster, and the worker unit of the Spark operator and the worker unit of the Alluxio storage system are deployed on the same nodes. The worker unit of the Spark operator is mainly responsible for managing node memory, the usage of the computing unit, receiving distributed resource instructions, and so on. The worker unit of the Alluxio storage system is mainly responsible for managing the data resources distributed to the Alluxio storage system, file data transmission, and so on. Therefore, co-node deployment of the two worker units shortens the distance between the Spark operator and the Alluxio storage system, so that the transmission of data from the Alluxio storage system to the Spark operator is equivalent to local transmission and does not need to pass through the network.
Specifically, the Spark operator and the Alluxio storage system may be deployed in a first cluster, and the HDFS persistent storage system may be deployed in a second cluster. In the first cluster, co-node deployment is carried out on a worker unit of a Spark operator and a worker unit of an Alluxio storage system.
Fig. 4 is a signaling flowchart of the data query method provided in this embodiment.
As shown in fig. 4, the data query method provided in this embodiment includes the following steps:
step S401, responding to a data query request sent by a user side, analyzing the data query request by a Spark operator, and generating an execution plan corresponding to the data query request.
The data Query request sent by the user side may be a Structured Query Language (SQL) Query statement, and the SQL Query statement is a standard computer Language for accessing and processing a database.
Specifically, in this step, the driver unit of the Spark operator parses the SQL query statement sent by the user side into a form the Spark operator can understand, generates an execution plan, and tells the Spark operator where to acquire the result data and how to calculate on it after it is acquired.
The driver unit of the Spark operator has the main function of being responsible for analysis of applications, is a data analysis unit in the Spark operator, and can analyze SQL query statements into execution plans such as calculation tasks, query tasks and the like.
The execution plan includes at least: and the distribution information of the result data corresponding to the data query request in the Alluxio storage system refers to the approximate distribution condition of the result data corresponding to the data query request in the Alluxio storage system. Such as: the Alluxio storage system is divided into a plurality of partitions, and the distribution information corresponds to the partition position of the result data, but not the specific position of the result data.
The execution plan further includes at least: the data requirement information corresponding to the data query request refers to relevant information of result data corresponding to the data query request, such as: content information, feature information, quantity information, etc.
The execution plan further includes at least: and the data processing information corresponding to the data query request refers to a calculation method which needs to be carried out on result data after the Spark operator acquires the result data corresponding to the data query request.
In step S402, the Spark operator sends an execution plan to the Alluxio storage system.
Specifically, the Spark operator sends information (such as distribution information of result data corresponding to the data query request in the Alluxio storage system, data demand information corresponding to the data query request, and the like) related to the data query in the execution plan to the master unit of the Alluxio storage system.
The main function of the master unit of the Alluxio storage system is responsible for managing the global metadata in the Alluxio storage system, and is a data management unit of the Alluxio storage system.
Step S403, the Alluxio storage system queries, according to the execution plan, result data corresponding to the data query request in the Alluxio storage system, and determines whether the result data corresponding to the data query request is in the Alluxio storage system.
Specifically, the master unit of the Alluxio storage system judges whether the result data corresponding to the data query request is cached in the Alluxio storage system according to the distribution information of the result data corresponding to the data query request in the Alluxio storage system and the data demand information corresponding to the data query request.
And S404, if the result data corresponding to the data query request is queried in the Alluxio storage system, sending the result data to a Spark operator.
Specifically, the worker unit of the Alluxio storage system sends the result data to the worker unit of the Spark operator. Because the worker unit of the Spark operator and the worker unit of the Alluxio storage system are deployed in a node sharing mode, the transmission of the result data from the worker unit of the Alluxio storage system to the worker unit of the Spark operator is equivalent to local transmission, and network support is not needed.
Step S405, if the result data corresponding to the data query request is not found in the Alluxio storage system, the Alluxio storage system dispatches a thread to the HDFS persistent storage system and queries the result data corresponding to the data query request in the HDFS persistent storage system.
Specifically, a worker unit of the Alluxio storage system dispatches the thread to the HDFS persistent storage system.
A thread is the smallest unit of execution that the operating system can schedule. It is contained within a process and is the actual unit of execution in the process: a thread is a single sequential flow of control within a process, multiple threads can execute concurrently in one process, and each thread can perform a different task.
The thread dispatched in this embodiment performs the task of acquiring the result data corresponding to the data query request from the HDFS persistent storage system.
Step S406, the HDFS persistent storage system sends the result data corresponding to the data query request to the Alluxio storage system.
Specifically, the HDFS persistent storage system transmits result data corresponding to the data query request to a worker unit of the Alluxio storage system through remote network transmission.
Step S407, the Alluxio storage system sends the obtained result data corresponding to the data query request to a Spark operator.
Specifically, the worker unit of the Alluxio storage system sends the result data to the worker unit of the Spark operator.
Step S408, the Alluxio storage system caches the acquired result data corresponding to the data query request in the Alluxio storage system.
When responding to a subsequent data query request of the same query content, the result data can be directly acquired from the Alluxio storage system.
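Taken together, steps S403 to S408 describe a read-through cache: a hit is returned directly, while a miss is fetched from the persistent store on a dispatched thread and then cached so that a later identical query is served locally. A minimal sketch under that reading (the class and the `backing_fetch` callback are hypothetical stand-ins, not Alluxio or HDFS APIs):

```python
import threading

class ReadThroughCache:
    """Minimal sketch of steps S403-S408: serve hits from the cache,
    fetch misses from the persistent store on a dispatched thread, then
    cache the result. `backing_fetch` stands in for the HDFS read."""

    def __init__(self, backing_fetch):
        self._data = {}
        self._lock = threading.Lock()
        self._backing_fetch = backing_fetch

    def query(self, key):
        with self._lock:
            if key in self._data:              # S403/S404: hit, short-circuit
                return self._data[key]
        result = {}
        # S405: dispatch a thread whose task is fetching from the backing store
        t = threading.Thread(
            target=lambda: result.update(value=self._backing_fetch(key)))
        t.start()
        t.join()                               # S406: remote transfer completes
        with self._lock:
            self._data[key] = result["value"]  # S408: cache for later queries
        return result["value"]                 # S407: return to the compute engine
```

Here `backing_fetch` plays the role of the remote HDFS read in step S406; because the result is cached in step S408, a second identical query never reaches the backing store.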
After the above steps S401 to S408 are completed, the data query described in this embodiment is complete. After the Spark operator receives the result data corresponding to the data query request sent by the Alluxio storage system, the following steps may also be executed:
Step S409, the Spark operator calculates the received result data corresponding to the data query request according to the execution plan, and obtains calculation result data of the result data corresponding to the data query request.
Specifically, the Spark operator calculates the result data corresponding to the received data query request according to the data processing information corresponding to the data query request.
Step S410, the Spark operator sends the calculation result data of the result data corresponding to the data query request to the user side.
In steps S405 to S406, if the result data corresponding to the data query request is not found in the Alluxio storage system, the result data corresponding to the data query request needs to be obtained from the HDFS persistent storage system. Transferring the result data corresponding to the data query request from the HDFS persistent storage system to the Alluxio storage system is, in fact, a remote transmission.
Therefore, in order to reduce remote transmission of data as much as possible, the method of this embodiment further includes: storing hot data in the Alluxio storage system at a preset time frequency.
The hot data refers to online data frequently accessed by the computing node within a period of time.
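Using the definition given later in this application (data whose query count within a preset time period exceeds a preset threshold), hot-data selection can be sketched as follows; the function name and the `(timestamp, key)` log format are assumptions for illustration:

```python
from collections import Counter

def select_hot_data(query_log, now, window, count_threshold):
    """Return the keys queried more than `count_threshold` times within
    the last `window` time units; these are the 'hot data' candidates to
    preload into the cache. `query_log` is a list of (timestamp, key)
    pairs -- an assumed format, not part of the described system."""
    recent = (key for ts, key in query_log if now - ts <= window)
    counts = Counter(recent)
    return {key for key, n in counts.items() if n > count_threshold}
```

A scheduler could run this selection at the preset time frequency and load the returned keys into the cache ahead of the queries that will need them.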
Periodically loading the hot data into the Alluxio storage system is equivalent to caching result data that is likely to be queried into the Alluxio storage system in advance. Thus, when responding to a data query request, the likelihood that the result data corresponding to the data query request is found directly in the cache database increases, and short-circuit transmission of the result data is achieved with the greatest possible likelihood.
Hot data has a certain timeliness, so when hot data is periodically loaded into the Alluxio storage system, a caching time threshold is configured for it. If the storage time of the hot data in the Alluxio storage system exceeds the cache time threshold, the hot data is deleted from the Alluxio storage system. This prevents unbounded growth of data in the Alluxio storage system, which would otherwise leave no room to cache new hot data, and ensures that the hot data of the current time period is always stored in the Alluxio storage system.
Of course, different hot data remain hot for different lengths of time. To ensure that hot data with a longer heat period can be stored in the Alluxio storage system for a longer time, this embodiment provides an optional implementation: after the hot data is queried, the storage time of the hot data in the Alluxio storage system is restarted. In this way, hot data with a longer heat period is distinguished from hot data with a shorter heat period and can be stored in the Alluxio storage system for a longer time.
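The two rules just described, a caching time threshold plus restarting the clock on each query, can be sketched as a small TTL cache; the class and its injectable `clock` parameter are hypothetical helpers, not Alluxio configuration:

```python
import time

class TtlCache:
    """Sketch of the caching-time threshold with restart-on-query: an
    entry is deleted once its storage time exceeds `ttl_seconds`, and a
    successful query resets the entry's clock, so data that stays hot
    longer survives longer. Illustrative only."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock                   # injectable for testing
        self._entries = {}                   # key -> (value, last_touched)

    def put(self, key, value):
        self._entries[key] = (value, self.clock())

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, touched = entry
        if self.clock() - touched > self.ttl:
            del self._entries[key]           # storage time exceeded the threshold
            return None
        self._entries[key] = (value, self.clock())  # query restarts the clock
        return value
```

With a threshold of 10 time units, an entry queried at t=8 now survives until t=18 rather than t=10, which is exactly how longer-lived hot data is kept cached longer than shorter-lived hot data.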
Because hot data is loaded into the Alluxio storage system periodically, when responding to a data query request the likelihood that the result data corresponding to the data query request is found directly in the Alluxio storage system increases, and short-circuit transmission of the result data is achieved with the greatest possible likelihood.
The second embodiment describes the data query method provided by the present application in an optional implementation manner; the data query method described in the present application includes, but is not limited to, the method described in the second embodiment.
A third embodiment of the present application provides a data query system, and fig. 5 is a schematic structural diagram of the data query system provided in this embodiment.
As shown in fig. 5, the data query system provided in this embodiment includes: a calculation engine 501, a cache database 502 and a storage database 503; the cache database 502 is disposed between the computing engine 501 and the storage database 503, and the cache database 502 stores part of the data stored in the storage database 503 to provide short-circuit data transmission for the computing engine 501.
Optionally, the cache database 502 is deployed in a cluster together with the computing engine 501.
Optionally, the data sending unit in the cache database 502 and the data receiving unit in the calculation engine 501 are deployed on the same node.
The calculation engine 501 includes: an information sending unit and a data receiving unit;
the information sending unit is configured to send a data query request to the cache database 502;
the data receiving unit is configured to receive result data corresponding to the data query request sent by the cache database 502;
optionally, the calculation engine 501 further includes: the device comprises an information receiving unit, a data analysis unit, a calculation unit and a first data sending unit;
the information receiving unit is configured to receive the data query request sent by the user side.
The data analysis unit is configured to analyze the data query request and generate an execution plan corresponding to the data query request, where the execution plan at least includes: data demand information corresponding to the data query request, distribution information of result data corresponding to the data query request in the cache database 502, and calculation information of the result data corresponding to the data query request by the calculation engine 501;
and the computing unit is used for computing the result data corresponding to the data query request according to the computing information.
The first data sending unit is configured to send calculation result data of result data corresponding to the data query request to the user side.
The cache database 502 includes: a first data query unit, a second data query unit, a second data sending unit, and a data caching unit;
the first data query unit is configured to query result data corresponding to the data query request from the cache database 502;
the second data query unit is configured to query result data corresponding to the data query request from the storage database 503;
the second data sending unit is configured to send result data corresponding to the data query request to the calculation engine 501;
the data caching unit is used for caching result data corresponding to the data query request;
optionally, the data cache unit is further configured to store the thermal data.
The storage database 503 includes: a data storage unit;
and the data storage unit is used for persistently storing the result data corresponding to the query request.
Optionally, the storage database 503 further includes: a third data transmission unit;
the third data sending unit is configured to send result data corresponding to the data query request to the cache database 502.
A fourth embodiment of the present application provides a data query device, which is applied to data query under a storage-computation separation architecture; a cache database is deployed between the computing engine and the storage database in the storage-computation separation architecture, and the cache database stores part of the data stored in the storage database and provides short-circuit data transmission for the computing engine. Fig. 6 is a schematic structural diagram of the data query apparatus provided in this embodiment.
As shown in fig. 6, the data query apparatus provided in this embodiment includes: a first data query module 601, a second data query module 602, and a data transmission module 603;
the first data query module 601 is configured to, in response to receiving a first data query request sent by a computing engine, query, from a cache database, result data corresponding to the first data query request.
Optionally, in response to the cache database receiving the first data query request sent by the computing engine, querying the result data corresponding to the first data query request from the cache database specifically includes:
generating an execution plan corresponding to the first data query request in the computing engine, wherein the execution plan comprises: data demand information corresponding to the first data query request and distribution information of result data corresponding to the first data query request in the cache database;
the calculation engine sends the data demand information and the distribution information to the cache database;
the cache database queries the result data corresponding to the first data query request in the cache database according to the data demand information and the distribution information.
The second data query module 602 is configured to query, in response to not querying the result data corresponding to the first data query request from the cache database, the result data corresponding to the first data query request from a storage database.
Optionally, after sending the result data corresponding to the first data query request acquired from the storage database to the computing engine, the method further includes: and storing result data corresponding to the first data query request acquired from the storage database into the cache database.
Optionally, after storing the result data corresponding to the first data query request acquired from the storage database into the cache database, the method further includes: setting a first cache duration threshold for result data corresponding to the first data query request stored in the cache database, specifically:
and if the storage duration of the result data corresponding to the first data query request in a cache database is greater than the first cache duration threshold, deleting the result data corresponding to the first data query request from the cache database.
Optionally, the method further includes: responding to a second data query request, and sending result data corresponding to the second data query request stored in the cache database to a computing engine;
the result data is obtained by responding to the first data query request and storing the result data corresponding to the first data query request acquired from the storage database into the cache database; the query time of the first data query request is earlier than the query time of the second data query request, and the query content of the first data query request is the same as the query content of the second data query request.
The data sending module 603 is configured to, in response to the result data corresponding to the first data query request being queried from the cache database, send the result data corresponding to the first data query request to a computing engine; and the data processing device is also used for sending result data corresponding to the first data query request acquired from the storage database to the computing engine.
Optionally, the method further includes: and co-clustering and deploying the cache database and the computing engine.
Optionally, the cache database further includes a data sending unit, where the data sending unit is configured to send result data corresponding to the first data query request to the computing engine;
the computing engine further comprises a data receiving unit, and the data receiving unit is used for receiving result data corresponding to the first data query request sent by the cache database;
the method further comprises the following steps: and the data sending unit and the data receiving unit are deployed in a common node mode.
Optionally, the method further includes: storing hot data into the cache database at a preset time frequency, where the hot data is specifically: data whose query count within a preset time period exceeds a preset query count threshold.
Optionally, the method further includes: setting a second cache duration threshold for the hot data stored in the cache database, specifically: if the storage duration of the hot data in the cache database is greater than the second cache duration threshold, the hot data is deleted from the cache database.
Optionally, after the hot data is queried, the storage duration of the hot data in the cache database is restarted.
Optionally, the execution plan further includes: calculation information of the result data corresponding to the first data query request by the computing engine; the method further includes:
and in response to the calculation engine receiving the result data corresponding to the first data query request sent by the cache database, the calculation engine calculates the result data corresponding to the first data query request according to the calculation information.
A fifth embodiment of the present application provides an electronic device. Fig. 7 is a schematic structural diagram of the electronic device provided in this embodiment.
As shown in fig. 7, the electronic device provided in this embodiment includes: a memory 701 and a processor 702.
The memory 701 is used for storing computer instructions for executing the data query method.
The processor 702 is configured to execute the computer instructions stored in the memory 701, and perform the following operations:
responding to a first data query request sent by the computing engine received by the cache database, and querying result data corresponding to the first data query request from the cache database;
responding to result data corresponding to the first data query request queried from the cache database, and sending the result data corresponding to the first data query request to the computing engine by the cache database;
in response to the result data corresponding to the first data query request not being queried from the cache database, the cache database queries the result data corresponding to the first data query request from a storage database, and sends the result data corresponding to the first data query request acquired from the storage database to the computing engine.
Optionally, after sending the result data corresponding to the first data query request acquired from the storage database to the computing engine, the method further includes: and storing result data corresponding to the first data query request acquired from the storage database into the cache database.
Optionally, after storing the result data corresponding to the first data query request acquired from the storage database into the cache database, the method further includes: setting a first cache duration threshold for result data corresponding to the first data query request stored in the cache database, specifically:
and if the storage duration of the result data corresponding to the first data query request in a cache database is greater than the first cache duration threshold, deleting the result data corresponding to the first data query request from the cache database.
Optionally, the method further includes: responding to a second data query request, and sending result data corresponding to the second data query request stored in the cache database to a computing engine;
the result data is obtained by responding to the first data query request and storing the result data corresponding to the first data query request acquired from the storage database into the cache database; the query time of the first data query request is earlier than the query time of the second data query request, and the query content of the first data query request is the same as the query content of the second data query request.
Optionally, the method further includes: and co-clustering and deploying the cache database and the computing engine.
Optionally, the cache database further includes a data sending unit, where the data sending unit is configured to send result data corresponding to the first data query request to the computing engine;
the computing engine further comprises a data receiving unit, and the data receiving unit is used for receiving result data corresponding to the first data query request sent by the cache database;
the method further comprises the following steps: and the data sending unit and the data receiving unit are deployed in a common node mode.
Optionally, the method further includes: storing hot data into the cache database at a preset time frequency, where the hot data is specifically: data whose query count within a preset time period exceeds a preset query count threshold.
Optionally, the method further includes: setting a second cache duration threshold for the hot data stored in the cache database, specifically: if the storage duration of the hot data in the cache database is greater than the second cache duration threshold, the hot data is deleted from the cache database.
Optionally, after the hot data is queried, the storage duration of the hot data in the cache database is restarted.
Optionally, in response to the cache database receiving the first data query request sent by the computing engine, querying the result data corresponding to the first data query request from the cache database is specifically:
generating an execution plan corresponding to the first data query request in the computing engine, wherein the execution plan comprises: data demand information corresponding to the first data query request and distribution information of result data corresponding to the first data query request in the cache database;
the calculation engine sends the data demand information and the distribution information to the cache database;
and the cache database queries the result data corresponding to the first data query request in the cache database according to the data demand information and the distribution information.
Optionally, the execution plan further includes: calculation information of the result data corresponding to the first data query request by the computing engine; the method further includes:
and in response to the calculation engine receiving the result data corresponding to the first data query request sent by the cache database, the calculation engine calculates the result data corresponding to the first data query request according to the calculation information.
A sixth embodiment of the present application provides a computer-readable storage medium, which includes computer instructions, when executed by a processor, for implementing the method of the embodiments of the present application.
It is noted that the terms "first," "second," and the like herein are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Furthermore, the terms "comprising," "having," "including," and "containing," and other similar forms, are intended to be equivalent in meaning and open-ended: an item or items following any one of these terms is not meant to be an exhaustive listing of such item or items, nor limited to only the listed item or items.
As used herein, unless otherwise expressly specified, the term "or" encompasses all possible combinations, except where infeasible. For example, if it is stated that a database may include A or B, then, unless specifically stated otherwise or infeasible, the database may include A, or B, or A and B. As a second example, if it is stated that a database may include A, B, or C, then, unless specifically stated otherwise or infeasible, the database may include A, or B, or C, or A and B, or A and C, or B and C, or A and B and C.
It is to be noted that the above-described embodiments may be implemented by hardware or software (program code), or a combination of hardware and software. If implemented in software, it may be stored in the computer-readable medium described above. The software, when executed by a processor, may perform the methods disclosed above. The computing unit and other functional units described in this disclosure may be implemented by hardware or software, or a combination of hardware and software. It will also be understood by those skilled in the art that the modules/units may be combined into one module/unit, and each module/unit may be further divided into a plurality of sub-modules/sub-units.
In the foregoing detailed description, embodiments have been described with reference to numerous specific details that may vary from implementation to implementation. Certain adaptations and modifications of the described embodiments may occur. Other embodiments will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims. The sequence of steps shown in the figures is also for illustrative purposes only and is not meant to be limited to any particular sequence of steps. Thus, those skilled in the art will appreciate that the steps may be performed in a different order while performing the same method.
In the drawings and detailed description of the present application, exemplary embodiments are disclosed. However, many variations and modifications may be made to these embodiments. Accordingly, although specific terms are employed, they are used in a generic and descriptive sense only and not for purposes of limitation.

Claims (17)

1. A data query method, characterized in that the method is applied to data query under a storage-computation separation architecture; a cache database is deployed between a computing engine and a storage database in the storage-computation separation architecture, and part of the data stored in the storage database is stored in the cache database to provide short-circuit data transmission for the computing engine; the method comprises the following steps:
responding to a first data query request sent by the computing engine and received by the cache database, and querying result data corresponding to the first data query request from the cache database;
responding to result data corresponding to the first data query request queried from the cache database, and sending the result data corresponding to the first data query request to the computing engine by the cache database;
in response to the result data corresponding to the first data query request not being queried from the cache database, the cache database queries the result data corresponding to the first data query request from a storage database, and sends the result data corresponding to the first data query request acquired from the storage database to the computing engine.
2. The method of claim 1, wherein after sending the result data corresponding to the first data query request obtained from the storage database to the compute engine, the method further comprises: and storing result data corresponding to the first data query request acquired from the storage database into the cache database.
3. The method according to claim 2, wherein after storing the result data corresponding to the first data query request obtained from the storage database into the cache database, the method further comprises: setting a first cache duration threshold for result data corresponding to the first data query request stored in the cache database, specifically:
and if the storage duration of the result data corresponding to the first data query request in a cache database is greater than the first cache duration threshold, deleting the result data corresponding to the first data query request from the cache database.
4. The method of claim 2, further comprising: responding to a second data query request, and sending result data corresponding to the second data query request stored in the cache database to a computing engine;
the result data is obtained by responding to the first data query request and storing the result data corresponding to the first data query request acquired from the storage database into the cache database; the query time of the first data query request is earlier than the query time of the second data query request, and the query content of the first data query request is the same as the query content of the second data query request.
5. The method of claim 1, further comprising: and co-clustering and deploying the cache database and the computing engine.
6. The method of claim 5, wherein the cache database further comprises a data sending unit, and the data sending unit is configured to send result data corresponding to the first data query request to the computing engine;
the computing engine further comprises a data receiving unit, and the data receiving unit is used for receiving result data corresponding to the first data query request sent by the cache database;
the method further comprises the following steps: and the data sending unit and the data receiving unit are deployed in a node-sharing mode.
7. The method of claim 1, further comprising: storing hot data into the cache database at a preset time frequency, wherein the hot data is specifically: data whose query count within a preset time period exceeds a preset query count threshold.
8. The method of claim 7, further comprising: setting a second cache duration threshold for the hot data stored in the cache database, specifically: if the storage duration of the hot data in the cache database is greater than the second cache duration threshold, deleting the hot data from the cache database.
9. The method of claim 8, wherein the storage duration of the hot data in the cache database is restarted after the hot data is queried.
10. The method according to claim 1, wherein the querying, in response to the cache database receiving a first data query request sent by the computing engine, result data corresponding to the first data query request from the cache database is specifically:
generating an execution plan corresponding to the first data query request in the computing engine, wherein the execution plan comprises: data demand information corresponding to the first data query request and distribution information of result data corresponding to the first data query request in the cache database;
the calculation engine sends the data demand information and the distribution information to the cache database;
and the cache database queries the result data corresponding to the first data query request in the cache database according to the data demand information and the distribution information.
11. The method of claim 10, wherein the execution plan further comprises: calculation information of the result data corresponding to the first data query request by the computing engine; the method further comprises:
and in response to the calculation engine receiving the result data corresponding to the first data query request sent by the cache database, the calculation engine calculates the result data corresponding to the first data query request according to the calculation information.
12. A data query system, the system comprising: the system comprises a calculation engine, a cache database and a storage database; the cache database is deployed between the computing engine and the storage database, and part of data stored in the storage database is stored in the cache database and provides short-circuit data transmission for the computing engine;
the calculation engine, comprising: an information transmitting unit; a data receiving unit;
the information sending unit is used for sending a data query request to the cache database;
the data receiving unit is used for receiving result data corresponding to the data query request sent by the cache database;
the cache database comprises: the device comprises a first data query unit, a second data query unit, a data sending unit and a data caching unit;
the first data query unit is used for querying result data corresponding to the data query request from the cache database;
the second data query unit is used for querying result data corresponding to the data query request from the storage database;
the data sending unit is used for sending result data corresponding to the data query request to the computing engine;
the data caching unit is used for caching result data corresponding to the data query request;
the storage database comprises: a data storage unit;
and the data storage unit is used for persistently storing the result data corresponding to the query request.
13. The system of claim 12, wherein the cache database is deployed co-clustered with the compute engine.
14. The system of claim 13, wherein the data sending unit in the cache database is co-node deployed with the data receiving unit in the compute engine.
15. A data query device, characterized in that the device is applied to data query under a storage-computation separation architecture; a cache database is deployed between a computing engine and a storage database in the storage-computation separation architecture, wherein the cache database stores part of the data stored in the storage database and provides short-circuit data transmission for the computing engine; the device comprises: a first data query module, a second data query module, and a data sending module;
the first data query module is used for responding to a first data query request sent by a computing engine and querying result data corresponding to the first data query request from a cache database;
the second data query module is used for querying, in response to the result data corresponding to the first data query request not being queried from the cache database, the result data corresponding to the first data query request from a storage database;
the data sending module is used for responding to result data corresponding to the first data query request queried from the cache database and sending the result data corresponding to the first data query request to a computing engine; and the data processing device is also used for sending result data corresponding to the first data query request acquired from the storage database to the computing engine.
16. An electronic device, comprising: a memory, a processor;
the memory to store one or more computer instructions;
the processor to execute the one or more computer instructions to implement the method of any one of claims 1 to 11.
17. A computer-readable storage medium having stored thereon one or more computer instructions which, when executed by a processor, implement the method of any one of claims 1 to 11.
CN202210646892.7A 2022-06-09 2022-06-09 Data query method, system and device and electronic equipment Pending CN115221186A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210646892.7A CN115221186A (en) 2022-06-09 2022-06-09 Data query method, system and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210646892.7A CN115221186A (en) 2022-06-09 2022-06-09 Data query method, system and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN115221186A true CN115221186A (en) 2022-10-21

Family

ID=83608726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210646892.7A Pending CN115221186A (en) 2022-06-09 2022-06-09 Data query method, system and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN115221186A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115905306A (en) * 2022-12-26 2023-04-04 北京滴普科技有限公司 Local caching method, equipment and medium for OLAP analysis database
CN116662449A (en) * 2023-06-14 2023-08-29 浙江大学 OLAP query optimization method and system based on broadcast sub-query cache
CN116974467A (en) * 2023-06-20 2023-10-31 杭州拓数派科技发展有限公司 Data caching processing method, device and system
CN116662449B (en) * 2023-06-14 2024-06-04 浙江大学 OLAP query optimization method and system based on broadcast sub-query cache


Similar Documents

Publication Publication Date Title
US10691722B2 (en) Consistent query execution for big data analytics in a hybrid database
US20190230000A1 (en) Intelligent analytic cloud provisioning
CN105049268B (en) Distributed computing resource distribution system and task processing method
CN115221186A (en) Data query method, system and device and electronic equipment
CN111221469B (en) Method, device and system for synchronizing cache data
CN111258978B (en) Data storage method
EP3678030B1 (en) Distributed system for executing machine learning, and method therefor
CN103207919A (en) Method and device for quickly inquiring and calculating MangoDB cluster
CN109639773B (en) Dynamically constructed distributed data cluster control system and method thereof
CN113515545B (en) Data query method, device, system, electronic equipment and storage medium
Chen et al. Latency minimization for mobile edge computing networks
CN116108057A (en) Distributed database access method, device, equipment and storage medium
CN115587118A (en) Task data dimension table association processing method and device and electronic equipment
CN109254981A (en) A kind of data managing method and device of distributed cache system
CN109844723B (en) Method and system for master control establishment using service-based statistics
CN113761052A (en) Database synchronization method and device
CN110825732A (en) Data query method and device, computer equipment and readable storage medium
CN111382199A (en) Method and device for synchronously copying database
JP2001282551A (en) Job processor and job processing method
CN113326335A (en) Data storage system, method, device, electronic equipment and computer storage medium
JP2002259197A (en) Active contents cache control system, active contents cache controller, its controlling method, program for control and processing active contents cache and recording medium for its program
JP5594460B2 (en) Transmission information control apparatus, method and program
CN111092943B (en) Multi-cluster remote sensing method and system of tree structure and electronic equipment
US11977487B2 (en) Data control device, storage system, and data control method
CN108733697A (en) The method and apparatus for executing data query

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination