CN113568892A - Method and equipment for carrying out data query on data source based on memory calculation - Google Patents

Info

Publication number
CN113568892A
Authority
CN
China
Prior art keywords
data
data source
query
memory
connector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110924867.6A
Other languages
Chinese (zh)
Inventor
刘睿民
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Weixun Boray Data Technology Beijing Co ltd
Original Assignee
Weixun Boray Data Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Weixun Boray Data Technology Beijing Co ltd
Priority to CN202110924867.6A
Publication of CN113568892A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/214 Database migration support
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G06F11/1448 Management of the data involved in backup or backup restore
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471 Distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Abstract

The invention discloses a method and equipment for carrying out data query on a data source based on memory calculation, which are applied to a distributed system comprising a plurality of memory databases connected in parallel, wherein when at least one data source is accessed into the system and a data query request is received, an execution plan is generated according to the data query request; if the number of the connectors connected with the data sources is multiple, pushing down the execution plan to each data source to execute based on the multiple connectors; acquiring a plurality of preliminary data corresponding to the execution plan from each data source based on each connector; obtaining result data corresponding to the query request according to the plurality of preliminary data, and returning the result data to the user or the application; the connector is used for connecting the system and each data source, and is established according to the type of the data source, so that the efficiency and the safety of data query on different data sources are improved on the basis of avoiding high cost investment.

Description

Method and equipment for carrying out data query on data source based on memory calculation
Technical Field
The present application relates to the field of database technologies, and in particular, to a method and an apparatus for performing data query on a data source based on memory computation.
Background
With the rapid development of the Internet and the Internet of Things, the wide adoption of intelligent terminal devices, and steady innovation in data processing technology, the value of data to social progress, economic development, and the digital upgrading of enterprise business has become increasingly prominent, for example in precision marketing, real-time monitoring, real-time early warning, and trend prediction. However, while people enjoy the value of data, they also face a problem: data appears in a growing variety of formats and storage modes, with more data sources, larger data volumes, and more complicated data types. For example, some data sources are text files, some are Key-Value databases, some are relational databases, some are NoSQL databases and data warehouses, and some are continuous data streams generated in real time by intelligent terminal devices. Faced with huge data volumes and heterogeneous data sources, users often have to access each data source separately through the connection client that the data source provides; a single connection cannot access several data sources at once, and data in different data sources cannot be associated within one connection. Unified connection to data across multiple heterogeneous data sources has therefore become a key problem for multi-data-source connection.
Because storage structures and data formats differ, the prior art, as shown in fig. 1, generally constructs a local database system to access different data sources. First, hardware such as servers and storage devices, and software such as an operating system, a database management system and middleware, must be purchased, and the software and hardware platform of the database system is built locally. After the local database management system is built, the source data in the different data sources is converted into a uniform database format and then stored or updated in the local database by means of an ETL tool, data migration, synchronous data replication, asynchronous data backup or the like. When data is accessed, processed or analyzed, the system loads the data in the local database system into memory for processing and analysis and feeds the results back to the user, thereby completing data access, processing and analysis across different data sources. In the absence of a better method, this approach solves the problem of sharing and accessing data from multiple data sources, but it has the following four problems:
(1) High construction cost
This approach requires a local database system to aggregate and store source data from the different data sources, so software and hardware must be purchased, and manpower, money and materials must be invested in construction and later operation and maintenance; the investment keeps growing as the data volume grows. For cost-sensitive users, the investment is very high and does not match the business development and capability requirements of the enterprise.
(2) With increasing data volume, the delay becomes increasingly significant
Because the local database system uses disk media and can process data only after loading it into memory, the delay of this approach also grows linearly as the data volume grows. For example, a user waits a long time after submitting a request before receiving a response or result, or an enterprise or government department takes a long time to generate a report; the delay is often at the level of hours and at best at the level of minutes. In the big data era, such processing speed clearly cannot meet users' timeliness requirements for real-time data processing, querying and display.
(3) Data security risks
Data may be lost or leaked while it is migrated from the data source to the local database system, so data security is at risk. For a user's core service data, such as customer information and transaction data, this approach can hardly guarantee data security reliably.
(4) Local data cannot be kept truly synchronized with the source data
Whether data migration, synchronous data replication or asynchronous backup is used, there is a certain time interval before data from the different data sources is updated in the local database system, which means that the data in the local database system cannot be kept synchronized with the source data. In application scenarios with high real-time requirements, such as economic operation analysis and stock market analysis, this lack of synchronization seriously affects the analysis results: a decision maker cannot grasp real-time dynamic information and may make a wrong judgment misled by erroneous data, which affects decision making and can cause the enterprise huge losses and missed fleeting business opportunities.
Therefore, how to improve the efficiency and security of data query on different data sources on the basis of avoiding high investment cost is a technical problem to be solved urgently at present.
Disclosure of Invention
The invention provides a method for carrying out data query on a data source based on memory calculation, which is used for solving the technical problems of high cost, low efficiency and low safety when carrying out data query on different data sources in the prior art, and is applied to a distributed system comprising a plurality of memory databases connected in parallel, and the method comprises the following steps:
when at least one data source is accessed to the system and a data query request sent by a user or an application is received, generating an execution plan according to the data query request;
if the number of the connectors connected with the data sources is multiple, pushing down the execution plan to each data source for execution based on the multiple connectors;
obtaining a plurality of preliminary data corresponding to the execution plan from each of the data sources based on each of the connectors;
obtaining result data corresponding to the query request according to the plurality of preliminary data, and returning the result data to the user or the application;
wherein the connector is a process for connecting the system with each of the data sources, and the connector is created according to the type of the data source.
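As a non-limiting illustration, the four steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the `Connector` signature, the `run_query` function, the stand-in data sources, and the use of simple concatenation to combine preliminary data are all hypothetical, and plan generation is reduced to a placeholder.

```python
from typing import Callable, Dict, List

# Hypothetical connector signature: receives the pushed-down plan, returns rows.
Connector = Callable[[str], List[dict]]


def run_query(request: str, connectors: Dict[str, Connector]) -> List[dict]:
    """Sketch of the four steps for the multi-connector case."""
    if not connectors:
        raise ValueError("no data source has been accessed")
    # Step 1: placeholder for parsing and optimizing the request into a plan.
    plan = request.strip()
    # Steps 2 and 3: push the plan down and collect preliminary data per source.
    preliminary = [connector(plan) for connector in connectors.values()]
    # Step 4: combine the preliminary data into the result data (assumption:
    # plain concatenation in connector order).
    result: List[dict] = []
    for rows in preliminary:
        result.extend(rows)
    return result


# Two stand-in data sources, for illustration only.
def orders_source(plan: str) -> List[dict]:
    return [{"id": 1, "source": "orders"}]


def logs_source(plan: str) -> List[dict]:
    return [{"id": 2, "source": "logs"}]


rows = run_query(" SELECT id ", {"orders": orders_source, "logs": logs_source})
```

In this sketch nothing is written to a local database; the rows exist only in memory between the connectors and the caller.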
In some embodiments of the present application, the method further comprises:
when an access request sent by a data source to be accessed is detected, determining a first number of the data source to be accessed and a second number of idle connectors in a data source connection pool corresponding to the type of the data source to be accessed;
if the first number is not larger than the second number, the data source to be accessed is accessed into the system based on the idle connector;
and if the first number is larger than the second number, creating a new idle connector according to a preset number or a creation number input by a user.
In some embodiments of the present application, generating an execution plan according to the data query request specifically includes:
analyzing the data query request and generating an initial query plan;
optimizing the initial query plan according to the metadata, the query cost and the index and determining an optimal index;
generating the execution plan according to the optimal index;
the metadata represents data content stored in each data source, the query cost represents resource consumption and execution duration of query, the index is determined according to the query request, and the optimal index is an index which has the minimum query cost and is matched with data stored in each data source.
In some embodiments of the present application, the system includes a plurality of distributed memory compute nodes, and a single memory compute node may run one or more of the connectors.
In some embodiments of the present application, the obtaining, from each of the data sources, a plurality of preliminary data corresponding to the execution plan based on each of the connectors includes:
obtaining a plurality of execution results corresponding to the execution plan from the data sources based on the connectors;
and converting each execution result into preliminary data according to a preset format based on each connector.
In some embodiments of the present application, the obtaining, according to the plurality of pieces of preliminary data, result data corresponding to the query request specifically includes:
loading a plurality of pieces of preliminary data to a memory, and carrying out data merging after carrying out data cleaning on the plurality of pieces of preliminary data in the memory;
and acquiring the result data according to the result of data combination.
In some embodiments of the present application, data merging is performed after data cleaning is performed on a plurality of pieces of the preliminary data in a memory, specifically:
acquiring the quantity of the preliminary data that has entered the memory;
if the quantity of the preliminary data that has entered the memory reaches a preset number, performing data cleaning and then data merging on the preliminary data that has entered the memory, and performing data cleaning and then data merging on each piece of preliminary data that subsequently enters the memory;
wherein the preset number is smaller than the total quantity of the preliminary data.
In some embodiments of the present application, after generating an execution plan according to the data query request, the method further comprises:
if the number of the connectors connected with the data source is one, pushing down the execution plan to the data source to execute based on the connectors;
and acquiring result data corresponding to the query request from the data source based on the connector, and returning the result data to the user or the application.
In some embodiments of the present application, the obtaining, based on the connector, result data corresponding to the query request from the data source specifically includes:
obtaining an execution result corresponding to the execution plan from the data source based on the connector;
converting the execution result into preliminary data according to a preset format based on the connector;
and loading the preliminary data to a memory, and performing data cleaning on the preliminary data to obtain the result data.
In some embodiments of the present application, the predetermined format is a CSV format.
Correspondingly, the present invention further provides a device for performing data query on a data source based on memory computation, where the device is applied to a distributed system including a plurality of memory databases connected in parallel, and the device includes:
the generating module is used for generating an execution plan according to a data query request when at least one data source is accessed to the system and the data query request sent by a user or an application is received;
the push-down module is used for pushing down the execution plan to each data source to be executed based on a plurality of connectors if the number of the connectors connected with the data sources is multiple;
a first acquisition module configured to acquire a plurality of preliminary data corresponding to the execution plan from each of the data sources based on each of the connectors;
a second obtaining module, configured to obtain, according to the plurality of pieces of preliminary data, result data corresponding to the query request, and return the result data to the user or the application;
wherein the connector is a process for connecting the system with each of the data sources, and the connector is created according to the type of the data source.
Correspondingly, the present invention further provides a computer-readable storage medium, where instructions are stored, and when the instructions are run on a terminal device, the terminal device is caused to execute the method for performing data query on a data source based on memory calculation as described above.
By applying the above technical solution, in a distributed system comprising a plurality of memory databases connected in parallel, when at least one data source is accessed to the system and a data query request sent by a user or an application is received, an execution plan is generated according to the data query request; if the number of the connectors connected with the data sources is multiple, the execution plan is pushed down to each data source for execution based on the multiple connectors; a plurality of preliminary data corresponding to the execution plan are obtained from each of the data sources based on each of the connectors; and result data corresponding to the query request is obtained according to the plurality of preliminary data and returned to the user or the application. The connector is a process for connecting the system with each of the data sources and is created according to the type of the data source. Because connectors are arranged between the database system and the data sources, data can be acquired from different data sources through the connectors without being stored in a local database, so that the efficiency and security of data query on different data sources are improved while high investment costs are avoided.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a diagram illustrating an architecture for accessing different data sources based on a local database system in the prior art;
FIG. 2 is a schematic flowchart illustrating a method for performing data query on a data source based on memory computation according to an embodiment of the present invention;
FIG. 3 is a flowchart illustrating a method for querying a data source based on memory computing according to another embodiment of the present invention;
FIG. 4 is an architecture diagram illustrating a single compute node running a connector in accordance with an embodiment of the present invention;
FIG. 5 is an architecture diagram illustrating a single compute node running multiple connectors in an embodiment of the invention;
FIG. 6 is a diagram illustrating merging of query results from data sources according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram illustrating an apparatus for performing data query on a data source based on memory computation according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
An embodiment of the present application provides a method for performing data query on a data source based on memory computing, where the method is applied to a distributed system including a plurality of memory databases connected in parallel, as shown in fig. 2, and the method includes the following steps:
Step S101, when at least one data source is accessed to the system and a data query request sent by a user or an application is received, an execution plan is generated according to the data query request.
In this embodiment, the system accesses at least one data source in advance. The data source may include, but is not limited to, a text file, a Key-Value database, a relational database, a NoSQL database, a data warehouse, or a continuous data stream. When a data query request sent by a user or an application is received, the SQL (Structured Query Language) statement corresponding to the query request is parsed and optimized, and an execution plan is then generated. The execution plan may take the form of an execution plan tree. The SQL statement may be parsed by a parser in the system and optimized by an optimizer; the specific process is known to those skilled in the art and is not described here.
In order to improve query efficiency, in some embodiments of the present application, an execution plan is generated according to the data query request, specifically:
analyzing the data query request and generating an initial query plan;
optimizing the initial query plan according to the metadata, the query cost and the index and determining an optimal index;
generating the execution plan according to the optimal index;
the metadata represents data content stored in each data source, the query cost represents resource consumption and execution duration of query, the index is determined according to the query request, and the optimal index is an index which has the minimum query cost and is matched with data stored in each data source.
In this embodiment, the initial query plan is optimized according to the metadata, the query cost, and the index, an index having the minimum query cost in each index and matching with data stored in each data source is determined as an optimal index, and the execution plan is generated according to the optimal index, thereby improving the query efficiency.
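As a non-limiting illustration, the selection of the optimal index can be sketched as follows. This is a minimal sketch under stated assumptions: the `CandidateIndex` type, its `cost` field (an abstract number standing in for resource consumption and execution duration), and the use of a set of column names as the metadata are hypothetical and are not specified by the present application.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class CandidateIndex:
    name: str
    column: str   # column the index covers (assumed granularity)
    cost: float   # estimated query cost in an abstract, hypothetical unit


def pick_optimal_index(candidates: Iterable[CandidateIndex],
                       available_columns: Set[str]) -> Optional[CandidateIndex]:
    """Return the cheapest candidate that matches the data described by the metadata."""
    # Keep only indexes that match data actually stored in the data sources.
    matching = [c for c in candidates if c.column in available_columns]
    # Among those, choose the one with the minimum query cost.
    return min(matching, key=lambda c: c.cost, default=None)


candidates = [
    CandidateIndex("idx_a", "customer_id", 12.0),
    CandidateIndex("idx_b", "order_date", 4.5),
    CandidateIndex("idx_c", "missing_col", 1.0),  # cheapest, but matches no stored data
]
best = pick_optimal_index(candidates, {"customer_id", "order_date"})
```

Here `idx_c` has the lowest cost but is rejected because it does not match the stored data, so `idx_b` is selected.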
Step S102, if the number of connectors connected to the data source is multiple, pushing down the execution plan to each data source for execution based on the multiple connectors.
In this embodiment, a connector is provided between the system and each data source, data transmission is performed through the connector, the connector is a process for connecting the system and each data source, the connector is created according to the type of the data source, and if the number of connectors to which the data source is connected is multiple, the execution plan is pushed down to each data source for execution based on the multiple connectors.
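As a non-limiting illustration, creating connectors by data source type and pushing the plan down through several connectors can be sketched as follows. This is a minimal sketch under stated assumptions: the factory functions, the `CONNECTOR_FACTORIES` registry and the type names are hypothetical, the connectors return strings instead of querying real sources, and threads stand in for the connector processes described above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Connector = Callable[[str], str]


# Hypothetical factories: one connector implementation per data source type.
def make_file_connector(location: str) -> Connector:
    return lambda plan: f"file:{location}:{plan}"


def make_sql_connector(location: str) -> Connector:
    return lambda plan: f"sql:{location}:{plan}"


CONNECTOR_FACTORIES: Dict[str, Callable[[str], Connector]] = {
    "text_file": make_file_connector,
    "relational": make_sql_connector,
}


def create_connector(source_type: str, location: str) -> Connector:
    """Create a connector according to the type of the data source."""
    try:
        factory = CONNECTOR_FACTORIES[source_type]
    except KeyError:
        raise ValueError(f"unsupported data source type: {source_type}") from None
    return factory(location)


def push_down(plan: str, connectors: List[Connector]) -> List[str]:
    """Run the plan through every connector concurrently; output keeps connector order."""
    if not connectors:
        return []
    with ThreadPoolExecutor(max_workers=len(connectors)) as pool:
        return list(pool.map(lambda run: run(plan), connectors))


connectors = [create_connector("text_file", "a.csv"),
              create_connector("relational", "db1")]
outputs = push_down("scan", connectors)
```

Running the connectors concurrently means the slowest data source, rather than the sum of all sources, bounds the waiting time in this sketch.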
For reliable data query of a data source, in some embodiments of the present application, the method further comprises:
when an access request sent by a data source to be accessed is detected, determining a first number of the data source to be accessed and a second number of idle connectors in a data source connection pool corresponding to the type of the data source to be accessed;
if the first number is not larger than the second number, the data source to be accessed is accessed into the system based on the idle connector;
and if the first number is larger than the second number, creating a new idle connector according to a preset number or a creation number input by a user.
In this embodiment, at least one data source needs to be accessed in advance so that its data can be queried according to a query request sent by a user or an application. A data source connection pool corresponding to each type of data source is provided in the system, and a preset number of idle connectors are created in each data source connection pool in advance; that is, the idle connectors in one data source connection pool can only be connected to data sources of the corresponding type. When an access request sent by a data source to be accessed is detected, a first number (the number of data sources to be accessed) and a second number (the number of idle connectors in the data source connection pool corresponding to the type of the data source to be accessed) are determined and compared. If the first number is not greater than the second number, the idle connectors in the data source connection pool are sufficient to access all the data sources to be accessed, and each data source to be accessed is accessed to the system based on an idle connector, either sequentially or simultaneously. If the first number is greater than the second number, the idle connectors in the data source connection pool are not sufficient to access all the data sources to be accessed, and new idle connectors corresponding to the type of the data source to be accessed need to be created. The new idle connectors can be created automatically according to the preset number, or manually by the user as needed, that is, according to a creation number input by the user; the data sources to be accessed are then accessed based on the idle connectors.
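As a non-limiting illustration, the comparison of the first number with the second number can be sketched as follows. This is a minimal sketch under stated assumptions: the `ConnectorPool` class and its method names are hypothetical, connectors are represented by plain strings, and the sketch keeps creating batches until enough idle connectors exist, a detail that the present application does not specify.

```python
from typing import Dict, List, Optional


class ConnectorPool:
    """Pool of idle connectors for ONE data source type (hypothetical names)."""

    def __init__(self, source_type: str, preset: int) -> None:
        if preset < 1:
            raise ValueError("preset must be at least 1")
        self.source_type = source_type
        self.preset = preset
        self.idle: List[str] = []
        self._created = 0
        self._create(preset)  # idle connectors are created in advance

    def _create(self, count: int) -> None:
        for _ in range(count):
            self.idle.append(f"{self.source_type}-connector-{self._created}")
            self._created += 1

    def attach(self, sources: List[str],
               creation_number: Optional[int] = None) -> Dict[str, str]:
        first, second = len(sources), len(self.idle)
        if first > second:
            # Not enough idle connectors: create more, using the preset number
            # unless the user supplied a creation number.
            batch = self.preset if creation_number is None else creation_number
            if batch < 1:
                raise ValueError("creation number must be at least 1")
            while len(self.idle) < first:  # assumption: repeat until sufficient
                self._create(batch)
        # Access each data source through one idle connector.
        return {source: self.idle.pop(0) for source in sources}


pool = ConnectorPool("relational", preset=2)
mapping = pool.attach(["db1", "db2", "db3"])
```

With a preset number of 2 and three data sources to be accessed, one extra batch of two connectors is created, three are used, and one remains idle in the pool.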
In order to increase flexibility in querying data from a data source, in some embodiments of the present application, the system includes a plurality of distributed memory compute nodes, and a single memory compute node may run one or more of the connectors.
In this embodiment, the connector runs on the memory computing node. In a specific application scenario of the present application, a single memory computing node may run only one connector, as shown in fig. 4; a single memory computing node may also run multiple connectors, such as node 1 connected to connector 1 and connector 2 as shown in fig. 5, where nodes 1-f in fig. 4 and fig. 5 are distributed memory computing nodes 1-f. The number of connectors running on a single memory computing node may be determined based on the computing resources of the memory computing node and may be flexibly set by those skilled in the art.
Step S103, acquiring a plurality of preliminary data corresponding to the execution plan from each of the data sources based on each of the connectors.
In this embodiment, each data source performs data query according to the execution plan, and after the data query is completed, each connector acquires preliminary data corresponding to the execution plan from its data source. The preliminary data differs from the result data corresponding to the query request and needs further processing.
In order to accurately obtain the preliminary data, in some embodiments of the present application, a plurality of preliminary data corresponding to the execution plan are obtained from each of the data sources based on each of the connectors, specifically:
obtaining a plurality of execution results corresponding to the execution plan from the data sources based on the connectors;
and converting each execution result into preliminary data according to a preset format based on each connector.
In this embodiment, a plurality of execution results corresponding to the execution plan are obtained from each data source based on each connector, and then each execution result is converted into each preliminary data according to a preset format based on each connector in order to improve processing efficiency.
Those skilled in the art can select different preset formats according to actual needs without affecting the scope of the present application, and in the preferred embodiment of the present application, the preset format is a CSV (Comma-Separated Values) format.
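As a non-limiting illustration, converting an execution result into preliminary data in CSV format can be sketched with the Python standard library `csv` module as follows. This is a minimal sketch under stated assumptions: the `to_csv` function, the representation of an execution result as a list of dictionaries, and the column names are hypothetical.

```python
import csv
import io
from typing import Any, Dict, List, Sequence


def to_csv(rows: List[Dict[str, Any]], columns: Sequence[str]) -> str:
    """Serialize an execution result to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(columns),
        extrasaction="ignore",   # drop fields that are not in the preset columns
        lineterminator="\n",
    )
    writer.writeheader()
    for row in rows:
        writer.writerow(row)     # missing fields are written as empty strings
    return buffer.getvalue()


# Stand-in execution result from a Key-Value data source.
kv_result = [{"key": "k1", "value": "hello, world", "ttl": 30}]
text = to_csv(kv_result, ["key", "value"])
```

Giving every connector the same column list yields preliminary data with one uniform layout regardless of the type of the underlying data source.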
Step S104, acquiring result data corresponding to the query request according to the plurality of pieces of preliminary data, and returning the result data to the user or the application.
In this step, result data corresponding to the query request is obtained according to the plurality of preliminary data, and the result data is returned to the user or the application. And then resources such as computation, storage, network and the like allocated to the query request can be released, and the next query request is waited. The user can delete or store the result data according to the requirement, and the storage can be stored in the memory or on the disk.
In order to accurately obtain result data, in some embodiments of the present application, the obtaining, according to a plurality of the preliminary data, result data corresponding to the query request specifically includes:
loading a plurality of pieces of preliminary data to a memory, and carrying out data merging after carrying out data cleaning on the plurality of pieces of preliminary data in the memory;
and acquiring the result data according to the result of data combination.
In this embodiment, the plurality of preliminary data are loaded into the memory and cleaned there, where data cleaning includes removing duplicate data and abnormal data. The cleaned data is then merged, and the result data is obtained from the merged result.
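The load, clean, and merge sequence can be illustrated with a short sketch. It assumes that preliminary data is CSV text, that "abnormal" means a row with an empty required field or a wrong number of fields, and that merging is plain concatenation; the application does not define these details, and all function names here are hypothetical.

```python
import csv
import io


def load_rows(csv_text):
    """Parse one piece of preliminary data (CSV text) into a list of dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def clean(rows, required=("id",)):
    """Drop abnormal rows and exact duplicates, keeping first occurrences."""
    seen = set()
    cleaned = []
    for row in rows:
        # A None key means the row had more fields than the header.
        if None in row or any(not row.get(name) for name in required):
            continue  # treated as abnormal data in this sketch
        key = tuple(row.items())
        if key in seen:
            continue  # duplicate row
        seen.add(key)
        cleaned.append(row)
    return cleaned


def merge_pieces(pieces):
    """Clean each piece in memory, then merge the cleaned pieces by concatenation."""
    merged = []
    for csv_text in pieces:
        merged.extend(clean(load_rows(csv_text)))
    return merged


piece_a = "id,name\n1,alice\n1,alice\n,ghost\n"
piece_b = "id,name\n2,bob\n"
result = merge_pieces([piece_a, piece_b])
```

Here the duplicate row and the row with an empty id are removed from the first piece before the two pieces are combined.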
In order to further improve data query efficiency, in some embodiments of the present application, performing data merging after performing data cleaning on the plurality of preliminary data in the memory specifically comprises:
acquiring the number of pieces of preliminary data that have entered the memory;
if the number of pieces of preliminary data that have entered the memory reaches a preset number, cleaning and then merging the preliminary data already in the memory, and thereafter cleaning and merging each further piece of preliminary data as it enters the memory in sequence;
wherein the preset number is smaller than the total number of pieces of preliminary data.
In this embodiment, cleaning and merging begin as soon as part of the preliminary data has entered the memory, which increases the data merging speed. Preferably, the preset number is 2. In a specific application scenario of the present application, as shown in Fig. 6, query results A to N in the figure are the plurality of preliminary data. When query result A and query result B have entered the memory, they are merged to obtain result data AB; when query result M enters the memory, result data AB and query result M are merged to obtain result data ABM, and so on.
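The incremental merge described above can be sketched as a generator that starts merging once a preset number of pieces has arrived and then folds in each later piece. This is an illustration only: the name incremental_merge, the clean callback, and the use of list concatenation as the merge operation are assumptions, not details from the application.

```python
def incremental_merge(arrivals, preset_number=2, clean=list):
    """Yield the running merge result each time it grows.

    `arrivals` yields pieces of preliminary data (lists of rows) in the
    order they enter memory; `clean` is applied to each piece first.
    """
    pending = []   # cleaned pieces waiting for the threshold
    merged = None  # running result; None until the threshold is reached
    for piece in arrivals:
        cleaned = clean(piece)
        if merged is None:
            pending.append(cleaned)
            if len(pending) < preset_number:
                continue
            merged = [row for part in pending for row in part]
            pending = []
        else:
            merged = merged + cleaned
        yield list(merged)
    if merged is None and pending:
        # Fewer pieces than the preset number arrived; merge what there is.
        yield [row for part in pending for row in part]


steps = list(incremental_merge([["A"], ["B"], ["M"]]))
```

With a preset number of 2, the first yielded value corresponds to result data AB and the second to result data ABM in the scenario above.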
For reliable data query of a data source, in some embodiments of the present application, after generating an execution plan according to the data query request, the method further comprises:
if the number of connectors connected to the data source is one, pushing down the execution plan to the data source for execution based on the connector;
and acquiring result data corresponding to the query request from the data source based on the connector, and returning the result data to the user or the application.
In this embodiment, if only one connector is connected to a data source, the execution plan can be pushed down directly to that data source through the connector; the result data corresponding to the query request is then obtained from the data source through the connector and returned to the user or the application.
In order to accurately obtain result data, in some embodiments of the present application, the obtaining, based on the connector, result data corresponding to the query request from the data source specifically includes:
obtaining an execution result corresponding to the execution plan from the data source based on the connector;
converting the execution result into preliminary data according to a preset format based on the connector;
and loading the preliminary data to a memory, and performing data cleaning on the preliminary data to obtain the result data.
In this embodiment, an execution result corresponding to the execution plan is obtained from the data source through the connector. To improve processing efficiency, the connector converts the execution result into preliminary data in a preset format. The preliminary data differs from the result data corresponding to the query request and requires further processing: it is loaded into the memory and cleaned there to obtain the result data, where the data cleaning may include removing duplicate data and abnormal data. In a preferred embodiment of the present application, the preset format is the CSV format.
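The single-connector path can be sketched end to end with an in-memory SQLite database standing in for the remote data source. SQLite, the SqliteConnector class, and the orders table are assumptions chosen only to keep the example self-contained; the application does not name a specific data source or connector interface.

```python
import csv
import io
import sqlite3


class SqliteConnector:
    """Hypothetical connector for one data source type (SQLite as a stand-in)."""

    def __init__(self, connection):
        self.connection = connection

    def push_down(self, sql, params=()):
        """Run the query inside the data source and return CSV text."""
        cursor = self.connection.execute(sql, params)
        columns = [description[0] for description in cursor.description]
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(columns)
        writer.writerows(cursor.fetchall())
        return buffer.getvalue()


def query_single_source(connector, sql, params=()):
    """Push down the query, load the CSV into memory, clean it, return rows."""
    preliminary = connector.push_down(sql, params)
    seen, result = set(), []
    for row in csv.DictReader(io.StringIO(preliminary)):
        key = tuple(row.items())
        if key not in seen:  # cleaning here only removes duplicate rows
            seen.add(key)
            result.append(row)
    return result


source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, city TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "Beijing"), (1, "Beijing"), (2, "Shanghai")],
)
result = query_single_source(
    SqliteConnector(source),
    "SELECT id, city FROM orders WHERE id >= ? ORDER BY id",
    (1,),
)
```

The filtering happens inside the data source; only the matching rows cross the connector, and no copy is written to a local database.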
By applying this technical solution, in a distributed system comprising a plurality of memory databases connected in parallel, when at least one data source has been accessed to the system and a data query request sent by a user or an application is received, an execution plan is generated according to the data query request. If a plurality of connectors are connected to data sources, the execution plan is pushed down to each data source for execution through the connectors; a plurality of preliminary data corresponding to the execution plan are obtained from the data sources through the connectors; and result data corresponding to the query request is obtained from the plurality of preliminary data and returned to the user or the application. Each connector is created according to the type of its data source. By arranging connectors between the database system and the data sources, data can be obtained from different data sources without being stored in a local database, which improves the efficiency and security of data queries on different data sources while avoiding high cost investment.
In order to further illustrate the technical idea of the present invention, the technical solution of the present invention will now be described with reference to specific application scenarios.
An embodiment of the present application provides a method for performing data query on a data source based on memory computing, which is applied to a distributed system including a plurality of memory databases connected in parallel, where the system further stores at least one memory database instance operated by the plurality of memory databases in parallel, and as shown in fig. 3, the method includes the following steps:
Step S201, receiving a query request sent by a user or an application.
Before step S201 is executed, the following steps are further included:
S1, the system receives at least one data source access event in real time, namely a request event for accessing a remote data source to the system, and sends a notification to a memory database instance.
S2, the memory database instance receives the notification of the data source access event in real time.
S3, after receiving the notification of the data source access event, the memory database instance determines the type of the data source, provides a corresponding connector according to that type, and accesses the data source to the system. This operation is performed for the multiple data sources one by one, thereby completing the unified cross-source connection of the multiple data sources. The connectors are created in advance by the system, and connectors of the same type are stored in the data source connection pool for that type. At least one connector can be created in advance in each data source connection pool; the number of connectors created can be a preset default value, which the user can set as required.
When the number of pre-created connectors is less than the number of data sources to be accessed, new connectors can be created and added automatically according to the default value set by the user, or the user can create connectors manually according to actual requirements, in which case the number of new connectors is also specified by the user. The connectors run on distributed memory computing nodes of the system, and one memory computing node can run one or more connectors.
S4, after at least one data source has been successfully accessed to the system, the system can receive query requests sent by a user or an application in real time.
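The pool behavior in S3 can be sketched as follows. The class ConnectorPools, its method names, and the use of string labels in place of real connector processes are hypothetical; the sketch only shows idle connectors being assigned per data source type and the pool growing in batches when there are too few.

```python
from collections import defaultdict


class ConnectorPools:
    """Hypothetical per-type connector pools; connectors are plain labels here."""

    def __init__(self, default_count=1):
        self.default_count = max(1, default_count)  # batch size for new connectors
        self.idle = defaultdict(list)               # source type -> idle connectors
        self.created = defaultdict(int)             # source type -> total created

    def create(self, source_type, count):
        """Create `count` connectors of one type and add them to that pool."""
        for _ in range(count):
            self.created[source_type] += 1
            label = f"{source_type}-connector-{self.created[source_type]}"
            self.idle[source_type].append(label)

    def attach(self, source_type, sources, create_count=None):
        """Assign one idle connector per data source, growing the pool if needed."""
        step = create_count if create_count and create_count > 0 else self.default_count
        while len(self.idle[source_type]) < len(sources):
            self.create(source_type, step)
        return {source: self.idle[source_type].pop(0) for source in sources}


pools = ConnectorPools(default_count=2)
pools.create("mysql", 1)  # one connector created in advance
assignment = pools.attach("mysql", ["db_a", "db_b", "db_c"])
```

In this run one connector exists in advance, three data sources ask for access, and a batch of two new connectors is created to cover the shortfall.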
Step S202, analyzing and optimizing the SQL statement corresponding to the query request, and generating a corresponding execution plan.
In this step, the execution plan is synchronously and concurrently transmitted to the connectors of the data sources already connected in the system, and the number of the connectors of the data sources already connected at present is determined.
Step S203, determining whether multiple connectors are connected to data sources; if so, executing step S204; otherwise, executing step S206.
Step S204, pushing down the execution plan to each data source through the connector connected to it, executing the execution plan, and querying the source data according to the query conditions.
In this step, when there are a plurality of connectors connected to the data source, each connector receives the same execution plan and pushes down to the corresponding data source connected to each connector to execute, and source data meeting the query condition is synchronously searched and read in the plurality of data sources according to the execution plan. Wherein the execution plan is pushed down to the plurality of data sources by a plurality of memory compute nodes in the system.
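Concurrent push-down of one execution plan to several connectors can be sketched with a thread pool. Threads stand in for the distributed memory computing nodes, and each connector is modeled as a plain callable; both are simplifying assumptions, and push_down_to_all is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor


def push_down_to_all(connectors, execution_plan):
    """Send the same execution plan to every connector concurrently.

    `connectors` maps a data source name to a callable that executes the
    plan on that source and returns its preliminary data.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(connectors))) as pool:
        futures = {
            name: pool.submit(run, execution_plan)
            for name, run in connectors.items()
        }
        # result() blocks until that source has finished executing the plan.
        return {name: future.result() for name, future in futures.items()}


# Stand-in connectors that only echo which source handled the plan.
connectors = {
    "mysql": lambda plan: f"mysql ran: {plan}",
    "hive": lambda plan: f"hive ran: {plan}",
}
preliminary = push_down_to_all(connectors, "SELECT id FROM t WHERE id > 10")
```

Each source works on the plan at the same time, so the overall wait is governed by the slowest source rather than by the sum of all sources.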
Step S205, converting the query results into a unified data format (e.g., CSV) through the connectors, merging them, returning the merged result to the user, and then executing step S208.
In this step, after the source data is acquired, each connector converts the source data acquired from its data source into the CSV format. The memory database instance loads the preliminary data in CSV format into the memory, performs data cleaning and data merging in the memory, and finally returns the merged result data to the user or the application.
Step S206, pushing down the execution plan directly to the corresponding data source through the connector for execution, and querying the source data according to the query conditions.
In this step, when only one connector is connected to a data source, the connector pushes the received execution plan directly down to the data source for execution, and the source data meeting the query conditions is searched and read according to the execution plan.
Step S207, converting the query result into a unified data format (e.g., CSV) through the connector, returning it to the user, and executing step S208.
In this step, after the connector converts the retrieved data into the CSV format, the memory database instance loads the preliminary data in CSV format into the memory, performs data cleaning in the memory to obtain the result data, and returns the result data to the user or the application.
Step S208, finishing the query, releasing occupied resources such as memory, network, and computation, waiting for a new query request, and deleting or storing the obtained result data according to the user's requirements.
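The branch at step S203 and the release of resources at step S208 can be outlined in a few lines. This is a simplified sketch under stated assumptions: connectors are callables, they are invoked one after another rather than in parallel, and run_query, clean, merge, and release are hypothetical names supplied by the caller.

```python
def run_query(plan, connectors, clean, merge, release):
    """Outline of steps S203 to S208 for one query request."""
    if not connectors:
        raise ValueError("at least one data source must be attached")
    try:
        # S204 / S206: every attached connector executes the same plan.
        pieces = [clean(connector(plan)) for connector in connectors]
        if len(connectors) > 1:   # S203: more than one connector
            return merge(pieces)  # S205: merge the cleaned pieces
        return pieces[0]          # S207: a single source needs no merge
    finally:
        release()                 # S208: runs even if a connector fails


released = []
result = run_query(
    "SELECT 1",
    connectors=[lambda plan: [1, 1], lambda plan: [2]],
    clean=lambda rows: list(dict.fromkeys(rows)),
    merge=lambda pieces: [row for piece in pieces for row in piece],
    release=lambda: released.append("resources released"),
)
```

Placing the release call in a finally clause is a design choice of this sketch: it guarantees that resources are freed whether the query succeeds or raises.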
By applying this technical solution, a plurality of data sources of different types can be uniformly connected through the connectors using only a single SQL statement, and the data can be processed without first being stored in a local database system through data migration, synchronous replication, or asynchronous backup. Once the data sources are connected, when a query request is received, the source data can be accessed with SQL across the different data sources and loaded directly into memory to obtain the result data. Therefore, when querying across multiple different data sources, no matter how large the data volume is or how fast it grows, the user does not need to incur a higher TCO (Total Cost of Ownership) to build a local database system. Meanwhile, because the acquired source data is loaded directly into memory rather than stored in a local database system, the intermediate steps of conventional approaches are reduced and the efficiency of reading and processing data is greatly improved. The delay of conventional approaches is reduced from the hour or minute level to the second or even millisecond level, which greatly improves the timeliness with which the database management system processes data and approaches real-time data query and processing.
In addition, because the data is read directly from the data source, intermediate steps such as data migration, backup, and writing to disk and reloading are eliminated, which avoids security risks such as data loss or leakage caused by data migration, backup, and transmission. More importantly, the data acquired each time is the latest data in the data source; that is, the queried data stays synchronized with the source data, so the processing result returned for a user request reflects the current, real-time state of the data.
Moreover, by adopting in-memory computing technology, a distributed cluster for uniformly connecting, querying, and processing cross-source heterogeneous data can be built quickly, and the user's existing distributed cluster resources can be used directly. The method can be applied in fields such as data sharing and real-time data display across industries, and computing nodes can be dynamically expanded online on demand, so that the performance of the user's data processing platform and database system remains high and the load and processing pressure brought by the rapid growth of multi-source heterogeneous data can be handled easily.
Corresponding to the method for performing data query on a data source based on memory computation in the embodiment of the present application, an embodiment of the present application further provides a device for performing data query on a data source based on memory computation, where the device is applied to a distributed system including a plurality of memory databases connected in parallel, and as shown in fig. 7, the device includes:
a generating module 701, configured to generate an execution plan according to a data query request sent by a user or an application when at least one data source accesses the system and the data query request is received;
a push-down module 702, configured to, if the number of connectors to which data sources are connected is multiple, push down the execution plan to each of the data sources for execution based on the multiple connectors;
a first obtaining module 703 for obtaining a plurality of preliminary data corresponding to the execution plan from each of the data sources based on each of the connectors;
a second obtaining module 704, configured to obtain result data corresponding to the query request according to the plurality of pieces of preliminary data, and return the result data to the user or the application;
wherein the connector is a process for connecting the system with each of the data sources, and the connector is created according to the type of the data source.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not necessarily depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. A method for performing data query on a data source based on memory calculation, wherein the method is applied to a distributed system comprising a plurality of memory databases connected in parallel, and the method comprises:
when at least one data source is accessed to the system and a data query request sent by a user or an application is received, generating an execution plan according to the data query request;
if the number of the connectors connected with the data sources is multiple, pushing down the execution plan to each data source for execution based on the multiple connectors;
obtaining a plurality of preliminary data corresponding to the execution plan from each of the data sources based on each of the connectors;
obtaining result data corresponding to the query request according to the plurality of preliminary data, and returning the result data to the user or the application;
wherein the connector is a process for connecting the system with each of the data sources, and the connector is created according to the type of the data source.
2. The method of claim 1, wherein the method further comprises:
when an access request sent by a data source to be accessed is detected, determining a first number of the data source to be accessed and a second number of idle connectors in a data source connection pool corresponding to the type of the data source to be accessed;
if the first number is not larger than the second number, the data source to be accessed is accessed into the system based on the idle connector;
and if the first number is larger than the second number, creating a new idle connector according to a preset number or a creation number input by a user.
3. The method of claim 1, wherein generating an execution plan based on the data query request includes:
analyzing the data query request and generating an initial query plan;
optimizing the initial query plan according to the metadata, the query cost and the index and determining an optimal index;
generating the execution plan according to the optimal index;
the metadata represents data content stored in each data source, the query cost represents resource consumption and execution duration of query, the index is determined according to the query request, and the optimal index is an index which has the minimum query cost and is matched with data stored in each data source.
4. The method of claim 1, wherein obtaining a plurality of preliminary data corresponding to the execution plan from each of the data sources based on each of the connectors specifically comprises:
obtaining a plurality of execution results corresponding to the execution plan from the data sources based on the connectors;
and converting each execution result into preliminary data according to a preset format based on each connector.
5. The method according to claim 4, wherein obtaining result data corresponding to the query request based on the plurality of preliminary data comprises:
loading a plurality of pieces of preliminary data to a memory, and carrying out data merging after carrying out data cleaning on the plurality of pieces of preliminary data in the memory;
and acquiring the result data according to the result of data combination.
6. The method of claim 5, wherein the data merging is performed after the data cleansing is performed on the plurality of preliminary data in the memory, specifically:
acquiring the number of pieces of preliminary data that have entered the memory;
if the number of pieces of preliminary data that have entered the memory reaches a preset number, cleaning and then merging the preliminary data already in the memory, and thereafter cleaning and merging each further piece of preliminary data as it enters the memory in sequence;
wherein the preset number is smaller than the total number of pieces of preliminary data.
7. The method of claim 1, wherein after generating an execution plan from the data query request, the method further comprises:
if the number of connectors connected to the data source is one, pushing down the execution plan to the data source for execution based on the connector;
and acquiring result data corresponding to the query request from the data source based on the connector, and returning the result data to the user or the application.
8. The method according to claim 7, wherein the obtaining of the result data corresponding to the query request from the data source based on the connector is specifically:
obtaining an execution result corresponding to the execution plan from the data source based on the connector;
converting the execution result into preliminary data according to a preset format based on the connector;
and loading the preliminary data to a memory, and performing data cleaning on the preliminary data to obtain the result data.
9. An apparatus for performing data query on a data source based on memory computation, wherein the apparatus is applied in a distributed system including a plurality of memory databases connected in parallel, the apparatus comprising:
the generating module is used for generating an execution plan according to a data query request when at least one data source is accessed to the system and the data query request sent by a user or an application is received;
the push-down module is used for pushing down the execution plan to each data source to be executed based on a plurality of connectors if the number of the connectors connected with the data sources is multiple;
a first acquisition module configured to acquire a plurality of preliminary data corresponding to the execution plan from each of the data sources based on each of the connectors;
a second obtaining module, configured to obtain, according to the plurality of pieces of preliminary data, result data corresponding to the query request, and return the result data to the user or the application;
wherein the connector is a process for connecting the system with each of the data sources, and the connector is created according to the type of the data source.
10. A computer-readable storage medium having stored therein instructions which, when run on a terminal device, cause the terminal device to perform the method of performing a data query on a data source based on memory computation of any one of claims 1-8.
CN202110924867.6A 2021-08-12 2021-08-12 Method and equipment for carrying out data query on data source based on memory calculation Pending CN113568892A (en)


Publications (1)

Publication Number Publication Date
CN113568892A true CN113568892A (en) 2021-10-29

Family

ID=78171363


Country Status (1)

Country Link
CN (1) CN113568892A (en)



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20211029
