CN106570145B - Distributed database result caching method based on hierarchical mapping - Google Patents


Info

Publication number: CN106570145B
Authority: CN (China)
Prior art keywords: data, cache, worker, virtual, page
Legal status: Active
Application number: CN201610966316.5A
Other languages: Chinese (zh)
Other versions: CN106570145A
Inventors: 庞廓, 郭皓明, 王之欣, 魏闫艳, 田霂, 焉丽
Original and current assignee: Institute of Software of CAS
Application filed by Institute of Software of CAS; priority to CN201610966316.5A; published as CN106570145A; granted and published as CN106570145B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24552 Database cache management
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Abstract

The invention provides a distributed database result caching method based on hierarchical mapping. It belongs to the field of distributed database performance optimization research and application, and particularly relates to cache management, grouping, sorting and extraction of the result set data of distributed database query operations. The invention establishes a hierarchical mapping cache framework on top of a distributed architecture: the bottom-layer local caches store and group result set data on each node, and this cache information is aggregated into an upper-layer virtual cache. Data extraction is realized through the mapping between the virtual cache and the local caches, which also supports grouping, sorting and related processing of the result set. While correctly extracting distributed database result sets, the proposed caching technique reduces the system's internal network and memory load and improves the reliability and data throughput of the distributed data system. The technique has positive application value in big-data fields such as electronic commerce, the Internet of Things and smart cities.

Description

Distributed database result caching method based on hierarchical mapping
Technical Field
The invention provides a result set caching technique for distributed databases. It belongs to the field of distributed database performance optimization research and application, and particularly relates to a method for cache management, grouping, sorting and extraction of the result set data of distributed database query operations.
Background
Data is the core resource of an information system, and extracting data that satisfies given constraints correctly and reliably is one of the key problems an information system must address. With the rapid development of the Internet of Things, social networks and related technologies, data is being generated at an explosively growing speed and scale, and whoever holds data and can exploit it well gains an advantage in industry. Business data is a core asset of an enterprise, which places very high requirements on its storage, use and management. To meet these requirements, data must be kept in database systems, and selecting an appropriate database system becomes a significant issue. Classified by storage structure, databases today fall mainly into two categories: centralized databases and distributed databases.
A centralized database is physically confined to a single location; the system and its data management are centrally controlled by a central server that all users access directly. In a centralized database, many operations on data are easy to implement, such as modification, querying, backup and access control. But when the central server fails and cannot run, no user can use the database. Moreover, because all operations are performed at the central server, the load on it is very high. When the data scale is large and data processing is very frequent, even a high-performance central server cannot meet users' demands.
A distributed database system stores data on computers at different sites of a computer network. Each site has autonomous processing capability and serves local applications, and each site also participates in the execution of at least one global application, whose programs can access data at multiple sites through network communication. Logically, however, the databases at these different sites form one whole. Because every computer can store and process data, the servers need not be very powerful; ordinary servers or even personal computers can be used, and the failure of a single machine does not cause a global outage. Local applications respond quickly, the system scales well, and existing systems are easy to integrate. In theory, a distributed database can keep increasing its storage and computing capacity by adding worker nodes. Although constrained by network transmission and control-management costs, distributed databases are far more efficient in storage and computation than centralized ones.
In view of these many advantages, distributed systems have become the mainstream direction for data storage and processing today. However, distributed databases also have drawbacks: system overhead, especially communication overhead, is high; the access structure is complex; and data security is difficult to handle.
The distributed storage architectures in common use today are mainly master-worker structures. The system consists of a master node and a plurality of worker nodes. The master node is responsible for operations such as resource allocation, scheduling, and management of the task execution process; the worker nodes handle local data storage, retrieval, querying and the like. A client establishes a connection with the master node and submits data operation requests, and the master node performs request parsing, task encapsulation, execution scheduling, result collection and feedback. A connection channel between each worker and the master carries operation messages, local execution and result feedback. Data is stored on the worker nodes; data on different nodes may or may not overlap, and all worker nodes together form the complete data set.
In such a distributed database, the extraction of a massive result set tends to cause the following three problems.
a) The single-point data throughput pressure on the master node increases. When the result set of a query is very large, aggregating all query results of the distributed database at one node accumulates a large amount of raw data there. Existing caching techniques rely on two modes: temporary tables or in-memory caching. The former consumes IO resources when a large amount of data accumulates; IO read/write operations themselves form a significant bottleneck in response time and also force other tasks to queue. In-memory caching reduces IO latency, but memory capacity is limited and cannot hold the whole result set when the data is large.
b) The master's data processing load is heavy. Generally, one query operation needs to group, sort and otherwise process its result set. When result set data is gathered at the master node for unified caching, that node must filter, sort, splice, group and otherwise process and compute over the result data according to the operation request. This places a heavy data processing load on the master node. It deviates from the basic idea of the distributed architecture: not only are the architectural advantages of the distributed database not exploited, but the increased load also raises the risk of node failure.
c) The network transmission pressure is high. Carrying the result set to the master node requires data exchange over the network. When the result data is large, the transmission consumes a large amount of communication bandwidth and occupies network connection resources, which also strains the scheduling of other tasks in the system and increases its time cost and reliability risk.
The above problems greatly affect the performance of distributed databases. Some solutions (e.g., OceanBase) push data to a secondary master node outside the master for computation; although this preserves the master node's control capability when data is congested, it does not essentially solve the problem caused by a large amount of data and computation converging at one point.
Disclosure of Invention
Aiming at these technical problems of distributed databases in big-data application scenarios, the invention provides a caching technique based on hierarchical mapping. The technique keeps the data distributed across node-local caches and aggregates only the cache information at the central node. By processing this cache information, the central node establishes a mapping relation to the underlying data, extracts data based on that mapping, and likewise realizes operations such as grouping and sorting through it. This data extraction method solves the problem current distributed databases face with large-scale result sets, and is of great significance in the database field.
Based on a distributed architecture and the hierarchical mapping technique, the invention provides grouping, sorting and extraction operations over massive result data in a distributed database. The method lets users obtain query result data quickly, greatly reduces the amount of network data transmission, relieves the single-point pressure on the master node, and improves the cluster's data throughput and the system's reliability.
The invention mainly comprises four parts: local data caching, remote virtual data caching, a grouping strategy and a sorting strategy.
1. Local caching of data: while executing a query operation, a worker node establishes a local cache for the current task, and the node's query results are stored in this local cache. The result set itself is not returned to the master; only information about the cache set is returned to the master node. A local cache set (dataSet) contains a plurality of cache pages (dataPages), each of which in turn corresponds to a file in the hard disk space. In essence, the local cache is stored on the worker in the form of a data file, an index file and a metadata file.
The local cache also contains a cursor pointer. Each time data is extracted, the cache page pointed to by the cursor is located in the task's cache set; the index file is then queried with the cursor's position within that page to obtain the data's real location; finally, the corresponding position of the data file is read to obtain the required data.
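As a minimal sketch of the lookup chain above (the data structures and names are hypothetical, with in-memory bytes standing in for the data and index files): the cursor names a page and a position, the index resolves the position to an offset and length, and the data blob is read at that location.

```python
# Hypothetical sketch: cursor -> cache page -> index entry -> data bytes.
def read_at_cursor(cache_set, cursor):
    page_no, pos = cursor
    page = cache_set["pages"][page_no]           # cache page the cursor points at
    offset, length = page["index"][pos]          # "index file": position -> location
    return page["data"][offset:offset + length]  # read the "data file"

cache_set = {"pages": [
    {"data": b"helloworld", "index": [(0, 5), (5, 5)]},
]}
print(read_at_cursor(cache_set, (0, 1)))  # b'world'
```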
2. Remote virtual caching of data: after the master issues tasks to the workers, it also creates a virtual cache set (virtualDataSet) for each query task. The virtual cache is logically similar to a worker's local cache, and each virtual cache likewise contains a plurality of virtual cache pages (virtualDataPages). The difference is that a virtual cache page stores no file handle; instead it stores the information of the corresponding local cache page on a worker. This effectively establishes a mapping between master virtual cache pages and worker local cache pages. The virtual cache also has a cursor pointer: each time a user extracts data from the master, the master finds the corresponding local cache page information in the virtual cache page pointed to by the current cursor, sends a data extraction task to the corresponding worker according to that information, and waits for the result.
The virtual cache contains no real data; it stores only the worker, cache set and cache page information locating each local cache page. Data extraction at the master end can thus be translated directly into a data extraction task at the corresponding worker end, which avoids the pressure of data accumulation at the master, while extracting only the small amount of data the user needs each time avoids network congestion.
3. Grouping: if the query results need to be grouped, the master and the workers create cache pages directly according to the required groups, so that each cache page ultimately stores the data of one group. At the worker end, as query results are returned, the group of each piece of data is checked: if a cache page for that group exists, the data is appended to it; if not, a new cache page is created and the data added. After storage completes, the worker returns summary information about its cache set to the master, including which groups exist and how many rows each group holds. The master creates virtual cache pages from this cache set information, each virtual cache page likewise corresponding to one group, and cache pages for the same group on different workers are attached to the same virtual cache page on the master.
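The worker-side grouping described above can be sketched as follows; `group_into_pages` and its plain lists standing in for cache pages are illustrative assumptions, and the summary dict plays the role of the cache set information returned to the master.

```python
# Hypothetical sketch: route each result row into the cache page for its
# group key, creating the page on first sight, then summarize group sizes.
def group_into_pages(rows, key):
    pages = {}                                   # group key -> "cache page"
    for row in rows:
        pages.setdefault(key(row), []).append(row)
    summary = {g: len(p) for g, p in pages.items()}  # returned to the master
    return pages, summary

rows = [("east", 10), ("west", 3), ("east", 7)]
pages, summary = group_into_pages(rows, key=lambda r: r[0])
print(summary)  # {'east': 2, 'west': 1}
```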
During result extraction, a data extraction request (getNext) is first submitted to the master node's cache. On receiving the request, the master's virtual cache advances the cursor position in the current cache. If the new cursor position does not exceed the upper limit of the current virtual page, the current virtual page is taken as the target page; otherwise the next virtual page becomes the target page and the cursor points to its first position. A data extraction request is then sent to the corresponding worker node according to the mapping between the target page and the underlying worker node. On receiving the request, the worker node extracts the data from the specified page and returns it.
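The extraction step can be sketched as follows. `VirtualCache` and `fetch` are hypothetical names; `fetch` stands in for the request the master would send to a worker.

```python
# Hypothetical sketch of master-side getNext: advance the cursor, roll
# over to the next virtual page when the current one is exhausted, then
# forward the request to the worker the target page maps to.
class VirtualCache:
    def __init__(self, pages, fetch):
        self.pages = pages           # [(worker_id, page_id, row_count), ...]
        self.page_no = 0
        self.slot = -1
        self.fetch = fetch           # stand-in for the master->worker request

    def get_next(self):
        self.slot += 1               # move the cursor
        while self.page_no < len(self.pages) and \
                self.slot >= self.pages[self.page_no][2]:
            self.page_no += 1        # past the current virtual page's upper limit
            self.slot = 0
        if self.page_no >= len(self.pages):
            return None              # result set exhausted
        worker_id, page_id, _ = self.pages[self.page_no]
        return self.fetch(worker_id, page_id, self.slot)

# A toy "cluster": each worker's cache page is just a list of rows.
data = {("w1", 0): ["a", "b"], ("w2", 0): ["c"]}
vc = VirtualCache([("w1", 0, 2), ("w2", 0, 1)],
                  fetch=lambda w, p, s: data[(w, p)][s])
print([vc.get_next() for _ in range(4)])  # ['a', 'b', 'c', None]
```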
4. Sorting and extraction: the sorting strategy of this local-cache-based fast extraction system performs no complete global sort. At the worker end, once the query result set has been collected, it is sorted locally using a suitable algorithm (such as quicksort or merge sort), and the sorted result set is still stored in the files of the local cache set. At the master end, when a user needs to extract ordered data, one piece of data is fetched from each worker. If the order is ascending, the piece fetched from each worker is the minimum of that worker's data, and the smallest among these per-worker minima is the global minimum. The global ordering strategy is thus a dynamic multi-way merge: data is pulled from the workers only in the quantity to be shown to the user, rather than sorting all data at once. This satisfies the user's need for ordered query results while greatly reducing the master node's computation.
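The dynamic multi-way merge can be illustrated with Python's lazy `heapq.merge`, under the assumption that each worker's local cache holds a locally sorted run; rows are pulled only as the user asks for them, not sorted all at once.

```python
import heapq

# Hypothetical locally sorted result sets, one run per worker.
worker_runs = {
    "worker1": [1, 4, 9],
    "worker2": [2, 3, 10],
    "worker3": [5, 6, 7],
}

# heapq.merge is lazy: it repeatedly takes the smallest head among the
# runs, mirroring "pick the global minimum among per-worker minima".
merged = heapq.merge(*worker_runs.values())
first_five = [next(merged) for _ in range(5)]
print(first_five)  # [1, 2, 3, 4, 5]
```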
The design of the invention keeps all data on the worker nodes and places as many operations as possible on them, thereby avoiding single-point pressure and network transmission load at the master node and improving the performance of the distributed database during data extraction, grouping and sorting. Not only are query tasks executed on the worker nodes, but the large-scale result sets they produce also remain stored there. The global operations of grouping and sorting are likewise localized, with most of the work dispersed to the worker nodes, improving the parallelism of the whole distributed database.
The invention establishes a hierarchical mapping cache framework on top of a distributed architecture. The bottom-layer local caches store and group result set data locally, and this cache information is aggregated into an upper-layer virtual cache. Data extraction is realized through the mapping between the virtual cache and the local caches, which also supports grouping, sorting and related processing of the result set. While correctly extracting distributed database result sets, the proposed caching technique reduces the system's internal network and memory load and improves the reliability and data throughput of the distributed data system. The technique has positive application value in big-data fields such as electronic commerce, the Internet of Things and smart cities.
Drawings
FIG. 1 is a diagram of a distributed database logical structure.
FIG. 2 is a flow chart of basic query data caching.
FIG. 3 is a flow chart of basic query data extraction.
FIG. 4 is a query close flow diagram.
FIG. 5 is a flow chart of grouped query data caching.
FIG. 6 is a flow diagram of an ordered query data caching process.
FIG. 7 is a flow diagram of ordered query data extraction.
Detailed Description
The invention is further illustrated by the following specific examples and the accompanying drawings.
The basic structure of the distributed database of this embodiment is shown in FIG. 1. The system consists of a master node and a plurality of worker nodes. The contents of the modules in each node are as follows.
The master node mainly comprises a data organization interface, metadata, task management, grouping management, sorting management, cache management, an extraction interface and a virtual cache.
Data organization interface: provides data services to external users; the finally extracted data is organized into the form the user requires and returned through this interface.
Metadata: stores information about the worker nodes, basic information about database tables, and data distribution information.
Task management: creates a query task for each user request, queries the metadata to determine which worker nodes hold the required data, establishes short connections to the selected workers and sends them the task, and collects the returned results once all workers have finished executing.
Grouping management: prepares grouped caches for tasks that require grouping.
Sorting management: prepares ordered caches for tasks that require sorting.
Cache management: after obtaining the local cache information returned by the workers, establishes the corresponding mapping between the master's virtual cache and the local caches.
Extraction interface: when a user extracts data, organizes the logic according to grouping, sorting, the current cursor position and other information, and selects the data source for the next position.
Virtual cache: the master node creates a virtual cache for each task; each virtual cache set contains a plurality of virtual cache pages, each virtual cache page holds a plurality of mapping pages (mappingPages), and each mapping page is essentially a pointer to a real cache page on a designated worker.
The worker node mainly comprises a data interface, grouping management, sorting management, raw data query, the raw data set, and the local cache.
Data interface: extracts the corresponding data from the local cache according to the master's request and returns it to the master.
Grouping management: puts query results into the corresponding local cache pages according to their groups.
Sorting management: locally sorts the query results and then places them into the local cache area of the corresponding task.
Raw data query: executes the SQL-like statement of a task directly against the raw data set and hands the query results to grouping management or sorting management for further processing.
Raw data set: the data itself stored by the database, kept as formatted files on each worker node's local disk.
Local cache: stores the result set finally obtained from the local query, likewise as formatted files on the worker node's local disk. In the local cache, each task corresponds to one local cache set, and each local cache set comprises one or more cache pages; logically, the result set is stored sequentially across the cache pages. Physically, each cache page corresponds to three formatted files: a data file, an index file and a metadata file. These store the result data, the index and related information respectively, while the cache page object itself holds pointers to the three files.
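A sketch of one cache page's three files, under an assumed on-disk layout (fixed-width offset/length pairs in the index file, a row count in the metadata file); the file names and formats are illustrative, not the patent's actual formats.

```python
import os
import struct
import tempfile

IDX = struct.Struct("<QI")   # assumed index entry: 8-byte offset, 4-byte length

def write_page(dir_, rows):
    # Write one cache page: data file (concatenated rows), index file
    # (one (offset, length) entry per row), metadata file (row count).
    with open(os.path.join(dir_, "page0.dat"), "wb") as d, \
         open(os.path.join(dir_, "page0.idx"), "wb") as i:
        for row in rows:
            i.write(IDX.pack(d.tell(), len(row)))
            d.write(row)
    with open(os.path.join(dir_, "page0.meta"), "w") as m:
        m.write(str(len(rows)))

def read_row(dir_, slot):
    # Resolve the slot through the index file, then read the data file.
    with open(os.path.join(dir_, "page0.idx"), "rb") as i:
        i.seek(slot * IDX.size)
        offset, length = IDX.unpack(i.read(IDX.size))
    with open(os.path.join(dir_, "page0.dat"), "rb") as d:
        d.seek(offset)
        return d.read(length)

with tempfile.TemporaryDirectory() as tmp:
    write_page(tmp, [b"alpha", b"bravo"])
    print(read_row(tmp, 1))  # b'bravo'
```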
Query tasks fall into three types: ordinary query tasks, query tasks with grouping, and query tasks with sorting. Each type is divided into three phases: caching, extraction and closing. They are described in detail below following the flow charts. The closing phase is the same for all three types, so it is described once for the ordinary query task and not repeated for the other two.
1. Ordinary query task
The query caching flow of the ordinary query task is shown in FIG. 2 and includes the following steps:
1) The master northbound interface receives a user request, preprocesses the query request, selects workers and distributes tasks.
The method comprises the following specific steps:
a) The master northbound interface accepts the user request.
b) Create the task and assign a globally unique task ID; this ID also identifies the results returned by the workers.
c) Obtain the metadata information, including the distribution of the queried table's data across the workers. The query terminates if the metadata cannot be obtained.
d) Check whether the message format is correct and the information sufficient. The query terminates if the message does not satisfy the requirements.
e) Check the user's rights; the query terminates if they are not satisfied.
f) Select the worker nodes that need to participate in the query task according to the metadata information, and establish connections to them.
g) Create an unordered virtual cache set, ready to store the virtual cache information.
h) Issue the query task to the selected workers.
i) Wait for all workers participating in this task to return results.
j) Obtain the result set view information and return it to the user.
k) End.
2) Query operation in the worker
a) The worker receives the query task.
b) Create a local cache space identified by the task ID.
c) Execute the query to obtain the data index.
d) Extract the raw data according to the index.
e) Create a cache page.
f) Create the data file, index file and metadata file corresponding to the cache page.
g) Store the data rows into the files corresponding to the cache page, and update the corresponding index file and metadata file.
h) Obtain the cache set view.
i) Organize the cache set information and return it to the master.
3) The master southbound interface receives the information returned by the workers and establishes cache links
a) The master southbound interface receives the cache set information returned by a worker.
b) Obtain the virtual cache set.
c) Establish a virtual cache page for each real cache page according to the cache information returned from the worker.
d) Establish the link between each virtual cache page and its real cache page; thereafter, reading a value from a virtual cache page on the master maps directly to reading it from the corresponding real cache page on the worker.
e) Store the cache set information returned by the worker.
f) Judge whether all workers participating in the task have returned results: if false, go to 3a); if true, go to 1i).
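The link-establishment steps above (receive each worker's cache set information, create a virtual page per real page, record the mapping) can be sketched as follows; all names and the shape of the worker return value are assumptions for illustration.

```python
# Hypothetical sketch: one virtual cache page per reported real page,
# each recording where the real page lives so reads can be routed there.
def build_virtual_cache(worker_returns):
    """worker_returns: {worker_id: [(cache_set_id, page_id, row_count), ...]}"""
    virtual_pages = []
    for worker_id, pages in worker_returns.items():
        for cache_set_id, page_id, row_count in pages:
            virtual_pages.append({
                "worker": worker_id,       # which worker holds the real page
                "cache_set": cache_set_id,
                "page": page_id,
                "rows": row_count,
            })
    return virtual_pages

vset = build_virtual_cache({"w1": [("task42", 0, 100)],
                            "w2": [("task42", 0, 80)]})
print(len(vset), sum(p["rows"] for p in vset))  # 2 180
```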
The query extraction flow of the ordinary query task is shown in FIG. 3 and includes the following steps:
1) The master northbound interface receives a user request, preprocesses the query request, selects workers and distributes tasks.
The method comprises the following specific steps:
a) The master northbound interface accepts the user request.
b) Obtain the corresponding task according to the task ID in the request.
c) Judge whether the task is to fetch the next value: if true, go to 1d); if false, go to 1k).
d) Obtain the current data pointer, i.e., the data pointer of the virtual data set corresponding to the current task.
e) Obtain the current virtual cache page.
f) Calculate the index value within the current virtual cache page.
g) Obtain the real cache page information corresponding to the virtual cache page.
h) Obtain the connection to the worker node holding the real cache page.
i) Send the task to the worker.
j) Wait for the worker to return the result and return it directly to the user. Jump to 1n).
k) Judge whether the task is to obtain the result view: if true, go to 1l); if false, go to 1n).
l) Obtain the virtual cache set.
m) Statistically organize the virtual cache page information and return it.
n) End.
2) Query operation in the worker
a) The worker receives the query task.
b) Initialize the cache set ID, the cache page ID, and the index value within the cache page.
c) Obtain the cache set.
d) Obtain the cache page.
e) Obtain the files corresponding to the cache page.
f) Obtain the real position of the data from the index value and the content of the index file.
g) Extract the required data from the data file.
h) Return the data result to the master.
The query closing flow of the ordinary query task is shown in FIG. 4 and includes the following steps:
1) The master northbound interface receives the user request, preprocesses it, and issues the close task to the workers.
The method comprises the following specific steps:
a) The master northbound interface accepts the user request.
b) Obtain the corresponding task according to the task ID in the request.
c) Delete the virtual cache corresponding to the task.
d) Obtain the connections to the workers corresponding to the task.
e) Send the close-query task to the workers.
f) Wait for the workers' returned results.
g) Close the connections with the workers.
h) Close the task.
i) End.
2) Close operation in the worker
a) The worker receives the query close task.
b) Obtain the cache set.
c) Obtain the cache pages.
d) Close the file handles corresponding to each cache page, including those of the data file, index file and metadata file.
e) Empty the IO objects.
f) Clear the cache set index, i.e., delete the cache set corresponding to the task from the map that stores all cache sets.
g) Clear the cache page index, deleting the cache pages one by one from the cache set's map of cache pages.
h) Close the cache set's pointer, i.e., null the pointer indicating the current data position in the cache set.
i) Obtain the cache set file path, i.e., the physical path of the cache set files on the local worker machine.
j) Delete the physical files corresponding to the cache set.
k) Empty the system cache.
l) Return to the master.
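The worker-side close sequence can be sketched as follows; the cache-set layout and all names are hypothetical, with the three files per page created just to be closed and deleted.

```python
import os
import tempfile

def close_cache_set(cache_sets, task_id):
    cache_set = cache_sets.pop(task_id, None)   # clear the cache-set index
    if cache_set is None:
        return
    for page in cache_set["pages"].values():
        for f in page["files"]:                 # data, index, metadata handles
            f.close()                           # close the file handle
            os.unlink(f.name)                   # delete the physical file
    cache_set["pages"].clear()                  # clear the cache-page index
    cache_set["cursor"] = None                  # drop the cursor pointer

def make_page(dir_, name):
    # Open the three files backing one hypothetical cache page.
    return {"files": [open(os.path.join(dir_, name + ext), "wb")
                      for ext in (".dat", ".idx", ".meta")]}

tmp = tempfile.mkdtemp()
sets = {"task7": {"pages": {0: make_page(tmp, "p0")}, "cursor": (0, 0)}}
close_cache_set(sets, "task7")
print(sorted(os.listdir(tmp)))  # []
```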
2. Query task with grouping
The query caching flow of the query task with grouping is shown in FIG. 5 and includes the following steps:
1) The master northbound interface receives a user request, preprocesses the query request, selects workers and distributes tasks.
The method comprises the following specific steps:
a) The master northbound interface accepts the user request.
b) Create the task and assign a globally unique task ID; this ID also identifies the results returned by the workers.
c) Obtain the metadata information, including the distribution of the queried table's data across the workers. The query terminates if the metadata cannot be obtained.
d) Check whether the message format is correct and the information sufficient. The query terminates if the message does not satisfy the requirements.
e) Check the user's rights; the query terminates if they are not satisfied.
f) Select the worker nodes that need to participate in the query task according to the metadata information, and establish connections to them.
g) Create a grouping virtual cache set, ready to store the virtual cache information.
h) Issue the query task to the selected workers.
i) Wait for all workers participating in this task to return results.
j) Obtain the result set view information and return it to the user.
k) End.
2) Query operation in the Worker
a) The Worker receives the query task.
b) Create a local cache space identified by the task ID.
c) Execute the query to obtain the data index.
d) Extract the original data according to the index.
e) Determine whether all the original data has been traversed. If true, go to 2k); if false, go to 2f).
f) Determine whether a cache page already exists for this record's group. If true, go to 2g); if false, go to 2h).
g) Get the group's existing cache page; go to 2j).
h) Create a cache page corresponding to the group.
i) Create the data file, index file, and metadata file corresponding to the cache page.
j) Store the data row into the files corresponding to the cache page and update the corresponding index and metadata files. Go to 2e).
k) Obtain the cache set view.
l) Organize the cache set information and return it to the master.
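The worker-side grouping loop (steps 2e–2l) can be sketched roughly as follows. The in-memory "files" and field names here are hypothetical simplifications of the on-disk data, index, and metadata files described above.

```python
from collections import OrderedDict

def cache_grouped_rows(rows, group_key):
    """Write each row into the cache page for its group, creating pages on
    demand, and return a summary view for the master (hypothetical shape)."""
    pages = OrderedDict()                 # group value -> cache page
    for row in rows:                      # 2e) traverse the raw data
        g = group_key(row)
        page = pages.get(g)               # 2f) does this group have a page?
        if page is None:                  # 2h)/2i) create page and its files
            page = {"data": [], "index": [], "meta": {"group": g, "rows": 0}}
            pages[g] = page
        page["index"].append(len(page["data"]))   # 2j) update the index file
        page["data"].append(row)                  #     append to the data file
        page["meta"]["rows"] += 1                 #     update the metadata file
    # 2k)/2l) cache set view: group count plus per-group row counts
    return {"groups": len(pages),
            "counts": {g: p["meta"]["rows"] for g, p in pages.items()}}
```

For instance, grouping rows by their first field produces one page per distinct value, matching the dynamic page creation the text describes.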
3) The Master southbound interface receives the information returned by the workers and establishes cache links
a) The Master southbound interface receives the cache set information returned by a worker.
b) Obtain the virtual cache set.
c) Determine whether all the cache page information has been traversed. If true, go to 3h); if false, go to 3d).
d) Determine whether a virtual cache page already exists for the group. If true, go to 3e); if false, go to 3f).
e) Get the group's existing virtual cache page; go to 3g).
f) Create a virtual cache page corresponding to the group.
g) Add the cache page information to the virtual cache page to establish the link. Go to 3c).
h) Store the cache set information returned by the worker.
i) Determine whether all workers participating in the task have returned results. If true, go to 3a); if false, go to 1i).
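The southbound merge above (steps 3a–3i) can be sketched as follows: local cache pages belonging to the same group on different workers are linked into one virtual cache page on the master. The function name and dictionary shapes are hypothetical illustrations, not the patent's actual data structures.

```python
def merge_worker_cache_info(virtual_set, worker_id, cache_pages):
    """Merge one worker's returned page list into the virtual cache set.

    cache_pages: list of {"group": ..., "page_id": ..., "rows": ...}
    virtual_set: group value -> virtual cache page with real-page links
    """
    for info in cache_pages:                       # 3c) traverse page info
        g = info["group"]
        vpage = virtual_set.get(g)                 # 3d) virtual page exists?
        if vpage is None:                          # 3f) create it for the group
            vpage = {"group": g, "links": []}
            virtual_set[g] = vpage
        # 3g) link the real page (worker id + page id) into the virtual page
        vpage["links"].append({"worker": worker_id,
                               "page_id": info["page_id"],
                               "rows": info["rows"]})
    return virtual_set
```

Calling this once per returning worker yields exactly one virtual page per group, with every worker's real page for that group attached to it.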
The query extraction process for a query task with grouping is the same as that of an ordinary query task, as shown in fig. 3.
3. Query tasks with ordering
The query caching process for a query task with ordering is shown in fig. 6 and includes the following steps:
1) The Master northbound interface receives a user request, preprocesses the query request, selects workers, and distributes tasks.
The specific steps are as follows:
a) The Master northbound interface accepts the user request.
b) A task is created and assigned a globally unique task ID. The ID also serves as the identifier of results returned by workers.
c) Metadata information is obtained, including the distribution of the queried table across the workers. The query terminates if the metadata cannot be obtained.
d) Check whether the message format is correct and the information sufficient. The query terminates if the message does not satisfy the requirements.
e) Check the user's permissions; the query terminates if they are insufficient.
f) Select the worker nodes that need to participate in the query task according to the metadata information and establish connections to them.
g) Create an ordered virtual cache set in preparation for storing the virtual cache information. The ordered cache set contains a virtual register space that stores the current pointer and current data of each virtual cache page.
h) Issue the query task to the selected workers.
i) Wait for all workers participating in the task to return results.
j) Obtain the result set view information and return it to the user.
k) End.
2) Query operation in the Worker
a) The Worker receives the query task.
b) Create a local cache space identified by the task ID.
c) Execute the query to obtain the data index.
d) Extract the original data according to the index.
e) Sort locally.
f) Create a cache page.
g) Create the data file, index file, and metadata file corresponding to the cache page.
h) Store the data rows into the files corresponding to the cache page and update the corresponding index and metadata files.
i) Obtain the cache set view.
j) Organize the cache set information and return it to the master.
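The worker-side ordered caching (steps 2d–2j) amounts to sorting locally before writing the page's files, so the master can later merge already-sorted streams. The sketch below uses hypothetical in-memory structures in place of the data and index files; the view fields returned to the master are assumptions for illustration.

```python
def cache_sorted_rows(rows, sort_key, ascending=True):
    """Sort rows locally (step 2e), write them into a single cache page
    (steps 2f-2h), and build a summary view for the master (2i/2j)."""
    ordered = sorted(rows, key=sort_key, reverse=not ascending)  # local sort
    page = {"data": [], "index": []}        # the page and its "files"
    for row in ordered:                     # write rows, update the index
        page["index"].append(len(page["data"]))
        page["data"].append(row)
    # hypothetical view: row count plus the boundary values of the sorted run
    view = {"rows": len(ordered),
            "first": ordered[0] if ordered else None,
            "last": ordered[-1] if ordered else None}
    return page, view
```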
3) The Master southbound interface receives the information returned by the workers and establishes cache links
a) The Master southbound interface receives the cache set information returned by a worker.
b) Obtain the virtual cache set.
c) Establish a virtual cache page for each real cache page according to the cache information returned from the worker.
d) Establish the link between each virtual cache page and its real cache page; reading a value from a virtual cache page on the master is then equivalent to reading it from the corresponding real cache page on the corresponding worker.
e) Store the cache set information returned by the worker.
f) Determine whether all workers participating in the task have returned results. If true, go to 3a); if false, go to 1i).
The query extraction flow for a query task with ordering is shown in fig. 7 and includes the following steps:
1) The Master northbound interface receives a user request, preprocesses the query request, selects workers, and distributes tasks.
The specific steps are as follows:
a) The Master northbound interface accepts the user request.
b) Obtain the corresponding task according to the task id in the received request.
c) Determine whether the task is to take the next value. If true, go to 1d); if false, go to 1v).
d) Obtain the virtual cache set of the task.
e) Determine whether the virtual register space in the cache set is empty. If true, go to 1f); if false, go to 1m).
f) Determine whether all the virtual cache pages have been traversed. If true, go to 1m); if false, go to 1g).
g) Extract the value at index position 0 from the current cache page.
h) Obtain the real cache page information corresponding to the virtual cache page.
i) Obtain the connection to the worker node where the real cache page resides.
j) Send the task to the worker.
k) Wait for the worker to return the result and return it directly to the user.
l) Create a virtual register object for the result and add it to the virtual cache set space. Go to 1r).
m) Determine whether the ordering is ascending. If true, go to 1n); if false, go to 1o).
n) Obtain the non-empty minimum value data in the virtual register space. Go to 1p).
o) Obtain the non-empty maximum value data in the virtual register space.
p) Update the pointer of the virtual cache page from which the value was extracted: index + 1.
q) Obtain the connection to the worker node where the real cache page resides.
r) Send the task to the worker.
s) Wait for the worker to return the result.
t) Update the extracted virtual register space.
u) Return the obtained result to the user. Go to 1y).
v) Determine whether the task is to obtain the result view. If true, go to 1w); if false, go to 1y).
w) Obtain the virtual cache set.
x) Statistically organize the virtual cache page information and return it.
y) End.
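The "next value" selection above (steps 1m–1u) is in effect a k-way merge: each virtual cache page keeps its current value in a virtual register, the next result is the minimum (ascending) or maximum (descending) across the registers, and only the chosen page's pointer advances before its register is refilled from the owning worker. The following sketch illustrates that merge step; the `fetch` callback is a hypothetical stand-in for the worker round-trip.

```python
def next_value(registers, pointers, fetch, ascending=True):
    """Return the next value in global order, or None when exhausted.

    registers: page_id -> current value, or None if the page is exhausted
    pointers:  page_id -> next index to request from that page
    fetch(page_id, index) -> value at that index, or None past the end
    """
    live = {p: v for p, v in registers.items() if v is not None}
    if not live:
        return None                              # every page is exhausted
    pick = min if ascending else max             # 1m-1o) choose extreme value
    page_id = pick(live, key=live.get)
    result = live[page_id]
    pointers[page_id] += 1                       # 1p) advance that page's pointer
    registers[page_id] = fetch(page_id, pointers[page_id])  # 1q-1t) refill
    return result                                # 1u) return to the user
```

Merging pages holding [1, 4] and [2, 3] this way yields 1, 2, 3, 4 in ascending order while moving only one worker-side cursor per extracted value.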
2) Query operation in the Worker
a) The Worker receives the query task.
b) Initialize the cache set id, the cache page id, and the index value within the cache page.
c) Obtain the cache set.
d) Obtain the cache page.
e) Obtain the files corresponding to the cache page.
f) Obtain the real position of the data according to the index value and the content of the index file.
g) Extract the required data from the data file.
h) Return the data result to the master.
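The worker-side point read (steps b–g) can be sketched with an index file that stores one fixed-width byte offset per row, so a record is located with two seeks. The on-disk layout chosen here (8-byte big-endian offsets, one JSON document per line) is an assumption made for the example, not the patent's actual file format.

```python
import json
import struct

OFFSET_SIZE = 8  # bytes per index entry (assumed layout)

def write_records(data_path, index_path, rows):
    # Companion writer: append each row as a JSON line to the data file
    # and record its starting byte offset in the index file.
    with open(data_path, "wb") as dat, open(index_path, "wb") as idx:
        for row in rows:
            idx.write(struct.pack(">Q", dat.tell()))
            dat.write((json.dumps(row) + "\n").encode("utf-8"))

def read_record(data_path, index_path, row_index):
    with open(index_path, "rb") as idx:
        idx.seek(row_index * OFFSET_SIZE)        # f) locate the real position
        raw = idx.read(OFFSET_SIZE)
        if len(raw) < OFFSET_SIZE:
            return None                          # row index out of range
        (offset,) = struct.unpack(">Q", raw)
    with open(data_path, "rb") as dat:
        dat.seek(offset)
        return json.loads(dat.readline().decode("utf-8"))  # g) extract record
```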
The above embodiments are merely illustrative of the present invention and are not restrictive; those skilled in the art may modify the technical solutions of the present invention or substitute equivalents without departing from its spirit and scope, which shall be determined by the claims.

Claims (8)

1. A distributed database result caching method based on hierarchical mapping comprises the following steps:
the first step is as follows: the method comprises the following steps of establishing a hierarchical mapping cache system based on a distributed architecture, wherein the hierarchical mapping cache system comprises two parts: the cache management system comprises a remote virtual cache deployed at a central node, namely a master node, and a local cache deployed at a storage node, namely a worker node;
the second step is that: the remote virtual cache of the master node comprises a plurality of virtual cache pages, and the relevant information of the local cache of the worker node is stored in the virtual cache pages, namely the mapping relation between the remote virtual cache page of the master node and the local cache page of the worker node is established; the local cache of the worker node comprises a plurality of local cache pages, and the entity data is stored in different local cache pages of the worker node;
the third step: when a query request is executed, a query task is placed on a worker node to be executed, the worker node executes query operation locally to obtain a result data set and stores the result data set in a local cache of the worker node, and then only relevant information of the local cache is returned to a master node instead of the result data set obtained by query;
the fourth step: converting a data extraction task at a master node end into a corresponding data extraction task at a worker node according to the mapping relation between the remote virtual cache page of the master node and the local cache page of the worker node, so as to realize data extraction;
if the query result needs to be grouped, storing data grouping information and grouping cache mapping information of the worker nodes in the virtual cache of the master node; in the process of writing local cache data, the worker node dynamically establishes cache pages by group according to the operation request, and data records with different group values are written into the cache pages corresponding to their groups; when the worker node returns the user's query result, the group of each data record is checked: if a local cache page corresponding to the group exists, the record is appended directly to that local cache page; if not, a new local cache page is created to hold the record; after storage is finished, the worker node returns summary information of the cache set to the master node, including the number of groups and the data of each group;
the Master node creates virtual cache pages according to the information of the cache set, each virtual cache page corresponds to one group, and local cache pages corresponding to the same group in different worker nodes are mapped to the same virtual cache page in the Master node;
if the current query operation has a sorting requirement, sorting the data in one cache page according to the corresponding sorting value in the data writing process of the third step; after the worker node collects the query result set, the query result set is directly sorted locally, and after sorting is completed, the result set is still stored in a file corresponding to the local cache set.
2. The method of claim 1, wherein the steps for extracting result set data without an ordering requirement are:
the first step is as follows: the master node receives the data extraction request, firstly extracts the corresponding virtual cache and then acquires the cursor of the virtual cache;
the second step is that: adding one to the cursor, and judging whether the cursor overflows the upper data limit of the current virtual page according to the total amount of data records in the virtual page; if not, taking the current virtual page as a target page, otherwise, setting the next virtual page as the target page; returning an error message if no next virtual page exists;
the third step: extracting a virtual mapping pointer from the current target page, calculating a worker mapping pointer corresponding to the current cursor position, binding a corresponding worker node through the pointer, and sending a data extraction request to the worker;
the fourth step: after receiving the request, the worker changes the local cursor position according to the cursor information and the paging information, extracts the corresponding data record from the local cache and returns the data record;
the fifth step: and returning after the master node receives the result data.
3. The method of claim 2, wherein during the data extraction process, the master node determines the virtual page corresponding to the current cursor according to the position of the moving cursor and the number of data records in the virtual page; and after the virtual paging corresponding to the current cursor is determined, calculating a worker pointer corresponding to the position and the cursor position in the worker according to the cursor value.
4. The method of claim 1, wherein the steps for extracting result set data with an ordering requirement are:
the first step is as follows: the master firstly establishes a data window in a virtual cache;
the second step is that: the master receives an extraction request and judges whether a data record exists in the current data window; if no data record exists, it sends an extraction request to the worker ends, extracting one locally-sorted data record from each worker's local cache and advancing that worker's cursor position; after receiving the data records, the master end sorts them within the data window and sets the cursor position in the window to the initial position;
the third step: the master changes the cursor position in the window and extracts the data record of the corresponding position from the window and returns the data record;
the fourth step: and sequentially extracting data from the master end until all data are extracted.
5. The method of claim 4, wherein the data set ordering is not done on the master side but locally on the worker side; the master end establishes a data window for the data extraction request; in the initial stage of the window, the master sends a data extraction request to all the workers and extracts one data record from each worker's local cache.
6. The method of claim 5, wherein the master terminal, upon receiving the data, sorts it in the window; the master changes the cursor position of the window according to the data extraction request and extracts the data record of the corresponding position from the window; and after the master end finishes the data extraction of the current window, the next data extraction operation is continuously executed from the worker end, the positions of the data queue and the cursor in the window are updated, and the data extraction operation is responded until the data records in all the workers are completely extracted.
7. The method of claim 1, wherein the local cache establishes a disk IO file cache for each cache page in order to reduce memory usage, and data records are written into the file cache in sequence.
8. The method of claim 1, wherein a mapping cache corresponding to the local file cache is established in the worker node memory, loading a fixed number of data records at a time; in the data reading process, if the row number of the requested data record falls within the upper and lower boundaries of the current mapping cache, the record is extracted directly from memory; otherwise, the corresponding data records are loaded from the file cache into memory, completing the extraction of the record with the requested row number.
CN201610966316.5A 2016-10-28 2016-10-28 Distributed database result caching method based on hierarchical mapping Active CN106570145B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610966316.5A CN106570145B (en) 2016-10-28 2016-10-28 Distributed database result caching method based on hierarchical mapping

Publications (2)

Publication Number Publication Date
CN106570145A CN106570145A (en) 2017-04-19
CN106570145B true CN106570145B (en) 2020-07-10

Family

ID=58536038


Country Status (1)

Country Link
CN (1) CN106570145B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107368583A (en) * 2017-07-21 2017-11-21 郑州云海信息技术有限公司 A kind of method and system of more cluster information inquiries
CN108984122A (en) * 2018-07-05 2018-12-11 柏建民 Mapping formula remotely stores operating technology
US11429611B2 (en) 2019-09-24 2022-08-30 International Business Machines Corporation Processing data of a database system
US10719517B1 (en) * 2019-12-18 2020-07-21 Snowflake Inc. Distributed metadata-based cluster computing
CN116303140B (en) * 2023-05-19 2023-08-29 珠海妙存科技有限公司 Hardware-based sorting algorithm optimization method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1595905A (en) * 2004-07-04 2005-03-16 华中科技大学 Streaming media buffering proxy server system based on cluster
CN102521406A (en) * 2011-12-26 2012-06-27 中国科学院计算技术研究所 Distributed query method and system for complex task of querying massive structured data
CN102521405A (en) * 2011-12-26 2012-06-27 中国科学院计算技术研究所 Massive structured data storage and query methods and systems supporting high-speed loading
CN104504001A (en) * 2014-12-04 2015-04-08 西北工业大学 Massive distributed relational database-oriented cursor creation method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Wu Xian, "Research on Query Optimization of Distributed Databases" (分布式数据库查询优化的研究), China Master's Theses Full-text Database, Information Science and Technology, No. S2, Dec. 15, 2011, pp. I138-946 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant