NZ622485B - Shared cache used to provide zero copy memory mapped database - Google Patents
Info
- Publication number
- NZ622485B
- Authority
- NZ
- New Zealand
- Prior art keywords
- shared cache
- data
- memory
- data set
- applications
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/0813—Multiuser, multiprocessor or multiprocessing cache systems with a network or matrix configuration
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/084—Multiuser, multiprocessor or multiprocessing cache systems with a shared cache
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/10—Address translation
- G06F12/1009—Address translation using page tables, e.g. page table structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24552—Database cache management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/10—Providing a specific technical effect
- G06F2212/1016—Performance improvement
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/60—Details of cache memory
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/65—Details of virtual memory and virtual address translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5011—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
- G06F9/5016—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
Abstract
Disclosed is a method for providing a plurality of applications with concurrent access to data in a cache. The method includes identifying a plurality of attributes of an expected data set to be accessed concurrently by the plurality of applications and allocating a memory space for a shared cache. The shared cache comprises a column data store configured to store data for each of the plurality of attributes of the expected data set in columns. The method also includes retrieving the expected data set from a database and populating the shared cache with the expected data set. The method further includes storing memory address locations corresponding to the columns of the column data store of the shared cache for access by the plurality of applications. Each application generates a memory map from memory locations in a virtual address space of each respective application to the stored address memory locations.
Description
SHARED CACHE USED TO PROVIDE ZERO COPY MEMORY MAPPED
DATABASE
FIELD OF THE INVENTION
Embodiments of the invention generally relate to data analysis and, more
specifically, to techniques for providing a shared cache as a zero copy memory mapped
database.
DESCRIPTION OF THE RELATED ART
The reference in this specification to any prior publication (or information derived
from it), or to any matter which is known, is not, and should not be taken as an
acknowledgment or admission or any form of suggestion that that prior publication (or
information derived from it) or known matter forms part of the common general
knowledge in the field of endeavour to which this specification relates.
Some programming languages provide an execution environment that includes
memory management services for applications. That is, the execution environment
manages application memory usage. The operating system provides each process,
including the execution environment, with a dedicated memory address space. The
execution environment assigns a memory address space to execute the application. The
total addressable memory limits how many processes may execute concurrently and how
much memory the operating system may provide to any given process.
In some data analysis systems, applications perform queries against a large
common data set, e.g. an application that performs financial analyses on a common
investment portfolio. In such a case, the financial analysis application may repeatedly load
portions of the entire data set into the application's memory or the application may load the
entire expected data set. Frequently, even if multiple applications analyze the same data
set, the data is loaded into the memory address space of each application. Doing so takes
time and system resources, which increases system latency and affects overall system
performance. The amount of memory in a system limits the number of execution
environment processes that can run concurrently with memory address space sizable
enough to allow the application to load an entire expected data set.
The scalability of the system is limited as the expected data set grows, because the
system has to either reduce the number of applications that can run concurrently or
increase the rate at which portions of the expected data set must be loaded, causing overall
system performance to degrade.
SUMMARY
One embodiment of the invention includes a method for a plurality of applications
to access a data set concurrently. This method may generally include identifying a
plurality of attributes of an expected data set to be accessed concurrently by the plurality of
applications and allocating a memory space for a shared cache. The shared cache
comprises a column data store configured to store data for each of the plurality of attributes
of the expected data set in columns. This method may further include retrieving the
expected data set from a database, populating the shared cache with the expected data set;
and storing memory address locations corresponding to the columns of the column data
store of the shared cache for access by the plurality of applications. Each application
generates a memory map from memory locations in a virtual address space of each
respective application to the stored address memory locations.
Typically the plurality of applications access the data set by:
requesting data from one or more of the memory locations in the virtual address
space allocated to the application;
mapping memory locations in the virtual address space to corresponding memory
address locations of the shared cache, via the memory map; and
accessing the requested data from the mapped memory locations in the shared
cache.
Typically storing the data of each of the plurality of attributes of the expected data
set in columns, comprises:
dividing one or more data records retrieved from the database into a plurality of
attribute values;
identifying a contiguous memory location for each attribute; and
storing each attribute value in one of the identified contiguous memory locations.
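The three storing steps above can be sketched as follows. This is an illustrative sketch only, not the claimed implementation: the record shape and attribute names are invented, and Python stands in for whatever language the system uses.

```python
# Sketch of the three storing steps: divide each retrieved record into
# attribute values, identify one contiguous region per attribute, and store
# each value in its attribute's region. Records and attributes are made up.
import array

records = [("alice", 30), ("bob", 25)]  # rows retrieved from a database

# One contiguous memory location (here, a list/array) per attribute.
names = []               # variable-width attribute values
ages = array.array("i")  # fixed-width attribute values stored contiguously

for record in records:
    name, age = record   # divide the record into its attribute values
    names.append(name)
    ages.append(age)

print(names)       # ['alice', 'bob']
print(list(ages))  # [30, 25]
```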
Typically a plurality of object-oriented representations of the expected data set are
provided to a plurality of applications by:
initializing a plurality of objects with data access methods;
storing the memory address locations corresponding to the columns of the column
data store of the shared cache in the objects; and
providing the plurality of objects for access by the plurality of applications.
Typically at least one of the plurality of applications accesses the data set by:
calling a data access method of one of the objects; and
receiving the requested data from the data access method of the object.
Typically the shared cache is configured to provide read only access to the plurality
of applications.
Typically the shared cache is updated by:
identifying a plurality of attributes of an updated expected data set;
re-allocating the memory space for the shared cache;
retrieving an updated expected data set from the database;
populating the shared cache with the updated expected data set; and
storing a plurality of address references to the columns of the column data store in
the shared cache.
Other embodiments of the present invention include, without limitation, a
computer-readable storage medium including instructions that, when executed by a
processing unit, cause the processing unit to implement aspects of the approach described
herein as well as a system that includes different elements configured to implement aspects
of the approach described herein.
Thus, the invention seeks to provide a computer-readable storage medium storing
instructions that, when executed by a processor, cause the processor to perform an
operation for providing a plurality of applications with concurrent access to data, the
operation comprising:
identifying a plurality of attributes of an expected data set to be accessed
concurrently by the plurality of applications;
allocating a memory space for a shared cache, wherein the shared cache comprises
a column data store configured to store data for each of the plurality of attributes of the
expected data set in columns;
retrieving the expected data set from a database;
populating the shared cache with the expected data set; and
storing memory address locations corresponding to the columns of the column data
store of the shared cache for access by the plurality of applications, wherein each
application generates a memory map from memory locations in a virtual address space of
each respective application to the stored address memory locations.
Typically the applications access the data set by:
requesting data from one or more of the memory locations in the virtual address
space allocated to the application;
mapping memory locations in the virtual address space to corresponding memory
address locations of the shared cache, via the memory map; and
accessing the requested data from the mapped memory locations in the shared
cache.
Typically storing the data of each of the plurality of attributes of the expected data
set in columns, comprises:
dividing one or more data records retrieved from the database into a plurality of
attribute values;
identifying a contiguous memory location for each attribute; and
storing each attribute value in one of the identified contiguous memory locations.
Typically a plurality of object-oriented representations of the expected data set are
provided to a plurality of applications by:
initializing a plurality of objects with data access methods;
storing the memory address locations corresponding to the columns of the column
data store of the shared cache in the objects; and
providing the plurality of objects for access by the plurality of applications.
Typically at least one of the plurality of applications accesses the data set by:
calling a data access method of one of the objects; and
receiving the requested data from the data access method of the object.
Typically the shared cache is configured to provide read only access to the plurality
of applications.
Typically the shared cache is updated by:
identifying a plurality of attributes of an updated expected data set;
re-allocating the memory space for the shared cache;
retrieving an updated expected data set from the database;
populating the shared cache with the updated expected data set; and
storing a plurality of address references to the columns of the column data store in
the shared cache.
The invention further seeks to provide a computer system, comprising:
a memory; and
a processor executing one or more programs configured to perform an
operation for providing a plurality of applications with concurrent access to data, the
operation comprising:
identifying a plurality of attributes of an expected data set to be accessed
concurrently by the plurality of applications;
allocating a memory space for a shared cache, wherein the shared cache comprises
a column data store configured to store data for each of the plurality of attributes of the
expected data set in columns;
retrieving the expected data set from a database;
populating the shared cache with the expected data set; and
storing memory address locations corresponding to the columns of the column data
store of the shared cache for access by the plurality of applications, wherein each
application generates a memory map from memory locations in a virtual address space of
each respective application to the stored address memory locations.
Typically the plurality of applications access the data set by:
requesting data from one or more of the memory locations in the virtual address
space allocated to the application;
mapping memory locations in the virtual address space to corresponding memory
address locations of the shared cache, via the memory map; and
accessing the requested data from the mapped memory locations in the shared
cache.
Typically storing the data of each of the plurality of attributes of the expected data
set in columns, comprises:
dividing one or more data records retrieved from the database into a plurality of
attribute values;
identifying a contiguous memory location for each attribute; and
storing each attribute value in one of the identified contiguous memory locations.
Typically a plurality of object-oriented representations of the expected data set are
provided to a plurality of applications by:
initializing a plurality of objects with data access methods;
storing the memory address locations corresponding to the columns of the column
data store of the shared cache in the objects; and
providing the plurality of objects for access by the plurality of applications.
Typically at least one of the plurality of applications accesses the data set by:
calling a data access method of one of the objects; and
receiving the requested data from the data access method of the object.
Typically the shared cache is configured to provide read only access to the plurality
of applications.
Typically the shared cache is updated by:
identifying a plurality of attributes of an updated expected data set;
re-allocating the memory space for the shared cache;
retrieving an updated expected data set from the database;
populating the shared cache with the updated expected data set; and
storing a plurality of address references to the columns of the column data store in
the shared cache.
Advantageously, the method stores a single instance of the expected data set in
memory, so each application does not need to create an additional instance of the expected
data set. Therefore, larger expected data sets may be stored in memory without limiting
the number of applications running concurrently.
Further, the method may arrange the expected data set in the shared cache for
efficient data analysis. For instance, the method may arrange the expected data set in
columns, which facilitates aggregating subsets of the expected data set.
BRIEF DESCRIPTION OF THE DRAWINGS
So that the manner in which the features of the present invention recited above can
be understood in detail, a more particular description of the invention, briefly summarized
above, may be had by reference to embodiments, some of which are illustrated in the
appended drawings. It is to be noted, however, that the appended drawings illustrate only
typical embodiments of this invention and are therefore not to be considered limiting of its
scope, for the invention may admit to other equally effective embodiments.
Figure 1 is a block diagram illustrating a computer system configured to implement
one or more aspects of the present invention.
Figure 2 illustrates an example computing environment, according to one
embodiment.
Figure 3 is a block diagram of the flow of data in the application server of Figure 1,
according to one embodiment.
Figure 4 illustrates a column store in the shared cache, according to one
embodiment.
Figure 5 illustrates a method for setting up the shared cache, according to one
embodiment.
Figure 6 illustrates a method for retrieving data from the shared cache, according to
one embodiment.
DETAILED DESCRIPTION
Embodiments of the invention provide a shared cache as a zero copy memory
mapped database. Multiple applications access the shared cache concurrently. In one
embodiment, the shared cache is a file that each application maps into the virtual memory
address space of that application. Doing so allows multiple applications to access the
shared cache simultaneously. Note, in the present context, an expected data set generally
refers to records from a database repository designated to be loaded into the shared cache.
A process, referred to herein as a synchronizer, populates, and in some cases updates, a
data structure storing the expected data set in the shared cache. To access the shared
cache, each running application maps the shared cache into a virtual memory address space
of the execution environment in which the application runs. The mapping translates virtual
memory addresses (in a user address space) to memory addresses in the shared cache (the
system address space). In one embodiment, the applications only read data from the data
stored in the shared cache. As a result, applications can access the data concurrently
without causing conflicts.
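As a rough illustration of this arrangement (not the patented implementation itself), the sketch below uses Python's `mmap` to map one cache file read-only from two views standing in for two applications; the file name and the fixed 64-bit integer layout are assumptions made for the example.

```python
# Hypothetical sketch: a synchronizer process writes the expected data set to
# a cache file, and each "application" maps that file read-only into its own
# virtual address space, so no per-application copy of the data is made.
import mmap
import os
import struct
import tempfile

CACHE_PATH = os.path.join(tempfile.gettempdir(), "shared_cache.bin")  # assumed name

def synchronize(values):
    """Populate the shared cache file with 64-bit integers (the synchronizer role)."""
    with open(CACHE_PATH, "wb") as f:
        f.write(struct.pack(f"<{len(values)}q", *values))

def open_shared_cache():
    """Map the cache read-only; the OS shares the backing pages between mappers."""
    f = open(CACHE_PATH, "rb")
    return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

synchronize([10, 20, 30])
app_a = open_shared_cache()  # first application's view
app_b = open_shared_cache()  # second application's view
# Both views decode the same underlying pages; nothing is copied per reader.
print(struct.unpack("<3q", app_a[:24]))  # (10, 20, 30)
print(struct.unpack("<3q", app_b[:24]))  # (10, 20, 30)
```

Because both mappings are read-only, the two views can be read concurrently without coordination, mirroring the conflict-free access described above.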
In one embodiment, the data structure is a column data store in which data from the
database repository is stored contiguously in columns. The applications analyze data
entities called models. Models include a combination of data attributes from a database
repository and different types of models include different data attributes. The expected
data set includes several different types of models. The synchronizer arranges the column
data store to include a column for every data attribute of the models included in the
expected data set. Application developers build the applications upon data access methods
that abstract the interactions with the actual columns of the column data store, so that
application developers can readily access the data of a model without regard for the
underlying data structure. The columns allow efficient aggregation of the data, because as
an application iterates through a data attribute of a group of models, a data access method
simply reads sequential entries in a column. For example, an expected data set may
include personal checking account models. In such a case, the column data store would
include the data of the personal checking account models in columns, such as a column for
account balances, a column for account numbers, and a column for recent transactions.
The application accesses the columns of data through calls to data access methods.
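A minimal sketch of that idea follows; the attribute names are invented for the personal-checking-account example, and Python stands in for the data access layer. Each attribute lives in its own contiguous column, and the application touches data only through access methods.

```python
# Hypothetical column data store for "personal checking account" models:
# one contiguous array per attribute, plus data access methods that hide
# the column layout from the application developer.
import array

class ColumnStore:
    def __init__(self, records):
        # Divide each record into attribute values and append each value
        # to the contiguous column for its attribute.
        self.account_numbers = array.array("q")  # account numbers column
        self.balances = array.array("d")         # account balances column
        for account_number, balance in records:
            self.account_numbers.append(account_number)
            self.balances.append(balance)

    # Data access methods: applications call these, not the raw columns.
    def balance_of(self, index):
        return self.balances[index]

    def total_balance(self):
        # Aggregation reads sequential entries in a single column.
        return sum(self.balances)

store = ColumnStore([(1001, 250.0), (1002, 75.5), (1003, 1200.0)])
print(store.balance_of(1))    # 75.5
print(store.total_balance())  # 1525.5
```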
In the following description, numerous specific details are set forth to provide a
more thorough understanding of the present invention. However, it will be apparent to one
of skill in the art that the present invention may be practiced without one or more of these
specific details.
Figure 1 is a block diagram illustrating an example data analysis system 100,
according to one embodiment. As shown, the data analysis system 100 includes a client
computing system 105, a client 110, a server computer system 115, an application server
120, and a database repository 150. The client 110 runs on the client computing system
105 and requests data analysis activities from the application server 120 that performs the
data analysis activities at a server computing system 115 on data retrieved from the
database repository 150.
The client 110 translates user inputs into requests for data analysis by the
application server 120. The client 110 runs on computing systems connected to the server
computing system 115 over a network. For example, the client 110 may be a dynamic web
page in a browser or a web-based Java application running on a client computing system
105. Alternatively, the client 110 may run on the same server computing system 115 as
the application server 120. In any event, a user interacts with the data analysis system 100
through client 110.
The application server 120 performs the analysis upon data read from the database
repository 150. A network connects the database repository 150 and the server computing
system 115. The database repository 150 stores records of data. For example, the database
repository 150 may be a Relational Database Management System (RDBMS) storing data
as rows in relational tables. Alternatively, the database repository 150 may exist on the
same server computing system 115 as the application server 120.
In one embodiment, a user sets up an application server 120 with an expected data
set. Once configured, the expected data set is made available to multiple clients 110 for
analysis.
Figure 2 illustrates an example server computing system 115 configured with a
shared cache 228, according to one embodiment. The shared cache 228 provides
applications 222 running in execution environments 221 with concurrent access to data
stored in the shared cache 228. As shown, the server computing system 115 includes,
without limitation, a central processing unit (CPU) 250, a network interface 270, a memory
220, and a storage 230, each connected to an interconnect (bus) 240. The server
computing system 115 may also include an I/O device interface 260 connecting I/O
devices 280 (e.g., keyboard, display, and mouse devices) to the computing system 115.
Further, in context of this disclosure, the computing elements shown in server computing
system 115 may correspond to a physical computing system (e.g., a system in a data
center) or may be a virtual computing instance executing within a computing cloud.
The CPU 250 retrieves and executes programming instructions stored in memory
220 as well as stores and retrieves application data residing in memory 220. The bus 240
is used to transmit programming instructions and application data between the CPU 250,
I/O device interface 260, storage 230, network interface 270, and memory 220. Note, CPU
250 is included to be representative of a single CPU, multiple CPUs, a single CPU having
multiple processing cores, a CPU with an associated memory management unit, and the
like. The memory 220 is generally included to be representative of a random access
memory. The storage 230 may be a disk drive storage device. Although shown as a single
unit, the storage 230 may be a combination of fixed and/or removable storage devices,
such as fixed disc drives, removable memory cards, optical storage, network attached
storage (NAS), or a storage area network (SAN).
The requests for data analyses and the results of data analyses are transmitted
between the client 110 and the applications 222 over the network via the network interface
270. Illustratively, the memory 220 includes applications 222 running in execution
environments 221, a synchronizer 225, and a shared cache 228. The applications 222
perform data analyses using data from the shared cache 228. Prior to performing a data
analysis, the synchronizer 225 initializes the shared cache 228 with data retrieved from the
database repository 150. For example, the synchronizer 225 may issue database queries
over the network to the database repository 150 via the network interface 270. Once the
synchronizer 225 initializes (or updates) the shared cache 228, an application 222 maps the
shared cache 228 into the virtual address space local to the execution environment 221 of
the application 222. This memory mapping allows the application 222 to access the shared
cache 228 and read the data from the shared cache 228. When other applications 222 also
map the shared cache into the virtual address space local to the execution environment 221
of the applications 222, then the applications 222 may concurrently access the shared
cache 228.
Although shown in memory 220, the shared cache 228 may be stored in memory
220, storage 230, or split between memory 220 and storage 230. Further, although shown
as a single element, the shared cache 228 may be divided or duplicated.
In some embodiments, the database repository 150 may be located in the storage
230. In such a case, the database queries and subsequent responses are transmitted over
the bus 240. As described, the client 110 may also be located on the server computing
system 115, in which case the client 110 would also be stored in memory 220 and the user
would utilize the I/O devices 280 to interact with the client 110 through the I/O device
interface 260.
Figure 3 illustrates a flow of data as multiple applications 222 concurrently access
the shared cache 228 on the application server 120, according to one embodiment. As
shown, the application server 120 includes the synchronizer 225, shared cache 228, and
applications 222 running in execution environments 221, and a memory map 315 for each
execution environment 221. Further, each application 222 accesses one or more models
310.
The application 222 analyzes models 310 that include a combination of data
attributes from the database repository 150. To set up the shared cache 228 for the
applications 222, the synchronizer 225 reads data from the database repository 150. The
synchronizer 225 writes data to the shared cache 228. As it writes the data to the shared
cache 228, the synchronizer 225 organizes the data according to a data structure. For
example, the synchronizer may organize the data into a column data store for efficient data
access. The synchronizer 225 provides address references to the shared cache 228 that the
applications 222 use for accessing the data of the models 310 in the data structure of the
shared cache 228.
In one embodiment, the operating system of the server computing system 115
manages the memory map 315 to the shared cache 228. The memory map 315 maps a
virtual address space local to each execution environment 221 to physical memory
addresses in the shared cache 228. The address space of each execution environment 221
is a range of virtual memory locations. The virtual memory locations appear to the
execution environment 221 as one large block of contiguous memory addresses. The
memory map 315 contains a table of virtual memory locations and corresponding physical
memory locations. The virtual memory locations are mapped to the physical memory
locations in either memory 220 or storage 230 by looking up the virtual memory location
in the memory map 315 and retrieving the corresponding physical memory location.
When an application reads data from the virtual address space, a memory map 315
translates a memory address from the virtual address space to the physical address space.
Specifically, the application receives the data from the physical memory location in the
address space of the shared cache 228.
The application 222, the execution environment 221, the operating system, or any
other component responsible for translating memory addresses may create this mapping.
For example, an application 222 may be a Java® application running in the Java Virtual
Machine (JVM) execution environment. In such a case, the operating system provides the
JVM virtual memory address space to execute the Java application for data analysis. The
JVM runs the Java application in a portion of the virtual memory address space, called the
heap. Once created, the memory map 315 maps a portion of the remaining virtual memory
address locations to physical memory locations in the address space of the shared cache
228. When multiple JVMs run Java applications for data analysis on the same application
server 120, the memory maps 315 all map to the same shared cache 228, providing
concurrent access.
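The translation performed by the memory map 315 can be illustrated with a toy page table; the page size and page numbers here are invented for the example and do not come from the patent. Two environments map different virtual pages onto the same physical cache pages, so both see one copy of the data.

```python
# Toy illustration of the memory map 315: a table translating virtual page
# numbers in an execution environment's address space to the physical page
# numbers backing the shared cache. All numbers are made up.
PAGE_SIZE = 4096

class MemoryMap:
    def __init__(self, table):
        self.table = table  # virtual page -> physical page in shared cache

    def translate(self, virtual_addr):
        # Split the address into a page number and an offset within the page,
        # then look the page up in the mapping table.
        page, offset = divmod(virtual_addr, PAGE_SIZE)
        return self.table[page] * PAGE_SIZE + offset

# Two execution environments map different virtual pages to the SAME
# physical pages, so both resolve to one copy of the cache.
env_a = MemoryMap({0: 7, 1: 8})
env_b = MemoryMap({5: 7, 6: 8})
print(env_a.translate(100))                  # 7 * 4096 + 100 = 28772
print(env_b.translate(5 * PAGE_SIZE + 100))  # same physical address: 28772
```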
Figure 4 illustrates an example of the shared cache 228 configured as a column data
store 410, according to one embodiment of the present invention. As shown, the shared
cache 228 includes the column data store 410, which includes columns 440. An
application 222 accesses the data of a model 310 from the columns 440 that correspond to
the attributes of the model 310. An analysis based upon aggregating a particular attribute
of many models 310 of the same type may access a particular column 440(0) that
corresponds to the attribute instead of all columns 440 that correspond to that type of
model 310. Note, the synchronizer 225 may arrange the columns 440 for a particular type
of model 310 together or according to a number of different designs.
In one embodiment, a user configures the data analysis system 100 for analyzing
data of a given domain by selecting types of models 310 to analyze. The models 310
include data attributes from the database repository 150, so the synchronizer 225 retrieves
the database records to populate the column data store 410 based upon the selected models
310. The synchronizer 225 creates the column data store 410 to include a column 440 for
each attribute of the selected models 310.
For example, assume a model 310 representing a home mortgage is composed of
three attributes, such as the bank name, loan amount, and the mortgage issue date. In such
a case, the synchronizer 225 would query the database repository 150 for the data to build
three columns 440 in the column data store 410. The first column 440(0) would include a
list of bank names, the second column 440(1) would include the loan amounts, and the last
column 440(C-1) would include the mortgage issue dates. Depending on the organization
of the database repository 150, a model 310 may include data from a single record in a
table in the database repository 150, data from multiple tables in the database repository
150, or aggregated data from multiple records in the database repository 150.
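The three-column mortgage layout above can be sketched as parallel arrays, with record i stored at index i of every column. All data values here are invented for illustration:

```java
// Hypothetical data for the mortgage example: one column (array) per
// attribute, with record i stored at index i of every column.
public class MortgageColumns {
    static String[] bankNames   = {"First Bank", "Acme Savings", "First Bank"};
    static long[]   loanAmounts = {250_000L, 410_000L, 185_000L};
    static String[] issueDates  = {"2010-06-01", "2011-02-15", "2012-09-30"};

    // Reconstruct model i by reading the same index across all columns.
    static String modelAt(int i) {
        return bankNames[i] + ", " + loanAmounts[i] + ", " + issueDates[i];
    }

    public static void main(String[] args) {
        System.out.println(modelAt(1)); // prints Acme Savings, 410000, 2011-02-15
    }
}
```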
An application 222 accesses the data of a model 310 by reading the data values at
equal indexes across the columns 440 of the model. Alternatively, the application may
iterate through one attribute of a group of models, which involves reading sequential
entries in a single column 440.
In the example of a model 310 representing a home mortgage, the application 222
may call a data access method to create an aggregate of an attribute of the model 310, such
as the loan amounts attribute. The data access method would read sequential entries in the
second column 440(1) that includes the loan amounts. The data access method only needs
to find, read, and aggregate the entries in the one column 440(1). This is very efficient
because the application 222 easily calculates the memory addresses of sequential entries by
simply incrementing a pointer from one entry to the next.
A database repository 150 organizes data by records in tables. To generate the
same average loan amount value without using the shared cache 228 and the column data
store 410, the data analysis system would need to locate a table with the loan amount
attribute and read the records from that table to find the loan amount data. To find the
loan amount value in a record, the system would have to access the entire record and
follow pointers from one data item of the record to the next until reaching the loan
amount value.
The contiguous storage of the data values in columns 440 in a column data store
410 supports data aggregation. As a result, an application 222 only needs to read the
columns 440 involved in an analysis, instead of entire records, as previously discussed in
the example of determining the average home mortgage loan amount. Not only does less
data have to be read, but reading the relevant data is more efficient because the relevant
data is stored sequentially in memory 220, so it is easy to determine the address of
subsequent entries as the application 222 iterates through the column 440. Further, since
the data entries are stored contiguously, the data spans fewer pages of memory 220,
reducing the overhead associated with swapping memory pages.
As described, the synchronizer 225 provides an address reference to a column 440 to
the application 222 for accessing data in the column 440. The address reference is a virtual
memory location. The operating system maps the virtual memory location of the column
440 in the virtual memory address space in which an application 222 runs to the physical
memory location of the column 440 in a shared cache 228. Therefore, the application 222
accesses the column 440 as though the column 440 was included in one large block of
contiguous memory addresses belonging to the execution environment 221 that the
application 222 runs in.
Figure 5 illustrates a method for initializing the shared cache 228 and providing the
memory map 315 to the applications 222, according to one embodiment. Note, in this
example, the initialization of the shared cache 228 is discussed from the perspective of the
synchronizer 225. Although the method steps are described in conjunction with the
systems of Figures 1-4, persons of ordinary skill in the art will understand that any system
configured to perform the method steps, in any order, is within the scope of the
invention.
As shown, method 500 begins at step 505, where the synchronizer 225 receives a
list of models 310 to include in the expected data set. A user defines the expected data set
available for analysis by selecting which models 310 the system should make available for
multiple applications 222 to analyze concurrently. The user may make the selections from
a user interface at the application server 120, or may create a script that includes the
selections.
In step 510, the synchronizer 225 creates a shared cache 228 as a file in memory
220. One skilled in the art will appreciate that the shared cache 228 could be stored in
memory 220 only or in some combination of memory 220 and storage 230. The operating
system generally determines the physical location of the shared cache 228 or portions of
the shared cache 228 based upon the amount of memory 220 available. The server
computer system 115 contains sufficient memory 220 to store the entire shared cache 228
in memory 220.
In step 515, the synchronizer 225 initializes a column data store 410 in the shared
cache 228 by initializing columns 440 for the attributes defined by the selected models
310. The synchronizer 225 creates pointers to memory locations in the shared cache 228
for each column 440.
In step 520, the synchronizer 225 retrieves the records included in the expected
data set from the database repository 150. The synchronizer 225 retrieves the records by
querying the database repository 150. For example, the database repository 150 may be a
structured query language (SQL) based RDBMS, where the synchronizer 225 issues SQL
queries as defined by the selected data types to retrieve records from the tables of the
database repository 150.
In step 525, the synchronizer 225 stores data values from the retrieved records as
contiguous entries in the appropriate columns 440. The columns 440 correspond to the
attributes of the models 310. As the synchronizer 225 processes each retrieved record, the
synchronizer 225 copies the individual data values into the appropriate column 440. The
synchronizer 225 stores the first entry of a column 440 at the memory location of the
pointer that the synchronizer 225 created for the column 440 in step 515. The data values
from the first retrieved record become the first entries in the columns 440, the data values
from the second retrieved record become the second entries in the columns 440, and so on.
Thus, each data record that the synchronizer 225 retrieves is stored as multiple entries at
the same index location in multiple columns 440.
In step 530, the synchronizer 225 provides address references of the columns 440
in the shared cache 228 to the applications 222. The address references may be the
locations of the first entries of the columns 440 in the shared cache 228. The address
references may be stored in a file that each application 222 is able to access.
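Steps 515 and 525 above can be sketched as follows: initialize one column per attribute, then copy each retrieved record's values into the columns so that record i occupies index i in every column. The record layout and all names are invented for illustration and stand in for the synchronizer 225:

```java
import java.util.List;

// Hypothetical sketch of column initialization (step 515) and population
// (step 525) for a two-attribute model.
public class SynchronizerSketch {
    String[] bankNames;
    long[] loanAmounts;

    void populate(List<Object[]> records) {        // each record: {name, amount}
        bankNames = new String[records.size()];    // step 515: initialize columns
        loanAmounts = new long[records.size()];
        for (int i = 0; i < records.size(); i++) { // step 525: contiguous entries
            Object[] record = records.get(i);
            bankNames[i] = (String) record[0];
            loanAmounts[i] = (Long) record[1];
        }
    }

    public static void main(String[] args) {
        SynchronizerSketch sync = new SynchronizerSketch();
        sync.populate(List.of(
            new Object[] {"First Bank", 250_000L},
            new Object[] {"Acme Savings", 410_000L}));
        System.out.println(sync.bankNames[1] + " " + sync.loanAmounts[1]);
        // prints Acme Savings 410000
    }
}
```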
Although the synchronizer creates the columns 440 in the shared cache 228, the
address references provided to a model 310 may be virtual address locations. The model
310 may be used by an application 222 running in an execution environment 221 with
a local address space of virtual memory. A memory map 315 translates the virtual address
locations to physical memory locations in the columns 440 in the shared cache 228. The
creation of the column data store 410 in the shared cache 228 that is outside of the virtual
memory space of a single execution environment 221 allows the synchronizer 225 to
provide address references to an interface 310 used by multiple applications 222 in
multiple execution environments 221. Therefore, multiple applications can use models
310, which have the virtual address locations mapped to the shared cache 228, to access
the same data in the columns 440 concurrently.
In some embodiments of this invention, the operating system of the server
computing system 115 or the program execution environment creates and maintains the
memory map 315 of the shared cache 228. In such a case, the memory map 315 contains
physical memory locations of the shared cache 228, but not necessarily the locations of the
columns 440 in the shared cache 228. A synchronizer 225 would provide virtual address
locations to an application 222 that represent offsets into the shared cache 228 for the
physical memory location of the columns 440.
Figure 6 illustrates a method for accessing a model 310 in the shared cache 228
from the point of view of the application 222, according to one embodiment. Although the
method steps are described in conjunction with the systems of Figures 1-4, persons of
ordinary skill in the art will understand that any system configured to perform the
method steps, in any order, is within the scope of the invention.
As shown, method 600 begins at step 605, where the application 222 creates a
memory map 315 of the shared cache 228. As discussed above, the memory map 315
identifies virtual memory locations and corresponding physical memory locations in the
shared cache 228. The shared cache 228 is a memory mapped file, which the application
222 first opens and then maps into the memory of the execution environment 221. For
example, assuming the application 222 is a Java application, the application 222 opens the
shared cache 228 file as a RandomAccessFile. Then the application 222 creates a
MappedByteBuffer, which maps to the shared cache 228. Once the application 222 creates
a MappedByteBuffer, the application 222 is able to read bytes of data from specific
locations in the MappedByteBuffer that are mapped to locations in the shared cache 228.
The application utilizes models 310 to read the data from the data structure in the shared
cache 228.
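The RandomAccessFile and MappedByteBuffer steps described above can be demonstrated with a small self-contained program. The temporary file and its contents here are stand-ins for the shared cache 228:

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

// Open a file as a RandomAccessFile, map it into the JVM's virtual address
// space as a read-only MappedByteBuffer, and read a byte from a specific
// location, as in step 605.
public class MapSharedCache {
    public static void main(String[] args) throws Exception {
        Path cacheFile = Files.createTempFile("shared-cache", ".bin");
        Files.write(cacheFile, new byte[] {10, 20, 30, 40});

        try (RandomAccessFile raf = new RandomAccessFile(cacheFile.toFile(), "r");
             FileChannel channel = raf.getChannel()) {
            // Map the whole file; subsequent reads go through the operating
            // system's page mapping rather than explicit read() calls.
            MappedByteBuffer buffer =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            System.out.println(buffer.get(2)); // prints 30
        }
        Files.deleteIfExists(cacheFile);
    }
}
```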
In step 610, the application 222 makes a data access method call to retrieve the data
of a model 310. Depending on how the data access method has been developed, the data
access method may retrieve a subset of the raw data stored in the shared cache 228 or the
data access method may retrieve an aggregate of the data stored in the shared cache 228.
In step 615, the interface 310 requests data from address references in the memory
mapped representation of the shared cache 228. According to one embodiment of the
invention, the address references are locations of the first entries in the columns 440 of the
column data store 410. The interface 310 may request data beginning at the first entry of
the column 440 or may calculate an offset location. If the application 222 is a Java
application, the requested memory locations are virtual memory locations in the
MappedByteBuffer. As noted, the MappedByteBuffer is the memory mapped
representation of the shared cache 228, so the MappedByteBuffer is included in the virtual
address space of the execution environment 221 that the application 222 runs in.
In step 620, the operating system maps the requested memory locations from the
virtual address space of the execution environment 221 to the physical memory locations
in the shared cache 228. According to one embodiment, the operating system identifies the
virtual memory location in a table in the memory map 315 and retrieves the corresponding
physical memory location.
In step 625, the application 222 receives the requested data from the shared cache
228. According to one embodiment of the invention, the operating system performs the
memory mapping, so the application 222 receives the requested data as if the data had been
requested from the address space of the execution environment 221.
In step 630, the application 222 processes the retrieved data according to the
intended function of the data analysis application. For example, the application 222 may
report some aggregate or subset of the requested data in the shared cache 228 to the client
110 or may issue additional data requests based upon the already retrieved data. This
processing may occur as part of the data access method call or after the data access method
call has returned.
One embodiment of the invention includes a method for a plurality of
applications to access a data set concurrently. This method may generally include
identifying a plurality of attributes of an expected data set to be accessed concurrently by
the plurality of applications and allocating a memory space for a shared cache. The shared
cache comprises a column data store configured to store data for each of the plurality of
attributes of the expected data set in columns. This method may further include retrieving
the expected data set from a database, populating the shared cache with the expected data
set; and storing memory address locations corresponding to the columns of the column
data store of the shared cache for access by the plurality of applications. Each application
generates a memory map from memory locations in a virtual address space of each
respective application to the stored address memory locations.
Typically the plurality of applications access the data set by:
requesting data from one or more of the memory locations in the virtual address
space allocated to the application;
mapping memory locations in the virtual address space to corresponding memory
address locations of the shared cache, via the memory map; and
accessing the requested data from the mapped memory locations in the shared
cache.
Typically storing the data of each of the plurality of attributes of the expected data
set in columns, comprises:
dividing one or more data records retrieved from the database into a plurality of
attribute values;
identifying a contiguous memory location for each attribute; and
storing each attribute value in one of the identified contiguous memory locations.
Typically a plurality of object-oriented representations of the expected data set are
provided to a plurality of applications by:
initializing a plurality of objects with data access methods;
storing the memory address locations corresponding to the columns of the column
data store of the shared cache in the objects; and
providing the plurality of objects for access by the plurality of applications.
Typically at least one of the plurality of applications access the data set by:
calling a data access method of one of the objects; and
receiving the requested data from the data access method of the object.
Typically the shared cache is configured to provide read only access to the plurality
of applications.
Typically the shared cache is updated by:
identifying a plurality of attributes of an updated expected data set;
re-allocating the memory space for the shared cache;
retrieving an updated expected data set from the database;
populating the shared cache with the updated expected data set; and
storing a plurality of address references to the columns of the column data store in
the shared cache.
Other embodiments of the present invention include, without limitation, a
computer-readable storage medium including instructions that, when executed by a
processing unit, cause the processing unit to implement aspects of the approach described
herein as well as a system that includes different elements configured to implement aspects
of the approach described herein.
Thus, the invention seeks to provide a computer-readable storage medium storing
instructions that, when executed by a processor, cause the processor to perform an
operation for providing a plurality of applications with concurrent access to data, the
operation comprising:
identifying a plurality of attributes of an expected data set to be accessed
concurrently by the plurality of applications;
allocating a memory space for a shared cache, wherein the shared cache comprises
a column data store configured to store data for each of the plurality of attributes of the
expected data set in columns;
retrieving the expected data set from a database;
populating the shared cache with the expected data set; and
storing memory address locations corresponding to the columns of the column data
store of the shared cache for access by the plurality of applications, wherein each
application generates a memory map from memory locations in a virtual address space of
each respective application to the stored address memory locations.
Typically the applications access the data set by:
requesting data from one or more of the memory locations in the virtual address
space allocated to the application;
mapping memory locations in the virtual address space to corresponding memory
address locations of the shared cache, via the memory map; and
accessing the requested data from the mapped memory locations in the shared
cache.
Typically storing the data of each of the plurality of attributes of the expected data
set in columns, comprises:
dividing one or more data records retrieved from the database into a plurality of
attribute values;
identifying a contiguous memory location for each attribute; and
storing each attribute value in one of the identified contiguous memory locations.
Typically a plurality of object-oriented representations of the expected data set are
provided to a plurality of applications by:
initializing a plurality of objects with data access methods;
storing the memory address locations corresponding to the columns of the column
data store of the shared cache in the objects; and
providing the plurality of objects for access by the plurality of applications.
Typically at least one of the plurality of applications access the data set by:
calling a data access method of one of the objects; and
receiving the requested data from the data access method of the object.
Typically the shared cache is configured to provide read only access to the plurality
of applications.
Typically the shared cache is updated by:
identifying a plurality of attributes of an updated expected data set;
re-allocating the memory space for the shared cache;
retrieving an updated expected data set from the database;
populating the shared cache with the updated expected data set; and
storing a plurality of address references to the columns of the column data store in
the shared cache.
The invention further seeks to provide a computer system, comprising:
a memory; and
a processor storing one or more programs configured to perform an
operation for providing a plurality of applications with concurrent access to data, the
operation comprising:
identifying a plurality of attributes of an expected data set to be accessed
concurrently by the plurality of applications;
allocating a memory space for a shared cache, wherein the shared cache comprises
a column data store configured to store data for each of the plurality of attributes of the
expected data set in columns;
retrieving the expected data set from a database;
populating the shared cache with the expected data set; and
storing memory address locations corresponding to the columns of the column data
store of the shared cache for access by the plurality of applications, wherein each
application generates a memory map from memory locations in a virtual address space of
each respective application to the stored address memory locations.
Typically the plurality of applications access the data set by:
requesting data from one or more of the memory locations in the virtual address
space allocated to the application;
mapping memory locations in the virtual address space to corresponding memory
address locations of the shared cache, via the memory map; and
accessing the requested data from the mapped memory locations in the shared
cache.
Typically storing the data of each of the plurality of attributes of the expected data
set in columns, comprises:
dividing one or more data records retrieved from the database into a plurality of
attribute values;
identifying a contiguous memory location for each attribute; and
storing each attribute value in one of the identified contiguous memory locations.
Typically a plurality of object-oriented representations of the expected data set are
provided to a plurality of applications by:
initializing a plurality of objects with data access methods;
storing the memory address locations corresponding to the columns of the column
data store of the shared cache in the objects; and
providing the plurality of objects for access by the plurality of applications.
Typically at least one of the plurality of applications access the data set by:
calling a data access method of one of the objects; and
receiving the requested data from the data access method of the object.
Typically the shared cache is configured to provide read only access to the plurality
of applications.
Typically the shared cache is updated by:
identifying a plurality of attributes of an updated expected data set;
re-allocating the memory space for the shared cache;
retrieving an updated expected data set from the database;
populating the shared cache with the updated expected data set; and
storing a plurality of address references to the columns of the column data store in
the shared cache.
While the foregoing is directed to embodiments of the present invention, other and
further embodiments of the invention may be devised without departing from the basic
scope thereof. For example, aspects of the present invention may be implemented in
hardware or software or in a combination of hardware and software. One embodiment of
the invention may be implemented as a program product for use with a computer system.
The program(s) of the program product define functions of the embodiments (including the
methods described herein) and can be contained on a variety of computer-readable storage
media. Illustrative computer-readable storage media include, but are not limited to: (i)
non-writable storage media (e.g., read-only memory devices within a computer such as
CD-ROM disks readable by a CD-ROM drive, flash memory, ROM chips or any type of
solid-state non-volatile semiconductor memory) on which information is permanently
stored; and (ii) writable storage media (e.g., floppy disks within a diskette drive or hard-
disk drive or any type of solid-state random-access semiconductor memory) on which
alterable information is stored.
The invention has been described above with reference to specific embodiments.
Persons of ordinary skill in the art, however, will understand that various modifications
and changes may be made thereto without departing from the broader spirit and scope of
the invention as set forth in the appended claims. The foregoing description and drawings
are, accordingly, to be regarded in an illustrative rather than a restrictive sense.
Therefore, the scope of the present invention is determined by the claims that
follow.
Throughout this specification and the claims which follow, unless the context
requires otherwise, the word "comprise", and variations such as "comprises" and
"comprising", will be understood to imply the inclusion of a stated integer or step or group
of integers or steps but not the exclusion of any other integer or step or group of integers or
steps.
Claims (21)
1. A method for providing a plurality of applications with concurrent access to data, the method comprising: identifying a plurality of attributes of an expected data set to be accessed concurrently by the plurality of applications; allocating a memory space for a shared cache, wherein the shared cache comprises a column data store configured to store data for each of the plurality of attributes of the expected data set in columns; retrieving the expected data set from a database; populating the shared cache with the expected data set; and storing memory address locations corresponding to the columns of the column data store of the shared cache for access by the plurality of applications, wherein each application generates a memory map from memory locations in a virtual address space of each respective application to the stored address memory locations.
2. The method of claim 1, wherein the plurality of applications access the data set by: requesting data from one or more of the memory locations in the virtual address space allocated to the application; mapping memory locations in the virtual address space to corresponding memory address locations of the shared cache, via the memory map; and accessing the requested data from the mapped memory locations in the shared cache.
3. The method of claim 1 or claim 2, wherein storing the data of each of the plurality of attributes of the expected data set in columns, comprises: dividing one or more data records retrieved from the database into a plurality of attribute values; identifying a contiguous memory location for each attribute; and storing each attribute value in one of the identified contiguous memory locations.
4. The method of any one of the claims 1 to 3, wherein a plurality of object-oriented representations of the expected data set are provided to a plurality of applications by: initializing a plurality of objects with data access methods; storing the memory address locations corresponding to the columns of the column data store of the shared cache in the objects; and providing the plurality of objects for access by the plurality of applications.
5. The method of claim 4, wherein at least one of the plurality of applications access the data set by: calling a data access method of one of the objects; and receiving the requested data from the data access method of the object.
6. The method of any one of the claims 1 to 5, wherein the shared cache is configured to provide read only access to the plurality of applications.
7. The method of any one of the claims 1 to 6, wherein the shared cache is updated by: identifying a plurality of attributes of an updated expected data set; re-allocating the memory space for the shared cache; retrieving an updated expected data set from the database; populating the shared cache with the updated expected data set; and storing a plurality of address references to the columns of the column data store in the shared cache.
8. A computer-readable storage medium storing instructions that, when executed by a processor, cause the processor to perform an operation for providing a plurality of applications with concurrent access to data, the operation comprising: identifying a plurality of attributes of an expected data set to be accessed concurrently by the plurality of applications; allocating a memory space for a shared cache, wherein the shared cache comprises a column data store configured to store data for each of the plurality of attributes of the expected data set in columns; retrieving the expected data set from a database; populating the shared cache with the expected data set; and storing memory address locations corresponding to the columns of the column data store of the shared cache for access by the plurality of applications, wherein each application generates a memory map from memory locations in a virtual address space of each respective application to the stored address memory locations.
9. The computer-readable storage medium of claim 8, wherein the applications access the data set by: requesting data from one or more of the memory locations in the virtual address space allocated to the application; mapping memory locations in the virtual address space to corresponding memory address locations of the shared cache, via the memory map; and accessing the requested data from the mapped memory locations in the shared cache.
10. The computer-readable storage medium of claim 8 or claim 9, wherein storing the data of each of the plurality of attributes of the expected data set in columns, comprises: dividing one or more data records retrieved from the database into a plurality of attribute values; identifying a contiguous memory location for each attribute; and storing each attribute value in one of the identified contiguous memory locations.
11. The computer-readable storage medium of any one of the claims 8 to 10, wherein a plurality of object-oriented representations of the expected data set are provided to a plurality of applications by: initializing a plurality of objects with data access methods; storing the memory address locations corresponding to the columns of the column data store of the shared cache in the objects; and providing the plurality of objects for access by the plurality of applications.
12. The computer-readable storage medium of claim 11, wherein at least one of the plurality of applications access the data set by: calling a data access method of one of the objects; and receiving the requested data from the data access method of the object.
13. The computer-readable storage medium of any one of the claims 8 to 12, wherein the shared cache is configured to provide read only access to the plurality of applications.
14. The computer-readable storage medium of any one of the claims 8 to 13, wherein the shared cache is updated by: identifying a plurality of attributes of an updated expected data set; re-allocating the memory space for the shared cache; retrieving an updated expected data set from the database; populating the shared cache with the updated expected data set; and storing a plurality of address references to the columns of the column data store in the shared cache.
15. A computer system, comprising: a memory; and a processor storing one or more programs configured to perform an operation for providing a plurality of applications with concurrent access to data, the operation comprising: identifying a plurality of attributes of an expected data set to be accessed concurrently by the plurality of applications; allocating a memory space for a shared cache, wherein the shared cache comprises a column data store configured to store data for each of the plurality of attributes of the expected data set in columns; retrieving the expected data set from a database; populating the shared cache with the expected data set; and storing memory address locations corresponding to the columns of the column data store of the shared cache for access by the plurality of applications, wherein each application generates a memory map from memory locations in a virtual address space of each respective application to the stored address memory locations.
16. The system of claim 15, wherein the plurality of applications access the data set by: requesting data from one or more of the memory locations in the virtual address space allocated to the application; mapping memory locations in the virtual address space to corresponding memory address locations of the shared cache, via the memory map; and accessing the requested data from the mapped memory locations in the shared cache.
17. The system of claim 15 or claim 16, wherein storing the data of each of the plurality of attributes of the expected data set in columns, comprises: dividing one or more data records retrieved from the database into a plurality of attribute values; identifying a contiguous memory location for each attribute; and storing each attribute value in one of the identified contiguous memory locations.
18. The system of any one of claims 15 to 17, wherein a plurality of object-oriented representations of the expected data set are provided to a plurality of applications by: initializing a plurality of objects with data access methods; storing the memory address locations corresponding to the columns of the column data store of the shared cache in the objects; and providing the plurality of objects for access by the plurality of applications.
19. The system of claim 18, wherein at least one of the plurality of applications accesses the data set by: calling a data access method of one of the objects; and receiving the requested data from the data access method of the object.
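Claims 18 and 19 together can be sketched as a thin object wrapper over the column store: each object holds references into the columns (standing in for the stored memory addresses) and exposes data access methods rather than copies of the data. The class and method names are hypothetical:

```python
class RecordView:
    """Object-oriented representation of one record in the shared cache.

    Holds column references plus a row index; applications call its data
    access method instead of touching the cache layout directly."""

    def __init__(self, columns, row):
        self._columns = columns   # references into the shared column store
        self._row = row

    def get(self, attr):
        # Data access method: resolve the attribute's column, then the row.
        return self._columns[attr][self._row]

columns = {"id": [1, 2, 3], "label": ["a", "b", "c"]}
views = [RecordView(columns, i) for i in range(3)]
```

Because the objects store references rather than values, they stay valid views of whatever the cache currently holds.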
20. The system of any one of claims 15 to 19, wherein the shared cache is configured to provide read only access to the plurality of applications.
21. The system of any one of claims 15 to 20, wherein the shared cache is updated by: identifying a plurality of attributes of an updated expected data set; re-allocating the memory space for the shared cache; retrieving an updated expected data set from the database; populating the shared cache with the updated expected data set; and storing a plurality of address references to the columns of the column data store in the shared cache.
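The update procedure of claim 21 replaces the cache wholesale rather than patching it in place: a new column store is built from the updated data set, then the published reference is swapped. A minimal sketch under those assumptions:

```python
def rebuild_cache(updated_rows, attrs):
    """Re-allocate and repopulate the column store from an updated data set
    retrieved from the database (hypothetical helper)."""
    new_cache = {attr: [] for attr in attrs}
    for row in updated_rows:
        for attr, value in zip(attrs, row):
            new_cache[attr].append(value)
    return new_cache

attrs = ["id", "label"]
cache = rebuild_cache([(1, "a")], attrs)                  # initial data set
cache = rebuild_cache([(1, "a"), (2, "b")], attrs)        # updated data set
```

Swapping the reference after the rebuild means readers never observe a half-updated cache, which is consistent with the read-only access of claim 20.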
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/827,627 US9367463B2 (en) | 2013-03-14 | 2013-03-14 | System and method utilizing a shared cache to provide zero copy memory mapped database |
US13/827627 | 2013-03-14 |
Publications (2)
Publication Number | Publication Date |
---|---|
NZ622485A NZ622485A (en) | 2014-11-28 |
NZ622485B true NZ622485B (en) | 2015-03-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9652291B2 (en) | System and method utilizing a shared cache to provide zero copy memory mapped database | |
US11263211B2 (en) | Data partitioning and ordering | |
US11899666B2 (en) | System and method for dynamic database split generation in a massively parallel or distributed database environment | |
US10089377B2 (en) | System and method for data transfer from JDBC to a data warehouse layer in a massively parallel or distributed database environment | |
US10380114B2 (en) | System and method for generating rowid range-based splits in a massively parallel or distributed database environment | |
US10180973B2 (en) | System and method for efficient connection management in a massively parallel or distributed database environment | |
US10528596B2 (en) | System and method for consistent reads between tasks in a massively parallel or distributed database environment | |
US11544268B2 (en) | System and method for generating size-based splits in a massively parallel or distributed database environment | |
US7146365B2 (en) | Method, system, and program for optimizing database query execution | |
US8364909B2 (en) | Determining a conflict in accessing shared resources using a reduced number of cycles | |
US10089357B2 (en) | System and method for generating partition-based splits in a massively parallel or distributed database environment | |
US10078684B2 (en) | System and method for query processing with table-level predicate pushdown in a massively parallel or distributed database environment | |
JP6434154B2 (en) | Identifying join relationships based on transaction access patterns | |
WO2018157680A1 (en) | Method and device for generating execution plan, and database server | |
US11288287B2 (en) | Methods and apparatus to partition a database | |
US11176091B2 (en) | Techniques for dynamic multi-storage format database access | |
Matallah et al. | Experimental comparative study of NoSQL databases: HBASE versus MongoDB by YCSB | |
CN114443615A (en) | Database management system, related apparatus, method and medium | |
US11080299B2 (en) | Methods and apparatus to partition a database | |
NZ622485B (en) | Shared cache used to provide zero copy memory mapped database | |
CN113742346A (en) | Asset big data platform architecture optimization method | |
Cebollero et al. | Catalog Views and Dynamic Management Views |