US10459932B2 - Visualizing large data volumes utilizing initial sampling and multi-stage calculations - Google Patents
Visualizing large data volumes utilizing initial sampling and multi-stage calculations
- Publication number
- US10459932B2 (application US 14/575,633)
- Authority
- US
- United States
- Prior art keywords
- dataset
- memory database
- engine
- sample
- result set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active, expires
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2462—Approximate or statistical queries
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/26—Visual data mining; Browsing structured data
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/9038—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/217—Database tuning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/23—Updating
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/338—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
Definitions
- the present invention relates to analysis of large data volumes, and in particular, to systems and methods for visualizing large data volumes utilizing an initial sampling and a multi-stage calculation.
- Embodiments relate to systems and methods of visualizing large data volumes utilizing sampling techniques and the formulation and application of multi-stage calculation plan(s).
- a big data volume is initially sampled to reduce its size. This sampling may be random in nature.
- the sampled dataset may be further refined (wrangled) by discretization that may include rounding, binning, and/or characterization etc. to produce a wrangled sample dataset.
- a user defines useful end visualization(s) by inputting expected dimension/measures, thereby creating a calculation plan having multiple stages. From these visualizations of sampled data and the calculation plan derived therefrom, minimal grouping sets are deduced for application to the full dataset. The user publishes/schedules the wrangled operation and grouping set extraction definition.
- a wrangled dataset and grouping sets are produced in the big data layer.
- minimal grouping sets are retrieved in the in-memory engine of the client and processed by an in-memory database engine according to the common calculation plan. This produces result sets and a final set of visualizations of the full dataset, in which the user can recognize valuable data trends and/or relationships.
- FIG. 1 illustrates a simplified view of a system configured to perform visualization of large data volumes according to an embodiment.
- FIG. 2 illustrates a simplified diagram of a process flow according to an embodiment.
- FIGS. 3A-3H illustrate various steps in a specific example.
- FIG. 4 illustrates hardware of a special purpose computing machine configured to implement visualization of large data volumes according to an embodiment.
- FIG. 5 illustrates an example of a computer system.
- FIG. 1 shows a simplified view of a system 100 according to an embodiment.
- a large data volume 102 (e.g., comprising millions of records)
- This large data volume typically comprises data stored according to tables/rowsets.
- a user 106 seeks to access, manipulate, and visualize these large data volumes utilizing the analytical functionality provided by an interface layer 110, for example a Business Intelligence (BI) tool in communication with an in-memory database layer.
- Such an in-memory database layer is configured to handle data volumes smaller than those of the big data layer.
- the user provides a first input 112 to the big data layer, to produce a sampled dataset 114 of a size able to be manipulated by the in-memory database.
- This sampled dataset is then subject to additional refinement to produce a wrangled sample dataset 116 .
- This wrangled sample may be transferred into and stored in an interface layer including an in-memory database.
- Examples of techniques that can be used to perform refinement of the sampled data to produce the wrangled sample dataset include but are not limited to discretization such as binning, grouping, categorization, and others. Further details regarding data refinement techniques are discussed in detail in connection with the example below.
- the user interacts with the wrangled sample dataset utilizing a visualization engine 118 .
- the visualization engine 118 is shown in two parts for purposes of illustrating a temporal nature of interaction therewith, and this figure does not depict two separate visualization engines.
- the user provides inputs 120 to the visualization engine, specifying those dimensions and measures of the underlying data that are expected to be of interest.
- This exploration of the sampled data is also referred to herein as storytelling.
- This storytelling reflects effort by the user to create one or more data visualizations in a manner giving rise to meaningful insight into trends and relationships present within the data.
- a calculation plan 122 is created. Specifically, the calculation plan comprises manipulating the data of minimal grouping sets 124 over multiple stages 126 . Each of these stages typically comprises execution of operations in the Structured Query Language (SQL) or other relational query language.
- FIG. 1 shows the execution of the calculation plan by a calculation engine 119 .
- This calculation engine may comprise an in-memory database engine.
- Each calculation node 121 of the calculation plan consumes one or multiple rowsets, and produces a rowset. Depending on the technology used, some intermediate nodes may be temporarily materialized.
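The node structure described above can be pictured with a minimal sketch. It assumes a rowset is modeled as a list of row dictionaries; the `Node` class, its `evaluate` method, and the two sample operations are hypothetical names chosen for illustration, not part of the described system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Rowset = List[dict]  # assumption: a rowset is a list of row dictionaries


@dataclass
class Node:
    """Hypothetical calculation node: consumes input rowsets, produces one rowset."""
    name: str
    op: Callable[..., Rowset]
    inputs: List["Node"] = field(default_factory=list)

    def evaluate(self, cache: Dict[str, Rowset]) -> Rowset:
        # Results are kept in `cache`, loosely mirroring the temporary
        # materialization of intermediate nodes.
        if self.name not in cache:
            args = [node.evaluate(cache) for node in self.inputs]
            cache[self.name] = self.op(*args)
        return cache[self.name]


def total_by_airport(rows: Rowset) -> Rowset:
    # Aggregation stage: sum the flight counts per airport.
    totals: Dict[str, int] = {}
    for row in rows:
        totals[row["airport"]] = totals.get(row["airport"], 0) + row["flights"]
    return [{"airport": a, "flights": n} for a, n in sorted(totals.items())]


def top_airport(rows: Rowset) -> Rowset:
    # Final stage: keep only the busiest airport.
    return [max(rows, key=lambda r: r["flights"])]


source = Node("source", lambda: [
    {"airport": "ATL", "flights": 3},
    {"airport": "SEA", "flights": 2},
    {"airport": "ATL", "flights": 4},
])
stage1 = Node("totals", total_by_airport, [source])
stage2 = Node("top", top_airport, [stage1])

cache: Dict[str, Rowset] = {}
result = stage2.evaluate(cache)
```

Here `result` is the final result set, and `cache` holds the rowset produced by each intermediate node along the way.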
- These final result sets may be iterated to create corresponding visualization(s) 128 of the data for the user.
- These visualizations can define elements such as axis members, point coordinates, cell values, bubble sizes, etc.
- a user may generate and revise the calculation plan on the basis of components (e.g., dimensions, measures) and/or regions (e.g., timeframes, intervals, ranges) of the sampled dataset expected to yield insights.
- the minimal grouping sets comprising the input to this calculation plan are deduced from the end visualizations by the calculation plan. That is, with the assistance of the visualization engine, a user may move backward through the multiple stages relied upon to calculate the end visualizations, to identify a minimal set of data needed—the minimal grouping set.
- the end visualizations are provided by the interface layer to the user.
- input 130 comprising a grouping set extracting definition is provided to the big data layer to generate minimal grouping sets.
- a minimal grouping applies a filter and an aggregation on a source table, and produces a new table smaller than the source table.
- Each grouping set is substituted for the source table in the calculation plan, for one or many input nodes.
- the grouping sets are determined so that an input node produces the same results, whether it operates on the source table or on its substituted grouping set.
- a particular calculation plan may involve the following three (3) input nodes N0, N1, N2 (using SQL as a query language) on a source table <source table>.
- all input nodes can be computed by substituting the results of GS1 for the source table. All source tuples that match the filter condition for N0, N1, or N2 also match the filter condition for GS1 (so the source tuples needed to compute the input nodes are included in the source tuples used to compute GS1). All columns used by the filter conditions for N0, N1, and N2 are available in GS1.
- Revenue is aggregated with an additive function, allowing aggregation of Revenue on two columns (ProductFamily and Country) and then re-aggregation on a single column (ProductFamily). This allows the same results to be obtained as by aggregating Revenue directly on ProductFamily.
- No other grouping set can contain fewer rows than GS1 and still be used as a source to produce N0, N1, and N2. This reflects the minimal property of the minimal grouping set.
- the SQL query for GS1 is pushed to the big data layer. Its result set will be transferred from the big data layer to the local engine, as a new table <GS table>.
- the definitions of N0, N1, and N2 in the calculation plan can be modified to use <GS table> instead of <source table>.
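The substitution property can be checked on a toy table. The sketch below uses Python's standard `sqlite3` module. The node queries are not reproduced in this text, so the two input-node queries, the grouping set query, the table names `source_table` and `gs_table`, and the data are all hypothetical stand-ins that merely follow the ProductFamily/Country/Revenue description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE source_table (ProductFamily TEXT, Country TEXT, Revenue INTEGER)"
)
conn.executemany(
    "INSERT INTO source_table VALUES (?, ?, ?)",
    [("Bikes", "US", 100), ("Bikes", "US", 50), ("Bikes", "FR", 30),
     ("Cars", "US", 200), ("Cars", "FR", 70), ("Cars", "DE", 999)],
)

# Hypothetical grouping set: its filter covers the filters of both input
# nodes, and it keeps every column those nodes filter or group on.
conn.execute("""
    CREATE TABLE gs_table AS
    SELECT ProductFamily, Country, SUM(Revenue) AS Revenue
    FROM source_table
    WHERE Country IN ('US', 'FR')
    GROUP BY ProductFamily, Country
""")

# Hypothetical input nodes, parameterized by the table they read from.
N0 = ("SELECT ProductFamily, SUM(Revenue) FROM {t} "
      "WHERE Country = 'US' GROUP BY ProductFamily ORDER BY ProductFamily")
N1 = ("SELECT ProductFamily, SUM(Revenue) FROM {t} "
      "WHERE Country IN ('US', 'FR') GROUP BY ProductFamily ORDER BY ProductFamily")


def run(query: str, table: str):
    return conn.execute(query.format(t=table)).fetchall()


n0_source, n0_gs = run(N0, "source_table"), run(N0, "gs_table")
n1_source, n1_gs = run(N1, "source_table"), run(N1, "gs_table")
gs_rows = conn.execute("SELECT COUNT(*) FROM gs_table").fetchone()[0]
```

Because SUM is additive, re-aggregating the pre-aggregated rows of `gs_table` gives the same node results as reading `source_table`, while the grouping set holds fewer rows than the source.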
- the grouping set extracting definition allows the big data layer to create a wrangled dataset 132 from the large data volume.
- This wrangled dataset reflects data manipulation and/or refinements that reduce its size and improve its effectiveness for processing.
- a variety of techniques may be used to create the wrangled dataset 132 from the large data volume based on the minimal grouping set definition. Such techniques may include but are not limited to binning and/or grouping.
- the wrangled dataset may be materialized in the big data layer as a specific table, or may remain in virtual form.
- the minimal grouping sets 124 are generated from the wrangled dataset. These minimal grouping sets match the expected input nodes of the calculation plan.
- An initial calculation filter may be designed to produce minimal grouping sets from the wrangled dataset. Filters are pushed down to the wrangled dataset such that the grouping set contains only dimensions and measures to be aggregated together for at least one visualization block. The minimal grouping sets are deduced from what the end user defines: visualizations involving prepared dimensions and measures.
- the minimal grouping sets are transferred to the layer including the in-memory database. There, they may be stored (e.g., by the in-memory database) and accessed by the engine to perform calculations according to the calculation plan.
- This calculation plan transforms the pre-filtered and pre-aggregated data of the minimal grouping sets into final result sets 140 that will be iterated to “draw” the end visualizations 142 (axis members, point coordinates, cell values, bubble sizes, etc.).
- Each calculation node consumes one or multiple rowsets, and produces a rowset.
- some intermediate nodes may be temporarily materialized.
- Example of such further actions can include taking a new sampling to begin the process again, and/or changing the calculation plan to perform different calculations on an existing sample.
- the engine may be an engine of an existing in-memory database, such as the HANA in-memory database available from SAP SE of Walldorf, Germany.
- Other examples of in-memory database architectures include, but are not limited to, the SYBASE IQ database also available from SAP SE; the Microsoft Embedded SQL for C (ESQL/C) database available from Microsoft Corp. of Redmond, Wash.; and the Exalytics In-Memory database available from Oracle Corp. of Redwood Shores, Calif.
- FIG. 2 shows a simplified diagram illustrating a process flow 200 according to an embodiment.
- an interface layer receives a sampled dataset from a big data layer.
- the sampled dataset may be refined by the application of one or more techniques.
- a visualization engine of the interface layer designs end visualization(s) that provide insight into the sampled dataset.
- In a fourth step 208, the visualization engine generates a calculation plan comprising multiple stages that culminate in these end visualizations. Minimal grouping sets are defined in the calculation plan. These minimal grouping sets are deduced from the calculation plan.
- the grouping set results are received from the sample dataset. These grouping set results may reflect refinement/wrangling of the data.
- In a sixth step 212, the grouping set is processed by an in-memory engine according to the calculation plan.
- a first visualization is communicated to the user.
- Embodiments are now described in connection with an example of the visualization and analysis of a large data volume from the HADOOP big data platform, which is available from the APACHE SOFTWARE FOUNDATION.
- the visualization/analysis is performed based upon multi-stage calculations of a calculation plan formulated utilizing a visualization engine of a Business Intelligence (BI) tool, executed by a HANA in-memory calculation engine embedded in a LUMIRA interface layer available from SAP SE, of Walldorf, Germany.
- FIG. 3A shows the initial stage 300 , where source dataset 302 resides in the big data layer 304 of the HADOOP platform.
- This source dataset may be extremely large, comprising for example on the order of at least millions of records.
- the data comprises all national flights in the USA over a decade, for three different busy destination airports: Atlanta (ATL), Seattle (SEA), and St. Louis (STL).
- the HADOOP platform may afford a user with some basic functionality for interrogating and manipulating this large volume of data. Examples of such functionality can include filter and aggregation operations.
- HADOOP big data platform may be unable to provide a user with functionality necessary to perform more detailed analysis of the large data volumes involved. Such functionality is described in detail in the following SQL standards—SQL:99, SQL:2003, and SQL:2011, each of which is incorporated by reference herein for all purposes.
- HADOOP may not support one or more of the following SQL operators commonly used for data analysis and manipulation: Union, Intersection, Minus, Rownum, Rank, Running Aggr, Moving Aggr.
- embodiments may utilize an initial sampling of the big dataset of HADOOP, in order to create a dataset of a reduced size that is amenable to handling by an in-memory database engine (e.g., the calculation engine of the LUMIRA interface) offering such enhanced analysis/manipulation functionality.
- the initial sampling may be random in nature.
- this smaller dataset produced by sampling may be processed by a calculation engine of an in-memory database to develop a calculation plan affording a user with relevant visualizations.
- the user may return to query the original big data set to obtain minimal grouping sets therefrom. Processing those minimal grouping sets according to a calculation plan executed by the engine, may afford the user with enhanced visualizations of data of the big data set.
- FIG. 3B thus shows the next stage, wherein a user accesses the big data layer 304 and utilizes functionality thereof to create a random sample 306 having particular characteristics.
- this random sample may not be persisted in HADOOP. Instead, it may be stored as a wrangled dataset accessible to the calculation engine, as is further discussed below.
- the user may enter HADOOP HIVE server connection parameters into the LUMIRA interface (the user may also be called upon to enter a credential).
- the HADOOP system proposes the list of HIVE schema as well as the corresponding HIVE tables for a chosen schema. The user selects one for each.
- this dataset name can be according to the format: [“schema name”.“table name”.]
- a user may next click on Add Table button to display the corresponding tableview.
- the system displays to the user a tableview comprising the dataset columns with the first twenty lines.
- the user selects for retrieval, only those specific table columns that are to be interacted with.
- the system shows the number of lines, columns, and cells that will be acquired from the dataset, so that the user can understand the impact of these choices.
- the calculation is performed based on select count(*) from the table. This operation may be executed efficiently by the existing HADOOP functionality.
- the system may advise a user to leverage sampling as a basis for data visualization and analysis according to an embodiment.
- the user can select a number of lines for the sampling to be acquired.
- the user may also select the column on which a sampling algorithm is to be run.
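The text does not say which sampling algorithm is used. As one possible illustration, the sketch below draws a uniform random sample of a user-chosen number of lines in a single pass using reservoir sampling (Algorithm R). This technique is swapped in here as an assumption, and `reservoir_sample` is a hypothetical helper name.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Return a uniform random sample of up to k rows in one pass (Algorithm R)."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    sample: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Row i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


# Stand-in for a large table: row identifiers 0..99999.
sample = reservoir_sample(range(100_000), k=1_000)
```

A single pass with bounded memory suits sources whose total size is not loaded into the client; when the source holds fewer than `k` rows, all of them are returned.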
- the problem may be equivalent to counting the frequency of the members of a single dimension, and the frequency of tuples involving multiple dimensions.
- the distribution of the dimension variable is half-normal. Two distributions for the measure were tested: a normal distribution and a custom-built multimodal distribution.
- the relative error depends on the distribution of measure values (before aggregation).
- the error curves are similar for the three full data sizes for a given measure distribution. For the tested distributions, the relative error depends mostly on the sample size (not the sampling ratio) and on the relative frequency of the dimension member being aggregated.
- a 1M-row sample supports SUM aggregations with a relative error below 10%, for dimension members/tuples whose relative frequency is as low as 0.7%.
- a 1 million rows sample may be a good starting point for a workflow on a HADOOP data source.
- a random sample this size could be collected in less than 10 minutes, and could then support several hours of data exploration, data analysis, and/or visualization design.
- a sample size of one million rows is not required, and as used here this figure merely represents a heuristic rather than a fixed number. The actual size of the sample may vary depending upon the particular application and embodiment.
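The relationship between a sample-based SUM and the exact SUM can be explored with a small simulation. The sketch below is not the experiment reported above: the population size, the two dimension members, the uniform measure distribution, and the scale-up of the sample total by N/n are all assumptions made for illustration.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical full dataset: (dimension member, measure value) pairs, where
# member "A" has a relative frequency of about 10%.
N = 100_000
population = [
    ("A" if rng.random() < 0.10 else "B", rng.uniform(50.0, 150.0))
    for _ in range(N)
]


def sum_for(rows, member):
    """SUM of the measure over rows belonging to one dimension member."""
    return sum(value for dim, value in rows if dim == member)


# Simple random sample without replacement.
n = 10_000
sample = rng.sample(population, n)

exact = sum_for(population, "A")
# Assumption: the sample total is scaled by N / n to estimate the full SUM.
estimate = sum_for(sample, "A") * (N / n)
relative_error = abs(estimate - exact) / exact
```

Re-running with a rarer member or a smaller `n` shows the relative error growing, which is the qualitative behavior described above.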
- Fetching the sample asynchronously could allow the end-user to start working as soon as a certain number of rows (e.g. 100K rows) have been collected, which should take less than a minute on an intranet. Large samples may be shared across multiple users. And, it is further noted that most eventually published visualizations will likely be computed offline (i.e. by a scheduled task), on the full available data.
- the interface then acquires the sampled dataset and provides statistics to the user. At this point, a user may be advised that she is working with only a sampled dataset, rather than the entire dataset.
- the sampled dataset may next be subjected to certain processing for refinement.
- this data refining process may be referred to herein as data wrangling.
- a user may define one or more of the following in order to enhance the relevance of the sampled data for visualization and focus analysis thereon:
- this data manipulation may be implemented by the LUMIRA interface layer. Accordingly, FIG. 3B shows the wrangled sample 308 present within that layer. FIG. 3C shows processing of the wrangled sample within the interface layer.
- These data manipulations may impart significant business meaning to the data. They may also reduce a size of the published dataset once aggregated, thereby enhancing an ability of the engine to perform analytical processing of the smaller data set in a reasonable period of time.
- a user may take actions in order to prepare a wrangled sample for processing by a calculation engine of the in-memory database layer. Examples of such actions include server-side discretization and the removal of columns.
- a user may perform further data cleansing in order to normalize the dimension of the sampled dataset. This may be done by removing spelling errors, blanks, duplications, etc.
- a continuous numeric attribute may not be suited for direct use as a dimension when aggregating data, because it holds too many distinct values and thus may interfere with the aggregation reducing the cardinality of the dataset.
- continuous values may come from sensors (e.g. temperature, velocity, acceleration) or position-tracking devices (e.g. GPS).
- Server-side discretization may thus serve at least two goals.
- One is to map a large set of source values into a manageable smaller set of dimension members, in order to reduce the volume of data that needs to be transferred to the client (e.g., from HADOOP to the calculation engine of LUMIRA).
- Another goal is to transform continuous attributes into discrete values that are better suited to LUMIRA desktop visualization.
- Rounding is the simplest form of discretization. This may involve reducing the number of unique values by reducing the number of significant digits.
- Rounding to the closest round figure is acceptable for most cases, except when round figures have a special business meaning and act as a kind of threshold.
- the weight of diamonds is measured in carats, and pricing is defined using weight segments such as 1.00-1.49 carats.
- a 1.04 carat diamond will be priced as a 1.00 carat stone, but a 0.96 carat diamond will likely be priced as a 0.9 carat stone. In such special cases, it may be better to round down to the smaller round figure (using the equivalent of the HANA ROUND_DOWN option).
- Such a formula should be provided as a predefined function.
- One example is as follows: signif(<value>, <digits>) to round <value> to its first <digits> significant digits.
- a helper or a predefined function may create a calculated dimension by just specifying the expected number of significant digits (to avoid having to create a formula involving Log10 and Round).
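A minimal sketch of such a helper follows, built from Log10 and Round as the text suggests. The function `signif` takes its name from the example above, while `round_down` is a hypothetical companion illustrating the round-down case for thresholds such as carat weights. Neither is the HANA or LUMIRA implementation.

```python
import math
from decimal import Decimal, ROUND_DOWN


def signif(value: float, digits: int) -> float:
    """Round `value` to its first `digits` significant digits."""
    if value == 0:
        return 0.0
    # Position of the leading digit, e.g. 3 for 1234.5 and -2 for 0.04567.
    exponent = math.floor(math.log10(abs(value)))
    # Python's round() accepts a negative number of decimal places.
    return round(value, digits - 1 - exponent)


def round_down(value: float, decimals: int) -> float:
    """Truncate `value` toward zero at `decimals` decimal places."""
    step = Decimal(10) ** -decimals  # e.g. Decimal('0.1') for one decimal place
    # Going through str() avoids binary floating-point surprises such as
    # 0.29 * 100 evaluating to 28.999999999999996.
    return float(Decimal(str(value)).quantize(step, rounding=ROUND_DOWN))


rounded = signif(1234.5, 2)         # two significant digits
small = signif(0.04567, 2)
heavy_stone = round_down(1.04, 1)   # priced as a 1.0 carat stone
light_stone = round_down(0.96, 1)   # priced as a 0.9 carat stone
```

Rounding down keeps a 0.96 carat value in the segment below the 1.00 threshold, whereas ordinary rounding to one decimal place would move it up to 1.0.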
- Binning is an alternate way of mapping a large set of values into a smaller set of intervals. In binning, however, the interval boundaries are not defined as round figures. Rather, the number of discrete values/intervals is fixed.
- a general binning algorithm is to:
- the mapping may be done using either the bin index or a value representative of the bin (such as its min, max or median value).
- Bins may be of equal width. This is the most common usage of binning, where the measure M is the input value itself, and the data after binning is directly suitable to be displayed as a histogram.
- Equal-width bins may not be well suited to data that is grouped into irregular intervals (e.g. a multimodal distribution with large gaps). Under such conditions, one option is to create bins containing about the same number of data points, which is a way to obtain equal-probability bins. A practical way of achieving this is to compute the rank of the input value (e.g. according to its ascending order), and to use the value rank as the binning measure.
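The two binning variants can be compared on a small, deliberately gapped list of values. This is a minimal sketch in which each value is mapped to a bin index; the function names and the data are hypothetical.

```python
from typing import List, Sequence


def equal_width_bins(values: Sequence[float], n_bins: int) -> List[int]:
    """Map each value to the index of one of n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land on index n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values: Sequence[float], n_bins: int) -> List[int]:
    """Bin on the ascending rank of each value instead of the value itself."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


# Clustered data with large gaps between clusters.
values = [1, 2, 3, 4, 100, 101, 102, 1000]
by_width = equal_width_bins(values, 4)
by_rank = equal_frequency_bins(values, 4)
```

With equal widths, seven of the eight values fall into the first bin and two bins stay empty; binning on the rank places two values in each bin.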
- LUMIRA supports fixed-width binning on the value itself, through a “Group by range” transform.
- the “Others” collector may be added, and low and high out-of-range values may be distinguished.
- the user may define multiple “Group by range” dimensions in order to apply filters having an impact on the min/max values.
- Logarithmic-scale binning is possible but may involve creation of an intermediate calculated dimension (which can be hidden).
- categorization maps source values into a small set of meaningful categories, turning the source variable into a categorical variable. For instance, a flight arrival delay could be categorized as:
- LUMIRA provides a “Group by selection” in the prepare room. This operates on discrete values (i.e., it allows a large set of discrete values to be transformed into a smaller set). Categorization of continuous variables can be achieved through a calculated dimension.
- Available functions may be constrained by the TREX function set. If the concise SQL CASE expression is unavailable, it can be emulated by nested IF THEN ELSE expressions.
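The nested IF THEN ELSE emulation can be sketched for the flight arrival delay example. The category names and minute thresholds below are hypothetical, since the original category list is not reproduced in this text, and the nesting is kept explicit to mirror the emulation of a CASE expression.

```python
def categorize_arrival_delay(delay_minutes: float) -> str:
    """Map a continuous arrival delay onto a small set of categories."""
    # Each ELSE branch holds the next IF, as in a nested IF THEN ELSE formula.
    if delay_minutes <= 0:
        return "On time or early"
    else:
        if delay_minutes <= 15:
            return "Slight delay"
        else:
            if delay_minutes <= 60:
                return "Moderate delay"
            else:
                return "Long delay"


delays = [-5, 0, 7.5, 15, 42, 180]
categories = [categorize_arrival_delay(d) for d in delays]
```

The resulting categorical values can then serve as a dimension in the same way as any other discrete column.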
- Grouping is another type of data manipulation which may be used to produce a wrangled dataset from the sampled data.
- once a value has been discretized into a small set of unique values, it can be used as a GROUP BY column in SQL aggregations to reduce the cardinality of the data set.
- it might be more efficient to materialize the discrete values into an extra column, prior to aggregation.
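The effect of grouping on a discretized column can be sketched in plain Python. The rows below are hypothetical sensor readings (a continuous temperature and an integer measure), and rounding to the nearest integer stands in for whichever discretization is chosen.

```python
from collections import defaultdict

# Hypothetical source rows: (temperature reading, measure).
rows = [(20.13, 5), (20.41, 7), (19.87, 3), (25.02, 4), (24.96, 6), (30.49, 2)]

# Materialize the discretized value into an extra column before aggregating.
materialized = [(temp, round(temp), measure) for temp, measure in rows]

# Equivalent of SUM(measure) ... GROUP BY the discretized column.
totals = defaultdict(int)
for _temp, temp_bucket, measure in materialized:
    totals[temp_bucket] += measure

distinct_before = len({temp for temp, _ in rows})
distinct_after = len(totals)
```

Six distinct readings collapse into three groups, which is the cardinality reduction that makes the aggregated result small enough to transfer.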
- discretization may be applied early in the calculation plans that produce data visualizations. In some cases, discretization may occur prior to the first aggregation nodes.
- a decision to discretize a continuous variable/attribute does not have to be made when sampling a large dataset.
- a random sample can be extracted with untransformed values.
- Suitable discretization (e.g., rounding, binning, categorization)
- the discretization can then be performed on the server side when pushing computations to the full dataset.
- a user employs the calculation engine to conduct data visualization analysis and/or storytelling. For example, the user may perform time enrichment, geographic enrichment, and data blending, and/or create customer hierarchy workflows. This is done utilizing the calculation plan 310 as described above.
- the user may iterate data visualizations 312 , change the chart, add filters, and perform other tasks, before ultimately settling on the appropriate visualizations that tell a cogent story for the sampled data.
- An icon may be provided to explain that the numbers simulate the full big data dataset, and that there is an error margin. SORT could be impacted by the error margin if two measures are close enough, and RANKING may be used.
- FIGS. 3D1 and 3D2 show the results of visualization based upon an initial sampling of a larger data set (here, flights to the ATL, SEA, and STL airports).
- FIG. 3D1 shows sampling the flight data on a quarterly basis.
- FIG. 3D2 shows sampling the flight data on a monthly basis.
- the next stage comprises validating and sharing the data of this story.
- a user typically wants to see the results on the entire dataset, and then share it with others.
- LUMIRA Team Server which embeds the calculation engine
- the minimal grouping sets extraction definition can be provided and schedule options defined to HADOOP.
- finish a job to the LUMIRA Team Server will be scheduled.
- Other jobs scheduled on the LUMIRA Team Server may be rescheduled.
- the grouping sets 320 are generated from the wrangled dataset.
- HADOOP returns the minimal grouping sets 320 from the full source dataset.
- the grouping sets are stored in a Parquet in HADOOP. This is also shown in FIG. 3E .
- Second step 362 comprises a wrangling operation and grouping set definition for each visualization in the story enhanced with filter control dimension. These are transformed as HADOOP SQL.
- HADOOP SQL is run to create a wrangled dataset and grouping set list from the source dataset.
- In a fourth step 364, the wrangled dataset and grouping set list are stored in HADOOP.
- In LUMIRA, the user refreshes the LUMIRA files (LUM) with the sample. If the LUMs have previously been published to LUMIRA Team Server, then LUMIRA desktop retrieves the grouping sets from HADOOP, provided that they have all been computed. This way, an end user can see the real data in LUMIRA desktop.
- the system queries for a new sample. This is similar to the initial acquisition phase discussed above.
- the calculation engine performs the initial and detailed operations according to the calculation plan.
- the visualizations 330 resulting from processing of the grouping sets produced by the calculation plan are then ultimately provided from the calculation engine to the user, via the LUMIRA interface.
- In LUMIRA Team Server, a user opens a LUM that has been published and scheduled.
- the LUMIRA Team Server opens the LUM file. This transparently triggers the loading of the grouping sets stored in the LUM file during scheduling into calculation engine, and then calculates all related visualizations.
- FIG. 3H is a simplified flow diagram illustrating a process 370 for an end-user to retrieve data and view a story on a full (rather than sampled) HADOOP dataset.
- In a first step 372, an end user opens a story that has been published/scheduled.
- In a third step 376, the grouping set is received from the HADOOP dataset by the in-memory database engine.
- In a fifth step 380, the first visualization is communicated to a user. Further visualizations based upon the calculation plan/grouping set(s) may also be communicated to a user.
- In step 382, the story is communicated to a user.
- FIG. 4 illustrates hardware of a special purpose computing machine configured to implement visualization of large data volumes according to an embodiment.
- computer system 400 comprises a processor 402 that is in electronic communication with a non-transitory computer-readable storage medium 403 of an interface layer.
- This computer-readable storage medium has stored thereon code 405 corresponding to a dataset sampled from a much larger dataset according to user instructions.
- Code 404 corresponds to an in-memory database engine. Code may be configured to reference data stored in a database of a non-transitory computer-readable storage medium, for example as may be present locally or in a remote database server.
- Software servers together may form a cluster or logical network of computer systems programmed with software programs that communicate with each other and work together in order to process requests.
- Computer system 510 includes a bus 505 or other communication mechanism for communicating information, and a processor 501 coupled with bus 505 for processing information.
- Computer system 510 also includes a memory 502 coupled to bus 505 for storing information and instructions to be executed by processor 501, including information and instructions for performing the techniques described above, for example.
- This memory may also be used for storing variables or other intermediate information during execution of instructions to be executed by processor 501. Possible implementations of this memory may be, but are not limited to, random access memory (RAM), read only memory (ROM), or both.
- A storage device 503 is also provided for storing information and instructions.
- Storage devices include, for example, a hard drive, a magnetic disk, an optical disk, a CD-ROM, a DVD, a flash memory, a USB memory card, or any other medium from which a computer can read.
- Storage device 503 may include source code, binary code, or software files for performing the techniques above, for example.
- Storage device and memory are both examples of computer readable mediums.
- Computer system 510 may be coupled via bus 505 to a display 512, such as a cathode ray tube (CRT) or liquid crystal display (LCD), for displaying information to a computer user.
- An input device 511 such as a keyboard and/or mouse is coupled to bus 505 for communicating information and command selections from the user to processor 501.
- The combination of these components allows the user to communicate with the system.
- Bus 505 may be divided into multiple specialized buses.
- Computer system 510 also includes a network interface 504 coupled with bus 505 .
- Network interface 504 may provide two-way data communication between computer system 510 and the local network 520 .
- The network interface 504 may be a digital subscriber line (DSL) or a modem to provide a data communication connection over a telephone line, for example.
- Another example of the network interface is a local area network (LAN) card to provide a data communication connection to a compatible LAN.
- Wireless links are another example.
- Network interface 504 sends and receives electrical, electromagnetic, or optical signals that carry digital data streams representing various types of information.
- Computer system 510 can send and receive information, including messages or other interface actions, through the network interface 504 across a local network 520, an Intranet, or the Internet 530.
- Computer system 510 may communicate with a plurality of other computer machines, such as server 515.
- Server 515 may form a cloud computing network, which may be programmed with processes described herein.
- Software components or services may reside on multiple different computer systems 510 or servers 531-535 across the network.
- The processes described above may be implemented on one or more servers, for example.
- A server 531 may transmit actions or messages from one component, through Internet 530, local network 520, and network interface 504 to a component on computer system 510.
- The software components and processes described above may be implemented on any computer system and send and/or receive information across a network, for example.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Probability & Statistics with Applications (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
-
- binning;
- grouping;
- categorization;
- segmentation; and
- pivoting.
y = POWER(10, FLOOR(LOG10(x)))
c = POWER(10, d−1)
result = ROUND(x*c/y)*y/c, or alternately result = ROUND(x/y, d−1)*y
signif(<value>, <digits>) to round <value> to its first <digits> significant digits.
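The rounding expressions above can be sketched in Python. This is only an illustration, not the patent's implementation: it assumes x is the value being rounded and d is the number of significant digits (as the signif(<value>, <digits>) form suggests). The handling of zero and negative inputs is an added assumption, since LOG10 is undefined there.

```python
import math


def signif(x: float, d: int) -> float:
    """Round x to its first d significant digits via the y/c scaling."""
    if d < 1:
        raise ValueError("d must be at least 1")
    if x == 0:
        return 0.0  # assumption: pass zero through, LOG10(0) is undefined
    # y is the largest power of ten not exceeding |x|.
    # abs() is an assumption added so that negative inputs work.
    y = 10.0 ** math.floor(math.log10(abs(x)))
    # c shifts d-1 further digits to the left of the decimal point.
    c = 10.0 ** (d - 1)
    # Note: Python's round() sends exact halves to the nearest even integer,
    # which may differ from a spreadsheet or SQL ROUND.
    return round(x * c / y) * y / c
```

For example, with x = 123456 and d = 2, y is 100000 and c is 10, so the value is scaled to 12.3456, rounded to 12, and scaled back to 120000.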
-
- early (delay < 0)
- on-time (delay >= 0 & delay < 5 min)
- late (delay >= 5 min & delay < 30 min)
- very late (delay >= 30 min & delay < 3 h)
- compensation-entitling (delay >= 3 h)
For a SQL data source, categorization can be expressed using a CASE expression.
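As a minimal sketch of such a CASE expression, the delay categories listed above can be run against an in-memory SQLite database from Python. The table name `trips`, the column `delay_min`, and the choice of minutes as the unit (so that 3 h becomes 180) are assumptions made for this illustration. The source does not specify a schema or a SQL dialect.

```python
import sqlite3

# Each WHEN is reached only if the earlier ones failed, so a single upper
# bound per branch is enough. Thresholds are in minutes (180 = 3 h).
CATEGORY_SQL = """
SELECT CASE
         WHEN delay_min < 0   THEN 'early'
         WHEN delay_min < 5   THEN 'on-time'
         WHEN delay_min < 30  THEN 'late'
         WHEN delay_min < 180 THEN 'very late'
         ELSE 'compensation-entitling'
       END AS category
FROM trips
ORDER BY rowid
"""


def categorize(delays_min):
    """Label each delay (in minutes) by running the CASE expression in SQLite."""
    conn = sqlite3.connect(":memory:")
    try:
        # NOT NULL keeps missing delays from silently falling into ELSE.
        conn.execute("CREATE TABLE trips (delay_min REAL NOT NULL)")
        conn.executemany(
            "INSERT INTO trips VALUES (?)", [(d,) for d in delays_min]
        )
        return [row[0] for row in conn.execute(CATEGORY_SQL)]
    finally:
        conn.close()
```

Calling `categorize([-3, 4.9, 30, 180])` returns one label per input delay, in input order.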
Claims (11)
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US14/575,633 US10459932B2 (en) | 2014-12-18 | 2014-12-18 | Visualizing large data volumes utilizing initial sampling and multi-stage calculations |
| EP15003481.7A EP3035211B1 (en) | 2014-12-18 | 2015-12-07 | Visualizing large data volumes utilizing initial sampling and multi-stage calculations |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US14/575,633 US10459932B2 (en) | 2014-12-18 | 2014-12-18 | Visualizing large data volumes utilizing initial sampling and multi-stage calculations |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| US20160179852A1 (en) | 2016-06-23 |
| US10459932B2 (en) | 2019-10-29 |
Family
ID=54843568
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US14/575,633 Active 2037-03-24 US10459932B2 (en) | 2014-12-18 | 2014-12-18 | Visualizing large data volumes utilizing initial sampling and multi-stage calculations |
Country Status (2)
| Country | Link |
|---|---|
| US (1) | US10459932B2 (en) |
| EP (1) | EP3035211B1 (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20200192895A1 (en) * | 2018-12-14 | 2020-06-18 | Tibco Software Inc. | Process control tool for processing big and wide data |
Families Citing this family (16)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9900360B1 (en) * | 2015-09-30 | 2018-02-20 | Quantcast Corporation | Managing a distributed system processing a publisher's streaming data |
| GB2549073B (en) | 2016-03-24 | 2020-02-26 | Imagination Tech Ltd | Generating sparse sample histograms |
| US20170337055A1 (en) * | 2016-05-23 | 2017-11-23 | International Business Machines Corporation | Summarized illustrative representation of software changes |
| CN106354772A (en) * | 2016-08-23 | 2017-01-25 | 成都卡莱博尔信息技术股份有限公司 | Mass data system with data cleaning function |
| CN106874381B (en) * | 2017-01-09 | 2020-12-22 | 重庆邮电大学 | A Hadoop-based Radio Environment Map Data Processing System |
| RU2669716C1 (en) * | 2017-05-12 | 2018-10-15 | Общество с ограниченной ответственностью "ВИЗЕКС ИНФО" | System and method for processing and analysis of large amounts of data |
| US10678826B2 (en) | 2017-07-25 | 2020-06-09 | Sap Se | Interactive visualization for outlier identification |
| US10726007B2 (en) | 2017-09-26 | 2020-07-28 | Microsoft Technology Licensing, Llc | Building heavy hitter summary for query optimization |
| CN108256028B (en) * | 2018-01-11 | 2021-09-28 | 北京服装学院 | Multi-dimensional dynamic sampling method for approximate query in cloud computing environment |
| CN109753496A (en) * | 2018-11-27 | 2019-05-14 | 天聚地合(苏州)数据股份有限公司 | A kind of data cleaning method for big data |
| US12248599B1 (en) * | 2019-07-11 | 2025-03-11 | Palantir Technologies Inc. | Centralized data retention and deletion system |
| CN110532300B (en) * | 2019-08-30 | 2021-11-05 | 南京大学 | A high-fidelity visualization method of big data for artificial intelligence data analysis |
| CN110704535B (en) * | 2019-09-26 | 2023-10-24 | 深圳前海微众银行股份有限公司 | Data binning method, device, equipment and computer readable storage medium |
| CN111538720B (en) * | 2020-03-12 | 2023-07-21 | 嘉陵江亭子口水利水电开发有限公司 | Method and system for cleaning basic data of power industry |
| CN113076373B (en) * | 2021-02-26 | 2023-11-21 | 广东科诺勘测工程有限公司 | Sea area flow field and dredging depth real-time hydrologic monitoring big data display and space query method and system |
| CN116821559B (en) * | 2023-07-07 | 2024-02-23 | 中国人民解放军海军工程大学 | Method, system and terminal for rapidly acquiring a group of big data centralized trends |
Citations (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5555403A (en) | 1991-11-27 | 1996-09-10 | Business Objects, S.A. | Relational database access system using semantically dynamic objects |
| US20060053103A1 (en) * | 2001-01-12 | 2006-03-09 | Microsoft Corporation | Database aggregation query result estimator |
| US20080306903A1 (en) * | 2007-06-08 | 2008-12-11 | Microsoft Corporation | Cardinality estimation in database systems using sample views |
| US20090018935A1 (en) | 2007-05-04 | 2009-01-15 | Sap Ag | Computerized method, computer program product and computer environment |
| US20090328175A1 (en) | 2008-06-24 | 2009-12-31 | Gary Stephen Shuster | Identity verification via selection of sensible output from recorded digital data |
| US20140156806A1 (en) * | 2012-12-04 | 2014-06-05 | Marinexplore Inc. | Spatio-temporal data processing systems and methods |
| US20140280042A1 (en) * | 2013-03-13 | 2014-09-18 | Sap Ag | Query processing system including data classification |
| US20140351233A1 (en) * | 2013-05-24 | 2014-11-27 | Software AG USA Inc. | System and method for continuous analytics run against a combination of static and real-time data |
| US20160078361A1 (en) * | 2014-09-11 | 2016-03-17 | Amazon Technologies, Inc. | Optimized training of linear machine learning models |
Non-Patent Citations (2)
| Title |
|---|
| Extended European Search Report for EP Application 15003481.7 dated Apr. 22, 2016, 8 pages. |
| Mankala, Chandrasekhar. SAP HANA Cookbook. Packt Publishing Ltd, 2013. (Year: 2013). * |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20200192895A1 (en) * | 2018-12-14 | 2020-06-18 | Tibco Software Inc. | Process control tool for processing big and wide data |
| US11727021B2 (en) * | 2018-12-14 | 2023-08-15 | Tibco Software Inc. | Process control tool for processing big and wide data |
Also Published As
| Publication number | Publication date |
|---|---|
| US20160179852A1 (en) | 2016-06-23 |
| EP3035211A1 (en) | 2016-06-22 |
| EP3035211B1 (en) | 2018-10-03 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US10459932B2 (en) | | Visualizing large data volumes utilizing initial sampling and multi-stage calculations |
| US10817534B2 (en) | | Systems and methods for interest-driven data visualization systems utilizing visualization image data and trellised visualizations |
| KR102627690B1 (en) | | Dimensional context propagation techniques for optimizing SKB query plans |
| US10963477B2 (en) | | Declarative specification of visualization queries |
| US20230350894A1 (en) | | Distinct value estimation for query planning |
| US9934299B2 (en) | | Systems and methods for interest-driven data visualization systems utilizing visualization image data and trellised visualizations |
| US10459940B2 (en) | | Systems and methods for interest-driven data visualization systems utilized in interest-driven business intelligence systems |
| US20160140174A1 (en) | | Data driven multi-provider pruning for query execution plan |
| US11809446B2 (en) | | Visualizing time metric database |
| CN104205039A (en) | | Interest-driven business intelligence systems and methods of data analysis using interest-driven data pipelines |
| US20180300332A1 (en) | | Dynamic aggregation for big data analytics |
| WO2013188795A2 (en) | | Simplified interaction with complex database |
| US12353406B2 (en) | | Data aggregator graphical user interface |
| CN105843862A (en) | | Method for establishing crop disease and pest remote sensing and forecasting system and remote sensing and forecasting system |
| EP3007078A1 (en) | | Multivariate insight discovery approach |
| EP4399614A1 (en) | | System and method for generating automatic insights of analytics data |
| US10311049B2 (en) | | Pattern-based query result enhancement |
| US20130268855A1 (en) | | Examining an execution of a business process |
| US20140136274A1 (en) | | Providing multiple level process intelligence and the ability to transition between levels |
| Subotić et al. | | Olap tools in education |
| US20240028250A1 (en) | | Dynamic update of consolidated data based on granular data values |
| US20150006579A1 (en) | | Custom grouping of multidimensional data |
| Fana et al. | | Similarity: Data Warehouse Design with ETL Method (Extract, Transform, And Load) for Company Information Centre |
| Kazi et al. | | MOLAP data warehouse of a software products servicing Call center |
| Vitanov et al. | | Process of Creating and Using Data Warehouse in a Wholesale |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | AS | Assignment | Owner name: SAP SE, GERMANY Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:NAIBO, ALEXIS;XUE, XIAOHUI;LE BIANNIC, YANN;REEL/FRAME:034550/0536 Effective date: 20141218 |
| | AS | Assignment | Owner name: BUSINESS OBJECTS SOFTWARE LTD., IRELAND Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:NAIBO, ALEXIS;XUE, XIAOHUI;LE BIANNIC, YANN;REEL/FRAME:034942/0798 Effective date: 20150210 Owner name: SAP SE, GERMANY Free format text: NULLIFICATION OF ASSIGNMENT OF PATENT APPLICATION;ASSIGNORS:NAIBO, ALEXIS;XUE, XIAOHUI;LE BIANNIC, YANN;REEL/FRAME:034963/0510 Effective date: 20150210 |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE AFTER FINAL ACTION FORWARDED TO EXAMINER |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
| | STPP | Information on status: patent application and granting procedure in general | Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
| | STCF | Information on status: patent grant | Free format text: PATENTED CASE |
| | MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 4 |