KR101629395B1 - apparatus for analyzing data, method of analyzing data and storage for storing a program analyzing data - Google Patents
- Publication number
- KR101629395B1 (application KR1020150186539A)
- Authority
- KR
- South Korea
- Prior art keywords
- data
- analysis
- tensor
- user
- distributed system
- Prior art date
Classifications
- G06F17/30592
- G06F17/30318
- G06F17/30569
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
The present invention relates to a data analysis apparatus for analyzing big data, a data analysis method, and a storage medium storing a program for analyzing big data.
Big data is a term that refers to almost all kinds of modern data that are large in volume, varied in form, and generated or updated at high speed, but are not necessarily structured. Structured data is data stored in fixed fields, as in a relational database or a spreadsheet. Unstructured data is data that is not stored in fixed fields, for example image, video, and audio data. Semi-structured data is data that is not stored in fixed fields but includes metadata or a schema, for example XML or HTML text. Big data is sometimes also referred to as large data.
Among big data, scientific data contains observations or measurement results of various phenomena in the real world, and often has a multi-dimensional array structure. For example, when data are constructed by measuring the temperature and the air pressure by region and by time zone nationwide, the data have a four-dimensional structure of (place, time, temperature, air pressure). In mathematics and physics, such a multidimensional structure is expressed and analyzed with a model called a tensor. That is, a tensor here means a multi-dimensional array of three or more dimensions.
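The four-dimensional (place, time, temperature, air pressure) structure described above can be illustrated with a minimal sketch. It assumes that each continuous measurement is mapped to an equal-width bin and that each tensor cell counts the observations falling into it; the value ranges, the bin counts, and the names `bin_index` and `build_tensor` are hypothetical and do not come from the patent.

```python
from collections import Counter


def bin_index(value, low, high, steps):
    """Map a value in [low, high] to one of `steps` equal-width bins (0-based)."""
    if not low <= value <= high:
        raise ValueError("value outside the assumed measurement range")
    return min(int((value - low) / (high - low) * steps), steps - 1)


def build_tensor(observations,
                 temp_range=(-30.0, 50.0), temp_steps=40,
                 pres_range=(950.0, 1050.0), pres_steps=20):
    """Build a sparse 4-D tensor {(place, hour, temp_bin, pres_bin): count}."""
    tensor = Counter()
    for place, hour, temperature, pressure in observations:
        key = (place, hour,
               bin_index(temperature, *temp_range, temp_steps),
               bin_index(pressure, *pres_range, pres_steps))
        tensor[key] += 1  # assumed cell value: number of observations
    return dict(tensor)


# Hypothetical observations: (place id, hour index, temperature, pressure).
observations = [(0, 0, 10.0, 1000.0), (0, 0, 10.5, 1001.0), (3, 5, -30.0, 950.0)]
print(build_tensor(observations))
```

Only the cells that actually received observations are stored, which anticipates the sparse representation discussed later in the description.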
Such multidimensional data represent how data arise across the dimensions and are used to analyze the relationships and patterns among the dimensions. However, analysis based on these multidimensional data representations has generally been limited to small-scale analyses. For example, assuming the above four-dimensional data, if the 'place' is divided into 200 arbitrary points, the 'time' is divided into hourly units over one week (24 * 7 = 168), the temperature is expressed in 40 steps, and the pressure is expressed in 20 steps, the data constitute a space of R^(200 * 168 * 40 * 20).
In particular, since most scientists' data analysis environments are limited to single machines, it has been impossible to generate or analyze data with more dimensions or with a higher resolution in each dimension. For example, suppose that, using satellite image data, the world is divided into a 1000 * 1000 grid, one year is divided into hourly units (365 * 24 = 8760), the concentration of red tides occurring in the ocean is expressed in 100 steps, and the ocean salinity and one further measured quantity are each expressed in 50 steps. This requires a data space of 1000000 * 8760 * 100 * 50 * 50 * 4 bytes (assuming a 4-byte float) = 7.8 petabytes. At the current level of technology, storage on this scale requires building a data center.
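The storage figures in the two examples above can be checked with a short calculation. The sketch below only multiplies the factors given in the text; the helper name `dense_bytes` is illustrative, and the conversion to petabytes assumes binary units (2^50 bytes), which is the reading under which the stated 7.8 figure is reproduced.

```python
from math import prod

BYTES_PER_FLOAT = 4  # the text assumes a 4-byte float per element


def dense_bytes(shape, bytes_per_value=BYTES_PER_FLOAT):
    """Bytes needed to store every cell of a dense array of the given shape."""
    return prod(shape) * bytes_per_value


# Single-machine example: place x time x temperature x pressure.
small = (200, 168, 40, 20)
# Satellite example, using the factors exactly as they appear in the product.
large = (1_000_000, 8760, 100, 50, 50)

print(prod(small))                 # number of cells in the small space
print(dense_bytes(large))          # bytes required for the large space
print(dense_bytes(large) / 2**50)  # the same size in units of 2**50 bytes
```

The small space has about 26.9 million cells, which fits on one machine, while the large one needs roughly 8.76 * 10^15 bytes.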
As shown in FIG. 1, if the resolution of each dimension is increased from 10 to 100, the required data space increases exponentially, by 10^6 times, even though the actual data remain the same 100 data entries (non-zero = 1000). In particular, as the number of dimensions and the resolution increase, the density of the data in the entire multidimensional space (density = # nonzeros / # total_elements) decreases sharply and the sparsity (# zeros / # total_elements) increases. Thus, most of the data space is consumed unnecessarily, and the cost of analyzing that data space increases unnecessarily.
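The density and sparsity definitions given above can be written out directly. The following sketch assumes a three-dimensional space and a fixed count of 100 non-zero entries purely for illustration; neither value is taken from FIG. 1.

```python
from math import prod


def density(nnz, shape):
    """density = # nonzeros / # total_elements"""
    return nnz / prod(shape)


def sparsity(nnz, shape):
    """sparsity = # zeros / # total_elements"""
    total = prod(shape)
    return (total - nnz) / total


nnz = 100  # assumed number of non-zero entries, kept fixed
for resolution in (10, 100):
    shape = (resolution,) * 3  # assumed three dimensions
    print(shape, density(nnz, shape), sparsity(nnz, shape))
```

With the number of entries held constant, raising the per-dimension resolution from 10 to 100 lowers the density from 0.1 to 0.0001 in this example.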
Even MATLAB, a popular tool for scientific data analysis, has poor processing capability for multi-dimensional sparse data, which is why separate extension libraries such as the Tensor Toolbox have been developed; even so, it cannot process large-scale data.
It is an object of the present invention to provide a data analysis apparatus, a data analysis method, and a storage medium for storing a data analysis program for effectively supporting analysis of multi-dimensional data (tensor).
Another object of the present invention is to provide a data analysis apparatus, a data analysis method, and a storage medium storing a data analysis program for efficiently managing and analyzing multidimensional data by exploiting the characteristic of scientific data that most of the data are sparse.
It is still another object of the present invention to provide a data analysis apparatus, a data analysis method, and a storage medium storing a data analysis program for efficiently managing and analyzing multi-dimensional data using an external linking method.
According to an aspect of the present invention, there is provided a method of analyzing data using a user analysis tool and a remotely located distributed system, the method comprising: when tensor data is remotely transmitted from the user analysis tool, converting the tensor data into one of a dense type and a sparse type, distributing the data to the distributed system, and storing the data; recording metadata of the stored data; performing analysis of the stored data using an analysis algorithm and the distributed system when an analysis command is transmitted from the user analysis tool; and, when the data analysis has been performed, storing the analysis result in the distributed system and/or transmitting the analysis result to the user analysis tool.
The data conversion step comprises: calculating the total amount of data required to express the tensor data in the sparse type; calculating the total data space required to express the tensor data in the dense type; calculating the ratio of the total amount of data required for the sparse type to the total data space required for the dense type; and converting the tensor data into sparse-type data if the ratio is smaller than the threshold value, and into dense-type data if the ratio is not smaller than the threshold value.
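The conversion rule above can be sketched as follows. The patent does not state how the two sizes are computed, so the cost model here is an assumption: a sparse entry is taken to cost one index per dimension plus one value, a dense cell one value, each 4 bytes. The function names and the default threshold of 1.0 are likewise hypothetical.

```python
from math import prod


def sparse_bytes(nnz, ndim, value_bytes=4, index_bytes=4):
    """Assumed sparse cost: each non-zero stores ndim indices plus its value."""
    return nnz * (ndim * index_bytes + value_bytes)


def dense_bytes(shape, value_bytes=4):
    """Dense cost: every cell of the space stores one value."""
    return prod(shape) * value_bytes


def choose_storage_type(nnz, shape, threshold=1.0):
    """Return 'sparse' if (sparse size / dense size) < threshold, else 'dense'."""
    ratio = sparse_bytes(nnz, len(shape)) / dense_bytes(shape)
    return "sparse" if ratio < threshold else "dense"


print(choose_storage_type(100, (100, 100, 100)))  # few entries in a large space
print(choose_storage_type(900, (10, 10, 10)))     # almost every cell is filled
```

Under these assumptions the first tensor is stored as sparse (ratio 0.0004) and the second as dense (ratio 3.6), since the coordinate overhead outweighs the savings once most cells are occupied.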
The method may further comprise listing or retrieving data stored in the distributed system according to a request from the user analysis tool.
The analyzing step may further include converting part or all of the analysis algorithm into an Einstein summation form.
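In Einstein summation notation, an index that appears in two operands is summed over, so a matrix product is written C[i,k] = A[i,j] * B[j,k]. The patent does not describe how its engine evaluates such expressions; the sketch below is one minimal way to evaluate this single contraction over sparse coordinate dictionaries, and the function name is hypothetical.

```python
from collections import defaultdict


def einsum_ij_jk(a, b):
    """Contract C[i,k] = sum_j A[i,j] * B[j,k] for sparse {(row, col): value} inputs."""
    # Group the entries of B by the shared index j so only matching pairs are visited.
    b_by_j = defaultdict(list)
    for (j, k), value in b.items():
        b_by_j[j].append((k, value))

    c = defaultdict(float)
    for (i, j), a_value in a.items():
        for k, b_value in b_by_j.get(j, ()):
            c[(i, k)] += a_value * b_value
    return dict(c)


# A = [[1, 0], [3, 4]] and B = [[0, 2], [5, 0]] in coordinate form (0-based).
a = {(0, 0): 1.0, (1, 0): 3.0, (1, 1): 4.0}
b = {(0, 1): 2.0, (1, 0): 5.0}
print(einsum_ij_jk(a, b))
```

Because only stored entries are multiplied, the work is proportional to the number of matching non-zero pairs rather than to the size of the full space.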
A data analysis apparatus according to an embodiment of the present invention includes: a user analysis unit having a user analysis tool; a distributed system for distributing and storing data; and a large-capacity tensor analysis engine that converts tensor data transmitted remotely from the user analysis unit into one of a dense type and a sparse type on the basis of a preset threshold value, distributes the data to the distributed system and stores it, records metadata of the stored data, and analyzes the stored data by performing an analysis algorithm when an analysis command is transmitted from the user analysis unit.
The large capacity tensor analysis engine may store the analysis result of the data in the distributed system and / or transmit it to the user analysis unit.
The large-capacity tensor analysis engine includes: a data manager for recording the metadata and for listing or searching for data stored in the distributed system as requested by the user analysis unit; a data conversion unit for converting the tensor data transmitted remotely from the user analysis unit into one of a dense type and a sparse type based on a preset threshold value, distributing the data to the distributed system, and storing the data; an analysis library providing a library for performing the analysis algorithm; and a tensor analysis engine that converts part or all of the analysis algorithm into an Einstein summation form to perform the data analysis.
The data conversion unit calculates the total amount of data required to express the tensor data in the sparse type, calculates the total data space required to express the tensor data in the dense type, and converts the tensor data into sparse-type data if the ratio of the total amount of data required for the sparse type to the total data space required for the dense type is smaller than the threshold value, and into dense-type data if the ratio is not smaller than the threshold value.
The storage medium according to an exemplary embodiment of the present invention stores a program that, when tensor data is remotely transmitted from the user analysis tool, converts the tensor data into one of a dense type and a sparse type based on a predetermined threshold value, distributes the data to a distributed system and stores it, and analyzes the stored data by performing an analysis algorithm when an analysis command is transmitted from the user analysis tool.
The data analysis apparatus, the data analysis method, and the storage medium storing the data analysis program according to the present invention have the following advantages from the viewpoint of the user.
First, users who use user analysis tools can easily outsource the data management and analysis functions of multidimensional data.
Second, the series of steps of sending data remotely to the external analysis system, storing and managing it, performing the analysis, and importing the results into the user analysis tool can be carried out easily, so that the load concentrated on the user analysis tool can be reduced.
Third, tensor data is converted into and managed as sparse-type or dense-type data, either by direct command of the user or by an automated method, the analysis algorithm corresponding to the chosen type is selected, and the data is converted into the optimal form for network transmission. This solves the problem of sparse data being expressed unnecessarily in the dense type, which wastes memory and computing time. As a result, optimized data management minimizes the user's data analysis time and significantly reduces the cost of building a system for processing tensor data of the same size.
Fourth, users are connected to a server, which is an external analysis system, by using a user analysis tool for each user, and a plurality of users simultaneously access and share the server, so that data can be shared among users. In particular, in the complex analysis of large-scale multidimensional data, users can simultaneously utilize data analysis tasks, allowing scientists to conduct collaborative research, regardless of geography.
The data analysis apparatus, the data analysis method, and the storage medium storing the data analysis program according to the present invention have the following advantages in terms of system performance.
First, by distributing and processing the tensor data in the large-capacity tensor analysis system, it is possible to solve the problem of data management / excessive computation concentration in a single machine-based user analysis tool environment, and large-scale data processing can be realized.
Second, based on a general purpose cluster-based distributed data processing platform such as Spark / Hadoop, the scale of the cluster can be scaled up to easily extend its data management / computation capabilities.
Third, since it has a client-server structure and multiple users can access and use it at the same time, it is possible to maximize the utilization of resources such as clusters.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic drawing showing an example of dense data and sparse data existing in a multidimensional space.
FIG. 2 is a block diagram showing an embodiment of a data analysis apparatus according to the present invention.
FIG. 3 is a block diagram showing an embodiment of a data analysis apparatus for storing data in an external system using an external interlocking method according to the present invention.
FIG. 4 is a flowchart showing an embodiment of a method of storing data in an external system using an external interlocking method in the data analysis apparatus of FIG.
FIG. 5 is a block diagram showing an embodiment of a data analyzing apparatus for analyzing data stored in an external system using an external interlocking method according to the present invention.
FIG. 6 is a flowchart showing an embodiment of a method of analyzing data stored in an external system using an external interlocking method in the data analysis apparatus of FIG.
Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. The structure and operation of the present invention shown in the drawings and described by the drawings are described as at least one embodiment, and the technical ideas and the core structure and operation of the present invention are not limited thereby.
The terms used in this specification are, as far as possible, general terms that are currently in wide use, selected in consideration of their functions in the present invention, but they may vary depending on the intention or custom of a person skilled in the art or the emergence of new technologies. In certain cases, a term may have been chosen arbitrarily by the applicant, in which case its meaning is described in detail in the relevant part of the description. Accordingly, the terms used herein should be interpreted not simply by their names but on the basis of their actual meaning and the content of this specification as a whole.
In addition, structural and functional descriptions specific to the embodiments of the present invention disclosed herein are illustrated for the purpose of describing an embodiment according to the concept of the present invention only, and embodiments according to the concept of the present invention May be embodied in various forms and should not be construed as limited to the embodiments set forth herein.
Embodiments in accordance with the concepts herein may be made in various manners and may take various forms, so that specific embodiments are illustrated in the drawings and described in detail herein. It is to be understood, however, that it is not intended to limit the embodiments consistent with the concepts herein to the particular forms disclosed, but includes all modifications, equivalents, and alternatives falling within the spirit and scope of the disclosure.
In this specification, terms such as first and/or second may be used to describe various components, but the components should not be limited by these terms. The terms serve only to distinguish one component from another; for example, without departing from the scope of rights under the concept of the present disclosure, a first component may be referred to as a second component, and similarly a second component may be referred to as a first component.
Also, throughout the specification, when a part is said to "include" an element, this means that it may further include other elements, rather than excluding them, unless specifically stated otherwise. Terms such as "part(s)" in the specification mean units for processing at least one function or operation, which may be implemented by hardware, software, or a combination of hardware and software.
In the present invention, the term 'tensor' refers to a multidimensional array of three or more dimensions, and is used in the same meaning as multidimensional data and tensor data.
FIG. 2 shows an example of a data analysis apparatus that supports multidimensional scientific data analysis according to the present invention and interworks with an external analysis system to expand the processing capability of a user's analysis tool (e.g., MATLAB, R, etc.).
2 shows the
The
The
The
The
The
That is, the
In one embodiment, the
Also, the
Meanwhile, in the present invention, the
The first and
In other words, the
The large-capacity
The
The
Generally, the dense type expresses a matrix as data in a one-dimensional form together with its dimensions, for example (data = (1,0,3,4), row = 2, col = 2) for a two-dimensional 2-by-2 matrix. In contrast, the sparse type expresses the same matrix in the form of a list of coordinate-value pairs, as in {((1,1), 1), ((2,1), 3), ((2,2), 4)}. Specifically, a specified value (here, 0) is recognized as the value of the empty space and is not expressed. For example, ((1,2), 0) can be omitted from the sparse-type representation, which reduces the overall data. However, since the sparse type expresses coordinate values for each data entry, additional data space is consumed. Such a technique is a commonly used method, but in the present invention, the
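The two representations in the 2-by-2 example can be converted into each other with a few lines of code. This sketch assumes the one-dimensional data is laid out row by row and that coordinates are 1-based, which is the reading consistent with the pairs ((2,1), 3) and ((1,2), 0) in the text; the function names are illustrative.

```python
def dense_to_sparse(data, rows, cols, fill=0):
    """Row-major flat data -> {(row, col): value}, 1-based, skipping the fill value."""
    if len(data) != rows * cols:
        raise ValueError("data length does not match rows * cols")
    sparse = {}
    for position, value in enumerate(data):
        if value != fill:
            row, col = divmod(position, cols)
            sparse[(row + 1, col + 1)] = value
    return sparse


def sparse_to_dense(sparse, rows, cols, fill=0):
    """Inverse conversion: every coordinate that is not listed takes the fill value."""
    data = [fill] * (rows * cols)
    for (row, col), value in sparse.items():
        data[(row - 1) * cols + (col - 1)] = value
    return tuple(data)


sparse = dense_to_sparse((1, 0, 3, 4), rows=2, cols=2)
print(sparse)
print(sparse_to_dense(sparse, rows=2, cols=2))
```

The sparse form drops the zero at (1,2) but stores two coordinates with every remaining value, which is the extra space the paragraph refers to.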
To this end, the
As described above, the
The
The
The distributed
In the present invention, the distributed
Hadoop has four main uses. The first is search engine indexing, the second is data analysis or statistical analysis, the third is data pre-processing (table precomputation and rollup), and the fourth is structured data storage.
The Hadoop distributed file system divides a file into blocks, replicates each block several times, and stores the replicas on a plurality of nodes (servers). As a result, when some hardware malfunctions, the replicas held on other servers are used instead and the data is restored automatically. In addition, even a single very large file can be divided and stored across several nodes, which makes the file system very effective for storing unstructured data.
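The block splitting and replication described above can be pictured with a toy model. This is not the HDFS API and not its actual placement policy (real HDFS placement is rack-aware); the round-robin rule, the function names, and the node names are assumptions made only to illustrate the idea.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=3):
    """Toy round-robin placement: each block is assigned `replication` distinct nodes."""
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def surviving_copies(placement, failed_node):
    """Nodes that still hold each block after one node fails."""
    return {block: [n for n in holders if n != failed_node]
            for block, holders in placement.items()}


blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], replication=2)
print(blocks)
print(placement)
print(surviving_copies(placement, "n2"))
```

In this example every block still has at least one copy after the node `n2` is lost, which is the property that allows automatic recovery.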
HBase is a NoSQL distributed DBMS (database management system) for supporting the Hadoop distributed file system, where NoSQL refers to a non-relational database.
MapReduce is software that uses the Map and Reduce functions in a distributed environment such as the Hadoop distributed file system to process large amounts of data efficiently, minimize network traffic, and execute the necessary tasks while recovering automatically from failures. In other words, MapReduce is one of the big data processing technologies for handling large data sets: it processes the data stored in the distributed file system in a distributed manner so that a large amount of data can be processed in a short time.
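The Map and Reduce steps can be illustrated with a single-process sketch. It is not the Hadoop API: the three phase functions, the sample tensor entries, and the choice of summing values per first-dimension index are assumptions used to show the data flow that a cluster would run in parallel.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the emitted (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical sparse tensor entries: (coordinates, value).
entries = [((0, 1, 2), 1.5), ((0, 3, 1), 2.5), ((1, 0, 0), 4.0)]


def mapper(entry):
    coords, value = entry
    yield coords[0], value  # key on the first dimension


def reducer(key, values):
    return sum(values)


result = reduce_phase(shuffle(map_phase(entries, mapper)), reducer)
print(result)
```

Each mapper call depends only on its own record and each reducer call only on its own key, which is what allows the work to be spread across nodes.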
That is, because the present invention is based on a general-purpose cluster-based distributed data processing platform such as Spark / Hadoop, its data management and computation capabilities can easily be expanded by enlarging the cluster. In addition, since it has a client-server structure, multiple users can access and use it at the same time, maximizing the utilization of resources such as clusters.
3 is an embodiment of the data analysis apparatus of the present invention for moving and processing the multidimensional data (tensor) of the
That is, the
The
The distributed
Also, the large
If there is a remote request from the user analysis tool 110 (S406), the
At this time, the large capacity
FIG. 5 is a flow chart illustrating a method of browsing tensor data moved to the
That is, the
To this end, the
Here, the analysis algorithm may be called by the
In the Einstein summation processing method, sparse-type tensor data and dense-type tensor data are selectively designated and analyzed.
The large-capacity
In this way, users who use the
Also, since the user can access the
In particular, tensor data is distributed and processed by an external large-capacity tensor analysis engine, thereby solving the problem of data management / excessive computation concentration in a single machine-based user analysis tool environment and realizing large-scale data processing.
The features, structures, effects and the like described in the embodiments are included in at least one embodiment of the present invention and are not necessarily limited to only one embodiment. Furthermore, the features, structures, effects and the like illustrated in the embodiments can be combined and modified by other persons skilled in the art to which the embodiments belong. Therefore, it should be understood that the present invention is not limited to these combinations and modifications.
While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it is to be understood that the invention is not limited to the disclosed embodiments and that, on the contrary, various modifications and applications are possible. For example, each component specifically shown in the embodiments can be modified and implemented. It is to be understood that all changes and modifications that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
100: user analysis unit 110: user analysis tool
120: first connector 200: external analysis section
210: second connector 220: large capacity tensor analysis engine
221: Mass Data Manager 223: Data Conversion Unit
225: analysis library 227: tensor analysis engine
230: Distributed system
Claims (20)
Converting the tensor data into one of a dense type and a sparse type based on a predetermined threshold value when the tensor data is remotely transmitted from the user analysis tool, distributing the data to a distributed system, and storing the data;
Recording metadata of the stored data;
Performing analysis of the stored data using an analysis algorithm and the distributed system when an analysis command is transmitted from the user analysis tool; And
And when the data analysis is performed, storing the analysis result in the distributed system and / or transmitting the result to the user analysis tool.
Calculating a total data amount required when the tensor data is expressed in the sparse type;
Calculating a total data space required when the tensor data is expressed in the dense type;
Calculating a ratio of the total data amount to the total data space; and
Converting the tensor data into sparse-type data if the ratio is smaller than the threshold value and converting the tensor data into dense-type data if the ratio is not smaller than the threshold value.
Wherein the threshold value is designated by the user or designated automatically.
Wherein the metadata includes at least one of a name of a tensor to be managed, an owner name, a creation date and time, an access right, a dimension, a density, a sparsity, a location of data stored in the distributed system, .
Further comprising listing or retrieving data stored in the distributed system upon request from the user analysis tool.
Further comprising the step of converting some or all of the analysis algorithm into an Einstein summation form.
Wherein, when the analysis algorithm is performed in the Einstein summation form, the dense-type data and the sparse-type data are selectively designated and processed.
A distributed system for distributing and storing data; And
A large-capacity tensor analysis engine that converts the tensor data transmitted remotely from the user analysis unit into one of a dense type and a sparse type on the basis of a preset threshold value, distributes the data to the distributed system and stores it, records metadata of the stored data, and analyzes the stored data by performing an analysis algorithm when an analysis command is transmitted from the user analysis unit.
Wherein the analysis result is stored in the distributed system and / or transmitted to the user analysis unit.
A data manager for recording the metadata, listing up the data stored in the distributed system or performing a search according to a request from the user analysis unit;
A data conversion unit for converting the tensor data transmitted remotely from the user analysis unit into one of a dense type and a sparse type based on a preset threshold value, distributing the data to the distributed system, and storing the data;
An analysis library providing a library for performing the analysis algorithm; and
A tensor analysis engine that converts part or all of the analysis algorithm into an Einstein summation form to perform the data analysis.
Calculating a total data amount required when the tensor data is expressed in the sparse type, calculating a total data space required when the tensor data is expressed in the dense type, calculating a ratio of the total data amount to the total data space, and converting the tensor data into sparse-type data if the ratio is smaller than the threshold value and into dense-type data if the ratio is not smaller than the threshold value.
Wherein the threshold value is designated by the user or designated automatically.
Wherein the metadata includes at least one of a name of a tensor to be managed, an owner, a creation date and time, an access right, a dimension, a density, a sparsity, a location of data stored in the distributed system, .
Wherein, when the analysis algorithm is performed in the Einstein summation form, the dense-type data and the sparse-type data are selectively designated and processed.
Wherein the analysis algorithm is called from the analysis library.
Wherein the library provided by the analysis library includes data clustering and pattern analysis.
Wherein the analysis algorithm is extended using a user defined function (UDF) generated by a user in the user analysis unit.
And the plurality of user analysis units are simultaneously connected to the large capacity tensor analysis engine to share the data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020150186539A KR101629395B1 (en) | 2015-12-24 | 2015-12-24 | apparatus for analyzing data, method of analyzing data and storage for storing a program analyzing data |
Publications (1)
Publication Number | Publication Date |
---|---|
KR101629395B1 true KR101629395B1 (en) | 2016-06-13 |
Family
ID=56191391
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020150186539A KR101629395B1 (en) | 2015-12-24 | 2015-12-24 | apparatus for analyzing data, method of analyzing data and storage for storing a program analyzing data |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR101629395B1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101859845B1 (en) * | 2016-11-23 | 2018-05-21 | 한국과학기술원 | Offline Friend Recommendation using Mobile Context and Online Friend Network Information based on Tensor Factorization |
KR20190060600A (en) | 2017-11-24 | 2019-06-03 | 서울대학교산학협력단 | Apparatus for supporting multi-dimensional data analysis through parallel processing and method for the same |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20140048396A (en) * | 2012-10-11 | 2014-04-24 | 주식회사 케이티 | System and method for searching file in cloud storage service, and method for controlling file therein |
-
2015
- 2015-12-24 KR KR1020150186539A patent/KR101629395B1/en active IP Right Grant
Legal Events
Date | Code | Title | Description |
---|---|---|---|
E701 | Decision to grant or registration of patent right | ||
GRNT | Written decision to grant |