KR101629395B1

KR101629395B1 - apparatus for analyzing data, method of analyzing data and storage for storing a program analyzing data

Info

Publication number: KR101629395B1
Application number: KR1020150186539A
Authority: KR
Inventors: 이용; 박경석; 이경하; 엄정호; 이상환
Original assignee: 한국과학기술정보연구원
Priority date: 2015-12-24
Filing date: 2015-12-24
Publication date: 2016-06-13

Abstract

Disclosed are an apparatus and a method for analyzing big data and a storage medium for storing a program for analyzing the data. The method for analyzing the data by using a distributed system located remotely from a user analysis tool may include: a step of converting tensor data to either a dense type data or a sparse type data based on a pre-established critical value when the tensor data are remotely transmitted from the user analysis tool and then distributing the data to the distributed system and storing the data in the distributed system; a step of recording meta data with regard to the stored data; a step of performing an analysis of the stored data by using an analysis algorithm and the distributed system after an analysis command is transmitted from the user analysis tool; and a step of storing an analysis result in the distributed system and/or transmitting the analysis result to the user analysis tool after the data are analyzed.

Description

[0001] The present invention relates to a data analysis apparatus for analyzing big data, a data analysis method, and a storage medium storing a program for analyzing big data.

Big data is a term that refers to almost all kinds of modern data that are large in quantity, varied in form, and generated or updated at high speed, including data that are not structured. Structured data are data stored in fixed fields, as in a relational database or a spreadsheet. Unstructured data are data that are not stored in fixed fields, such as image, video, and audio data. Semi-structured data are data that are not stored in fixed fields but include metadata or a schema, for example XML or HTML text. Big data is sometimes referred to as large data.

Among big data, scientific data contain measurements and observations of various phenomena in the real world, and often have a multi-dimensional array structure. For example, when data are constructed by measuring the temperature and the air pressure by region and by time period nationwide, the data have a four-dimensional structure of (place, time, temperature, air pressure). In mathematics and physics, this multidimensional structure is expressed and analyzed with a model called a tensor. That is, a tensor means a multi-dimensional array of three or more dimensions.

These multidimensional data represent how data arise across the dimensions and are used to analyze the relationships and patterns among the dimensions. However, analysis based on these multidimensional data representations has generally been limited to small-scale analyses. For example, assuming the above four-dimensional data, if 'place' is divided into 200 arbitrary points, 'time' is expressed in hourly units over one week (24 * 7 = 168), temperature is expressed in 40 steps, and pressure is expressed in 20 steps, the data constitute an $\mathbb{R}^{200 \times 168 \times 40 \times 20}$ space.

In particular, since most scientists' data analysis environments are limited to single machines, it has been impossible to create or analyze data with more dimensions or at higher resolution. For example, suppose that, using satellite image data, the world is divided into a 1000 * 1000 grid, time is sampled hourly over one year (365 * 24 = 8760), the concentration of red tides occurring in the ocean is expressed in 100 steps, and two further quantities, one of them ocean salinity, are each expressed in 50 steps. This requires a data space of 1000000 * 8760 * 100 * 50 * 50 * 4 bytes (assuming a 4-byte float) = 7.8 petabytes. At the current level of technology, storage on this scale requires building a data center.
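The arithmetic in this example can be checked with a short script. This is only an illustrative sketch: the variable names are hypothetical, and the result is reported in binary petabytes (PiB), which is an assumption about how the 7.8 figure was rounded.

```python
from math import prod

# Resolution of each dimension, taken from the example in the text:
# 1000 * 1000 grid cells, 8760 hourly steps, then 100, 50 and 50 value steps.
shape = (1000 * 1000, 8760, 100, 50, 50)
BYTES_PER_FLOAT = 4  # 4-byte float, as assumed in the text

total_elements = prod(shape)
dense_bytes = total_elements * BYTES_PER_FLOAT
dense_pib = dense_bytes / 2**50  # binary petabytes (assumed unit)

print(f"{total_elements:.3e} elements, {dense_pib:.2f} PiB")
```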

As shown in FIG. 1, if the resolution of each dimension is changed from 10 to 100, the required data space increases exponentially, by 10⁶ times, even though the actual data are the same 100 data entries (non-zero = 1000). In particular, as the number of dimensions and the resolution increase, the density of data (density = #nonzeros / #total_elements) in the entire multidimensional space decreases sharply and the sparsity (#zeros / #total_elements) increases. Thus, most of the data space is consumed unnecessarily, and the analysis cost for that data space increases unnecessarily.
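The two ratios defined above can be sketched as follows. The three-dimensional shapes and the count of 1000 non-zero entries are illustrative assumptions chosen only to show how density falls as resolution rises; the function names are hypothetical.

```python
from math import prod

def density(nnz: int, shape: tuple) -> float:
    """Fraction of cells that hold a non-zero value (#nonzeros / #total_elements)."""
    return nnz / prod(shape)

def sparsity(nnz: int, shape: tuple) -> float:
    """Fraction of cells that are zero (#zeros / #total_elements)."""
    return 1.0 - density(nnz, shape)

nnz = 1000                # assumed number of non-zero entries
coarse = (10, 10, 10)     # resolution 10 per dimension (three dimensions assumed)
fine = (100, 100, 100)    # resolution 100 per dimension

for shape in (coarse, fine):
    print(shape, density(nnz, shape), sparsity(nnz, shape))
```

With the same non-zero entries, the finer grid holds a thousand times more cells in this three-dimensional case, so nearly all of the dense storage would be spent on zeros.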

For MATLAB, a popular tool for scientific data analysis, separate extension libraries such as the Tensor Toolbox have been developed because of its poor processing capability for multi-dimensional sparse data, but large-scale data still cannot be processed.

It is an object of the present invention to provide a data analysis apparatus, a data analysis method, and a storage medium for storing a data analysis program for effectively supporting analysis of multi-dimensional data (tensor).

Another object of the present invention is to provide a data analysis apparatus, a data analysis method, and a storage medium storing a data analysis program for efficiently managing and analyzing multidimensional data by exploiting the characteristic of scientific data that most of the data are sparse.

It is still another object of the present invention to provide a data analysis apparatus, a data analysis method, and a storage medium storing a data analysis program for efficiently managing and analyzing multi-dimensional data using an external interworking method.

According to an aspect of the present invention, there is provided a method of analyzing data using a distributed system located remotely from a user analysis tool, the method comprising: when tensor data are remotely transmitted from the user analysis tool, converting the tensor data into one of a dense type and a sparse type on the basis of a preset threshold value, distributing the converted data to the distributed system, and storing the data; recording metadata of the stored data; performing analysis of the stored data using an analysis algorithm and the distributed system when an analysis command is transmitted from the user analysis tool; and, when the data analysis has been performed, storing the analysis result in the distributed system and/or transmitting the analysis result to the user analysis tool.

The data conversion step comprises: calculating, from the tensor data, the total amount of data necessary to express the tensor data in the sparse type; calculating, from the tensor data, the total data space necessary to express the tensor data in the dense type; calculating the ratio of the total amount of data necessary for the sparse type to the total data space necessary for the dense type; and converting the tensor data into sparse-type data if the ratio is smaller than the threshold value, and into dense-type data if the ratio is not smaller than the threshold value.

The method may further comprise listing or retrieving data stored in the distributed system according to a request from the user analysis tool.

The analyzing step may further include converting part or all of the analysis algorithm into an Einstein summation form.

A data analysis apparatus according to an embodiment of the present invention includes: a user analysis unit having a user analysis tool; a distributed system for distributing and storing data; and a large-capacity tensor analysis engine that converts tensor data transmitted remotely from the user analysis unit into one of a dense type and a sparse type on the basis of a preset threshold value, distributes the converted data to the distributed system and stores the data, and analyzes the stored data by performing an analysis algorithm when an analysis command is transmitted from the user analysis unit.

The large capacity tensor analysis engine may store the analysis result of the data in the distributed system and / or transmit it to the user analysis unit.

The large-capacity tensor analysis engine includes: a data manager for recording the metadata and for listing or searching data stored in the distributed system as requested by the user analysis unit; a data conversion unit for converting the tensor data transmitted remotely from the user analysis unit into one of a dense type and a sparse type based on a preset threshold value, distributing the data to the distributed system, and storing the data; an analysis library providing a library for performing the analysis algorithm; and a tensor analysis engine that converts part or all of the analysis algorithm into an Einstein summation form to perform the data analysis.

The data conversion unit calculates, from the tensor data, the total amount of data necessary to express the tensor data in the sparse type and the total data space necessary to express the tensor data in the dense type. If the ratio of the total amount of data necessary for the sparse type to the total data space necessary for the dense type is smaller than the threshold value, the tensor data are converted into sparse-type data; if the ratio is not smaller, the tensor data are converted into dense-type data.

The storage medium according to an exemplary embodiment of the present invention stores a program that, when tensor data are remotely transmitted from the user analysis tool, converts the tensor data into one of a dense type and a sparse type based on a predetermined threshold value, distributes and stores the converted data in a distributed system, and analyzes the stored data by performing an analysis algorithm when an analysis command is transmitted from the user analysis tool.

The data analysis apparatus, the data analysis method, and the storage medium storing a data analysis program according to the present invention have the following advantages from the viewpoint of the user.

First, users who use user analysis tools can easily outsource the data management and analysis functions of multidimensional data.

Second, the user can easily perform the whole series of steps of remotely sending data to the external analysis system, storing and managing the data there, performing the analysis, and importing the results into the user analysis tool. The user can therefore concentrate on the user analysis tool, and the management burden on the user is reduced.

Third, the tensor data are converted into sparse-type or dense-type data and managed, either by a direct command of the user or by an automated method; the analysis algorithm corresponding to the type is applied selectively; and the data are converted into the optimal form for transmission over the network. This solves the problem of sparse data being expressed unnecessarily as the dense type, which wastes memory and computing time. As a result, optimized data management minimizes the user's data analysis time and significantly reduces the cost of building a system for processing tensor data of the same size.

Fourth, each user connects to a server, which is the external analysis system, through that user's own user analysis tool, and a plurality of users can access and share the server simultaneously, so that data can be shared among users. In particular, in the complex analysis of large-scale multidimensional data, users can work on data analysis tasks at the same time, allowing scientists to conduct collaborative research regardless of geography.

The data analysis apparatus, the data analysis method, and the storage medium storing a data analysis program according to the present invention have the following advantages in terms of system performance.

First, by distributing and processing the tensor data in the large-capacity tensor analysis system, it is possible to solve the problem of data management / excessive computation concentration in a single machine-based user analysis tool environment, and large-scale data processing can be realized.

Second, based on a general purpose cluster-based distributed data processing platform such as Spark / Hadoop, the scale of the cluster can be scaled up to easily extend its data management / computation capabilities.

Third, since it has a client-server structure and multiple users can access and use it at the same time, the utilization of resources such as clusters can be maximized.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic drawing showing an example of dense data and sparse data existing in a multidimensional space.
FIG. 2 is a block diagram showing an embodiment of a data analysis apparatus according to the present invention.
FIG. 3 is a block diagram showing an embodiment of a data analysis apparatus for storing data in an external system using an external interworking method according to the present invention.
FIG. 4 is a flowchart showing an embodiment of a method of storing data in an external system using an external interworking method in the data analysis apparatus of FIG. 3.
FIG. 5 is a block diagram showing an embodiment of a data analysis apparatus for analyzing data stored in an external system using an external interworking method according to the present invention.
FIG. 6 is a flowchart showing an embodiment of a method of analyzing data stored in an external system using an external interworking method in the data analysis apparatus of FIG. 5.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. The structure and operation of the present invention shown in the drawings and described by the drawings are described as at least one embodiment, and the technical ideas and the core structure and operation of the present invention are not limited thereby.

As used herein, terms used in the present invention are selected from general terms that are widely used in the present invention while taking into account the functions of the present invention, but these may vary depending on the intention or custom of a person skilled in the art or the emergence of new technologies. Also, in certain cases, there may be a term chosen arbitrarily by the applicant, in which case the meaning shall be described in detail in the description part of the relevant specification. Accordingly, it is intended that the terminology used herein should be defined not only by the nomenclature of the term, but also by the meaning of the term and its scope throughout this specification.

In addition, the specific structural and functional descriptions of the embodiments disclosed herein are given only for the purpose of describing embodiments according to the concept of the present invention; embodiments according to the concept of the present invention may be embodied in various forms and should not be construed as limited to the embodiments set forth herein.

Embodiments in accordance with the concepts herein may be made in various manners and may take various forms, so that specific embodiments are illustrated in the drawings and described in detail herein. It is to be understood, however, that it is not intended to limit the embodiments consistent with the concepts herein to the particular forms disclosed, but includes all modifications, equivalents, and alternatives falling within the spirit and scope of the disclosure.

In this specification, terms such as first and/or second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of rights under the concept of the present disclosure, a first component may be referred to as a second component, and similarly a second component may be referred to as a first component.

Also, throughout the specification, when a part is said to "include" an element, this means that it may further include other elements rather than excluding them, unless specifically stated otherwise. Terms such as "unit" and "part" in the specification mean units for processing at least one function or operation, which may be implemented by hardware, software, or a combination of hardware and software.

In the present invention, the term 'tensor' refers to a multidimensional array of three or more dimensions, and is used in the same meaning as multidimensional data and tensor data.

FIG. 2 shows an example of a data analysis apparatus that supports multidimensional scientific data analysis according to the present invention and interworks with an external analysis system to expand the processing capability of a user's analysis tool (e.g., MATLAB, R, etc.).

FIG. 2 shows the user analysis unit 100 and the external analysis unit 200, which is located remotely from it. The user analysis unit 100 is the user analysis environment, and for convenience of explanation the term is used interchangeably with "first analysis unit". The external analysis unit 200 is a large-scale data analysis environment, and for convenience of explanation the term is used interchangeably with "second analysis unit" or "external analysis system". The external analysis unit 200 may also be a server.

The user analysis unit 100 and the external analysis unit 200 may be connected by wire and / or wireless, and in the present invention, they are wirelessly connected.

The user analysis unit 100 and the external analysis unit 200 have a client-server structure.

The user analysis unit 100 includes a user analysis tool 110 and a first connector 120.

The external analysis unit 200 includes a second connector 210, a large-scale tensor analysis (LTA) engine 220, and a distributed system 230.

The user analysis tool 110 performs scientific data analysis using an analysis tool such as MATLAB or R. An analysis tool such as MATLAB or R usually performs analysis by loading the necessary data in a single machine environment. Therefore, in the present invention, the user analysis tool 110 is used in conjunction with the external analysis unit 200.

That is, the user analysis tool 110 analyzes large-scale multi-dimensional data by performing the following functions in cooperation with the external analysis unit 200.

In one embodiment, the user analysis tool 110 transmits data to the external analysis unit 200 through the first connector 120. The data transmitted here is data stored in a separate disk file or memory in the user analysis environment.

Also, the user analysis tool 110 lists and searches data existing in the data space of the external analysis unit 200, which is a large-capacity data analysis environment, remotely requests analysis of the data in that environment, and receives the operation and analysis result data from the external analysis unit 200.

Meanwhile, in the present invention, the user analysis unit 100 includes a first connector 120, and the external analysis unit 200 includes a second connector 210.

The first and second connectors 120 and 210 link the user analysis tool 110, used by a user such as a scientist, with the large-capacity tensor analysis engine 220 of the external analysis unit 200. That is, the data being analyzed by the user analysis tool 110 may be transmitted to the large-capacity tensor analysis engine 220 through the first and second connectors 120 and 210, or analysis of the data held by the large-capacity tensor analysis engine 220 may be requested. The results analyzed by the large-capacity tensor analysis engine 220 may be provided to the user analysis tool 110 through the first and second connectors 120 and 210. The first and second connectors 120 and 210 are installed in the user analysis unit 100 and the external analysis unit 200, respectively, in the form of a middleware engine.

In other words, the user analysis tool 110 may transmit data to the large-capacity tensor analysis engine 220 via the first and second connectors 120 and 210, or may call a data analysis function of the large-capacity tensor analysis engine 220. Also, the user analysis tool 110 may receive analysis results of the large-capacity tensor analysis engine 220 through the first and second connectors 120 and 210.

In one embodiment, the large-capacity tensor analysis engine 220 includes a large-capacity data manager (large-scale tensor data manager) 221, a data conversion unit (sparse/dense transformer) 223, an analysis library (complex tensor analysis library) 225, and a tensor analysis engine 227.

The large-capacity data manager 221 can list and search data managed by the external analysis unit 200 according to a user's request transmitted remotely from the user analysis tool 110. For example, the large-capacity data manager 221 can list and search data stored in the distributed system 230 according to a user's request. Each dataset has its own name and can be of the sparse type or the dense type. These data are physically managed by the distributed system 230, and the large-capacity data manager 221 of the large-capacity tensor analysis engine 220 creates and deletes the data and manages the metadata (tensor descriptor). The metadata, i.e., the tensor descriptor, includes the name of the managed tensor, a title, the creation date and time, the access right, the dimensions, the density, the sparsity, and whether the data are stored in structured or binary form, and also records the location (URL) of the managed data and the access method.
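A tensor descriptor of this kind might be modeled as a simple record. The following is a minimal sketch only: the class name, field names, types, and the example URL are all hypothetical, since the source lists the kinds of information recorded but not a concrete schema.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TensorDescriptor:
    """Hypothetical metadata record; field names are illustrative only."""
    name: str
    title: str
    created_at: datetime
    access_right: str
    dimensions: tuple      # size of each dimension
    density: float         # non-zeros / total elements
    storage_type: str      # "sparse" or "dense"
    storage_format: str    # "structured" or "binary"
    location: str          # URL of the data in the distributed system
    access_method: str

    @property
    def sparsity(self) -> float:
        # Sparsity is the complement of density, per the definitions in the text.
        return 1.0 - self.density

descriptor = TensorDescriptor(
    name="weather_tensor", title="example tensor",
    created_at=datetime(2015, 12, 24), access_right="owner-only",
    dimensions=(200, 168, 40, 20), density=0.01,
    storage_type="sparse", storage_format="binary",
    location="hdfs://example-cluster/tensors/weather_tensor",  # placeholder URL
    access_method="hdfs",
)
print(descriptor.name, descriptor.storage_type, descriptor.sparsity)
```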

The data conversion unit 223 converts the tensor data into a sparse form or a dense form as necessary. That is, the tensor data of the dense type is converted into the tensor data of the sparse type, or the tensor data of the sparse type is converted into the tensor data of the dense type.

Generally, the dense type expresses a matrix as data laid out in one-dimensional form together with its shape, for example matrix(data = (1,0,3,4), row = 2, col = 2) for a 2-by-2 matrix. In contrast, the sparse type expresses the same matrix as a list of coordinates and values, such as {((1,1), 1), ((2,1), 3), ((2,2), 4)}. A specified value (here, 0) is recognized as the value of the empty space and is not expressed. For example, ((1,2), 0) can be omitted from the sparse-type representation, which reduces the overall data. However, since the sparse type expresses the coordinates of every entry, additional data space is consumed for them. This technique is in common use, but in the present invention the data conversion unit 223 determines the type of data to be managed and exchanged in consideration of the density of the data.
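The two representations in this example can be converted into each other as sketched below. The function names are hypothetical, and row-major ordering with 1-based coordinates is assumed because it matches the coordinates given in the example.

```python
def dense_to_sparse(data, rows, cols, fill=0):
    """Convert row-major dense data to {(row, col): value}, 1-based, dropping `fill`."""
    assert len(data) == rows * cols
    return {
        (i // cols + 1, i % cols + 1): v
        for i, v in enumerate(data)
        if v != fill
    }

def sparse_to_dense(entries, rows, cols, fill=0):
    """Rebuild the row-major dense data from the coordinate list."""
    data = [fill] * (rows * cols)
    for (r, c), v in entries.items():
        data[(r - 1) * cols + (c - 1)] = v
    return data

sparse = dense_to_sparse([1, 0, 3, 4], rows=2, cols=2)
print(sparse)                          # coordinate list without the zero entry
print(sparse_to_dense(sparse, 2, 2))   # back to the original dense data
```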

To this end, the data conversion unit 223 obtains the number of non-zeros (#nz) in the input tensor data and computes the total amount of data necessary to express the tensor in the sparse type (#total_sparse = #nz * #dim * unit_byte). It then obtains the ratio (ε) of this amount to the total data space (#total_dense) necessary to express the tensor in the dense type (ε = #total_sparse / #total_dense). If the ratio ε is smaller than a predetermined threshold value, the input tensor data are expressed as the sparse type; otherwise, they are expressed as the dense type. For example, if ε = 0.01, the sparse type needs only 1% of the memory required to express the entire tensor data as the dense type, so memory space is used far more efficiently. In the present invention, the threshold value may be set arbitrarily by the user or determined automatically depending on the system configuration (memory, disk, etc.).
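The ratio test described here can be sketched as a small function. The formula for #total_sparse follows the text; computing #total_dense as the product of the dimension sizes times unit_byte, the default threshold of 0.5, and the function name are assumptions made for illustration.

```python
from math import prod

def choose_type(nnz, shape, unit_byte=4, threshold=0.5):
    """Return ("sparse" | "dense", epsilon) following the ratio test in the text."""
    total_sparse = nnz * len(shape) * unit_byte   # #nz * #dim * unit_byte
    total_dense = prod(shape) * unit_byte         # assumed: one unit per cell
    epsilon = total_sparse / total_dense
    return ("sparse" if epsilon < threshold else "dense"), epsilon

print(choose_type(nnz=1000, shape=(100, 100, 100)))  # few non-zeros in a large space
print(choose_type(nnz=900, shape=(10, 10, 10)))      # almost every cell is non-zero
```

In the second call the coordinate overhead makes the sparse form larger than the dense form, so the ratio exceeds the threshold and the dense type is chosen.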

As described above, the data conversion unit 223 converts the tensor data into sparse-type or dense-type data and manages them, either by a user's direct command or by an automated method, selectively applies the analysis algorithm corresponding to the type, and converts the data into the optimal form for transmission. This solves the problem of sparse data being expressed unnecessarily as the dense type, which wastes memory and computing time.

The analysis library 225 provides an advanced analysis library for performing complex analysis algorithms. This library can include data clustering, pattern analysis, etc., and can be extended further by user-defined functions (UDFs).

The tensor analysis engine 227 optimizes intermediate data when multiple operations between tensors are chained. Tensor analysis typically generates intermediate data, and in large-scale tensor analysis the intermediate data can become so large that the analysis cannot be processed. For example, when the matrices A, B, and C are multiplied, intermediate data are generated because the operation on C is performed only after the operation on A and B has been performed. To solve this problem, in one embodiment the tensor analysis engine 227 uses Einstein summation. The tensor analysis engine 227 can perform the mapping from the operands to the final result in a single concurrent operation, thereby minimizing the generation of intermediate data. In addition, by using an Einstein summation library, the tensor analysis engine 227 adopts a subscript-based tensor operation method.
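The A, B, C example can be illustrated in subscript form as R[i][l] = Σ over j and k of A[i][j] * B[j][k] * C[k][l]. The sketch below, in plain Python with hypothetical function names, contrasts the two-step product, which stores the intermediate matrix A·B, with a single subscript-driven pass that maps the operands directly to the result. It illustrates the idea only and is not the engine's actual implementation.

```python
def chain_two_step(A, B, C):
    """(A*B)*C: materializes the intermediate matrix A*B first."""
    AB = [[sum(A[i][j] * B[j][k] for j in range(len(B)))
           for k in range(len(B[0]))] for i in range(len(A))]
    return [[sum(AB[i][k] * C[k][l] for k in range(len(C)))
             for l in range(len(C[0]))] for i in range(len(A))]

def chain_einsum(A, B, C):
    """R[i][l] = sum over j, k of A[i][j] * B[j][k] * C[k][l], with no stored A*B."""
    J, K, L = len(B), len(C), len(C[0])
    return [[sum(A[i][j] * B[j][k] * C[k][l]
                 for j in range(J) for k in range(K))
             for l in range(L)] for i in range(len(A))]

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[2, 0], [0, 2]]
print(chain_einsum(A, B, C))
print(chain_einsum(A, B, C) == chain_two_step(A, B, C))
```

The single-pass form trades extra multiplications for the absence of an intermediate array, which is the property the text emphasizes for large tensors.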

The distributed system 230 of the external analysis unit 200 stores and manages data in a distributed manner using a distributed computing environment (for example, Spark / Hadoop), and also processes data operations in a distributed manner. The Einstein summation operations of the tensor analysis engine 227 are converted into MapReduce processing in the distributed system 230 and processed on the cluster.
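How a subscript contraction might be phrased as map and reduce steps can be sketched in a single process. This is a stand-in written in plain Python, not Hadoop or Spark code; the function names and the choice of the contraction R_ik = Σ_j A_ij * B_jk over sparse coordinate lists are assumptions for illustration.

```python
from collections import defaultdict

def map_phase(A, B):
    """Emit ((i, k), partial product) for every pair of entries sharing index j."""
    b_by_row = defaultdict(list)
    for (j, k), b in B.items():
        b_by_row[j].append((k, b))
    for (i, j), a in A.items():
        for k, b in b_by_row[j]:
            yield (i, k), a * b

def reduce_phase(pairs):
    """Sum the partial products that share the same output coordinate."""
    out = defaultdict(int)
    for key, value in pairs:
        out[key] += value
    return dict(out)

# Sparse operands as {(row, col): value}; zero entries are simply absent.
A = {(1, 1): 1, (2, 1): 3, (2, 2): 4}
B = {(1, 2): 5, (2, 2): 6}
result = reduce_phase(map_phase(A, B))
print(result)
```

Because only stored (non-zero) entries are emitted, the work scales with the number of matching non-zeros rather than with the full dense space.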

In the present invention, the distributed system 230 uses Hadoop as an example. Hadoop is an open-source Java software framework that supports distributed applications running on large-scale computer clusters capable of handling large amounts of data. Hadoop consists of the Hadoop Distributed File System (HDFS), a database management system (HBase), and MapReduce, and this set of technologies is called the Hadoop ecosystem.

Hadoop has four main uses. The first is search engine indexing, the second is data analysis or statistical analysis, the third is data pre-processing (table precomputation and rollup), and the fourth is structured data storage.

The Hadoop distributed file system divides a file into blocks, replicates each block several times, and stores the replicas on a plurality of nodes (servers). As a result, when some hardware malfunctions, the replicas on other servers take its place and the data are restored automatically. Even a single very large file can be split across several nodes, which makes the file system very effective for storing unstructured data.

HBase is a NoSQL-style distributed database management system (DBMS) that works with the Hadoop distributed file system; NoSQL denotes a non-relational database.

MapReduce is software that uses Map and Reduce functions in a distributed environment such as the Hadoop distributed file system to process large amounts of data efficiently, minimize network traffic, and carry out the required tasks while recovering automatically from failures. In other words, MapReduce is one of the big data processing technologies for handling large data sets: it processes the data stored in the distributed file system in a distributed manner so that a large amount of data can be processed in a short time.

That is, because the present invention is based on a general-purpose, cluster-based distributed data processing platform such as Spark / Hadoop, its data management and computation capabilities can be extended easily by enlarging the cluster. In addition, since it has a client-server structure, multiple users can access and use it at the same time, maximizing the utilization of resources such as clusters.

FIG. 3 shows an embodiment of the data analysis apparatus of the present invention for moving the multidimensional data (tensor) of the user analysis tool 110 to the external analysis unit 200, by direct instruction of the user or in an automated manner, and managing the data there. FIG. 4 is a flowchart showing an embodiment of a method, based on FIG. 3, of moving the data (tensor) of the user analysis tool 110 to the external analysis unit 200 and managing the data.

That is, the user analysis tool 110 converts the tensor data into a form suitable for transmission to the external analysis unit 200 (S401), and transmits the converted data to the external analysis unit 200 remotely through the first connector 120 (S402). The second connector 210 of the external analysis unit 200 passes the data transmitted remotely from the user analysis unit 100 to the large-capacity tensor analysis engine 220. Then, the data conversion unit 223 of the large-capacity tensor analysis engine 220 converts the tensor data into sparse-type or dense-type data according to predetermined conditions (S403). The method of converting the tensor data into sparse-type or dense-type data has been described in detail above and is not repeated here.

The data conversion unit 223 distributes the data converted into the sparse type or the dense type to the distributed system 230 (S404).

The distributed system 230 stores the data received from the data conversion unit 223 in a distributed manner using a distributed computing environment (e.g., Spark / Hadoop).
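The source does not specify how entries are divided among nodes; in practice the distributed platform handles placement itself. Purely as an illustration of distributing sparse-type data, the sketch below assigns each (coordinate, value) entry to one of several hypothetical nodes by its leading index.

```python
def partition_entries(entries, num_nodes):
    """Assign each (coordinate, value) entry to a node by its leading index."""
    nodes = [dict() for _ in range(num_nodes)]
    for coord, value in entries.items():
        nodes[coord[0] % num_nodes][coord] = value
    return nodes

# Hypothetical sparse-type tensor: {(i, j, k): value}
entries = {(1, 1, 1): 2.5, (2, 1, 3): 1.0, (4, 2, 2): 7.0, (5, 3, 1): 0.5}
parts = partition_entries(entries, num_nodes=3)
for node_id, part in enumerate(parts):
    print(node_id, part)
```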

Also, the large-capacity data manager 221 of the large-capacity tensor analysis engine 220 records the metadata about the data stored in the distributed system 230 (S405). The metadata, also referred to as a tensor descriptor, include the name of the managed tensor, the owner name, the date and time of creation, the access right, the dimensions, the density, the sparsity, whether the data are stored in the distributed system 230 in structured or binary form, the location (URL) of the managed data, the access method, and the like. The metadata may be provided to the user analysis unit 100 together with the structured or binary data.

If there is a remote request from the user analysis tool 110 (S406), the large-capacity data manager 221 lists and searches the necessary data (S407).

The large-capacity tensor analysis engine 220 may also transmit the data stored in the distributed system 230 to the user analysis unit 100 by following the storage process in reverse order. The data transmitted to the user analysis unit 100 may be structured or binary data, and the metadata may be transmitted as well.

FIG. 5 shows an embodiment of the data analysis apparatus of the present invention in which the user analysis tool 110 directly browses and accesses the tensor data moved to the external analysis unit 200, remotely analyzes the data stored in the external analysis unit 200, and receives the analysis result. FIG. 6 is a flowchart illustrating an embodiment of a data analysis method of the present invention for analyzing tensor data based on FIG. 5.

That is, the user analysis tool 110 remotely requests the external analysis unit 200 to analyze the tensor data stored in the distributed system 230, and monitors the analysis process of the external analysis unit 200. If necessary, the analysis result may be stored in the distributed system 230 of the external analysis unit 200 as it is, or may be transmitted to the user analysis tool 110.

To this end, the user analysis tool 110 converts the analysis command into a form that can be processed by the external analysis unit 200, and then remotely transmits the analysis command to the external analysis unit 200 through the first connector 120 S601). The large capacity tensor analysis engine 220 of the external analysis unit 200 interprets the analysis command received through the second connector 210 and performs the requested analysis task in operation S602. At this time, the analysis command analysis is performed using the analysis algorithm in the analysis library 225 of the large-capacity tensor analysis engine 220. The tensor analysis engine 227 of the large capacity tensor analysis engine 220 converts part or all of the analysis algorithm to an Einstein summing method as necessary. The distributed system 230 is used to perform the analysis algorithm in the Einstein sum method (S603). The distributed system 230 analyzes data using a distributed data processing system such as a spark or Hadoop. When Hadoop is applied to the distributed system 230, the stored data may be analyzed using MapReduce. In other words, the MapReduce can distribute the stored data and process the large amount of data in a short time.

Here, the analysis algorithm may be called from the analysis library 225, or may be provided as a user-defined function (UDF) written by the user in the user analysis tool 110.
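A minimal sketch of this arrangement is given below, assuming a simple name-to-function registry. The class `AnalysisLibrary`, its `register` and `run` methods, and the toy tensor type are hypothetical and are not taken from the patent; they only show how a built-in algorithm and a UDF could be called through the same interface.

```python
from typing import Callable, Dict, List

Tensor = List[List[float]]            # toy stand-in for stored tensor data
Algorithm = Callable[[Tensor], object]

class AnalysisLibrary:
    """Hypothetical registry holding built-in algorithms and user UDFs."""

    def __init__(self) -> None:
        self._algorithms: Dict[str, Algorithm] = {}

    def register(self, name: str, func: Algorithm) -> None:
        # Built-in algorithms and UDFs are registered the same way.
        self._algorithms[name] = func

    def run(self, name: str, tensor: Tensor) -> object:
        if name not in self._algorithms:
            raise KeyError(f"unknown analysis algorithm: {name}")
        return self._algorithms[name](tensor)

def total_sum(tensor: Tensor) -> float:
    """Example of an algorithm shipped with the library."""
    return sum(sum(row) for row in tensor)

def row_max(tensor: Tensor) -> List[float]:
    """Example of a UDF written by the user."""
    return [max(row) for row in tensor]

library = AnalysisLibrary()
library.register("sum", total_sum)      # provided by the library
library.register("row_max", row_max)    # supplied by the user as a UDF

data = [[1.0, 5.0], [7.0, 2.0]]
print(library.run("sum", data))
print(library.run("row_max", data))
```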

In the Einstein summation processing method, sparse-type tensor data and dense-type tensor data are selectively designated and analyzed.
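The selective handling of the two storage types can be sketched as a dispatch on a stored type tag. The example contracts a matrix with a vector, y[i] = sum over j of A[i,j] * x[j]. The record layout (`type`, `shape`, `data`) and the function names are assumptions for illustration; the patent does not define the stored format at this level of detail.

```python
def contract_dense(matrix, vector):
    """Dense path: visit every element of every row."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

def contract_sparse(entries, n_rows, vector):
    """Sparse path: visit only the stored non-zero entries."""
    result = [0.0] * n_rows
    for (i, j), value in entries.items():
        result[i] += value * vector[j]
    return result

def contract(stored, vector):
    """Choose the contraction routine from the tensor's storage type."""
    if stored["type"] == "dense":
        return contract_dense(stored["data"], vector)
    if stored["type"] == "sparse":
        return contract_sparse(stored["data"], stored["shape"][0], vector)
    raise ValueError("unknown storage type: " + str(stored["type"]))

# The same 2 x 3 matrix stored in both forms (hypothetical record layout).
dense = {"type": "dense", "shape": (2, 3),
         "data": [[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]]}
sparse = {"type": "sparse", "shape": (2, 3),
          "data": {(0, 0): 1.0, (0, 2): 2.0, (1, 1): 3.0}}
x = [1.0, 2.0, 3.0]

print(contract(dense, x))
print(contract(sparse, x))
```

Both calls produce the same vector; only the amount of work differs, which is why choosing the routine per storage type matters for large tensors.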

The large-capacity tensor analysis engine 220 may store the analysis result in the distributed system 230 (S604), or may transmit the analysis result to the user analysis tool 110 at the request of the user analysis tool 110 (S605).
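The flow from S601 to S605 can be pictured with the following in-memory sketch. The patent does not state a wire format for the analysis command, so the JSON message, its field names, and the helper functions `encode_command` and `handle_command` are all assumptions; the dictionaries `TENSORS` and `RESULTS` merely stand in for the distributed system 230.

```python
import json

# Stand-ins for data and results held in the distributed system 230.
TENSORS = {"t1": [[1.0, 2.0], [3.0, 4.0]]}
RESULTS = {}

# Stand-in for the analysis library 225.
ALGORITHMS = {"sum": lambda tensor: sum(sum(row) for row in tensor)}

def encode_command(tensor_name, algorithm, return_result):
    """S601: put the analysis command into a form the remote side can parse."""
    return json.dumps({"tensor": tensor_name,
                       "algorithm": algorithm,
                       "return_result": return_result})

def handle_command(message):
    """S602 to S605 on the external analysis unit side."""
    command = json.loads(message)                        # S602: interpret
    tensor = TENSORS[command["tensor"]]
    result = ALGORITHMS[command["algorithm"]](tensor)    # S603: analyze
    if command["return_result"]:
        return json.dumps({"result": result})            # S605: transmit
    key = command["tensor"] + ":" + command["algorithm"]
    RESULTS[key] = result                                # S604: store
    return json.dumps({"stored_as": key})

reply_sent = json.loads(handle_command(encode_command("t1", "sum", True)))
reply_stored = json.loads(handle_command(encode_command("t1", "sum", False)))
print(reply_sent, reply_stored)
```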

In this way, users of the user analysis tool 110 can easily outsource the data management and analysis functions for multi-dimensional data. In addition, the user can easily perform the whole series of steps of remotely sending data to the external analysis unit 200, storing and managing the data, analyzing the data, and importing the data into the user analysis tool 110, which greatly reduces the management burden on the user.

Also, since each user can access the external analysis unit 200 through his or her own user analysis tool 110, and multiple users can access the external analysis unit 200 at the same time, users can work while sharing data. In particular, in the complex analysis of large-scale multidimensional data, users can carry out data analysis tasks concurrently, which allows scientists to conduct collaborative research regardless of geographic location.

In particular, tensor data is distributed and processed by the external large-capacity tensor analysis engine, which solves the problem of data management and computation being excessively concentrated in a single-machine user analysis tool environment and makes large-scale data processing possible.

The features, structures, effects and the like described in the embodiments are included in at least one embodiment of the present invention and are not necessarily limited to only one embodiment. Furthermore, the features, structures, effects and the like illustrated in the embodiments can be combined or modified for other embodiments by those skilled in the art to which the embodiments belong. Therefore, such combinations and modifications should be construed as falling within the scope of the present invention.

While the present invention has been shown and described with reference to exemplary embodiments thereof, the invention is not limited to the disclosed embodiments; on the contrary, it will be understood that various modifications and applications are possible. For example, each component specifically shown in the embodiments can be modified and implemented. All changes and modifications that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

100: user analysis unit
110: user analysis tool
120: first connector
200: external analysis unit
210: second connector
220: large-capacity tensor analysis engine
221: mass data manager
223: data conversion unit
225: analysis library
227: tensor analysis engine
230: distributed system

Claims

1. A method for analyzing data using a distributed system located remotely from a user analysis tool, the method comprising:
converting tensor data into either a dense type or a sparse type based on a predetermined threshold value when the tensor data is remotely transmitted from the user analysis tool, and distributing the converted data to the distributed system for storage;
recording metadata of the stored data;
performing analysis of the stored data using an analysis algorithm and the distributed system when an analysis command is transmitted from the user analysis tool; and
when the data analysis has been performed, storing the analysis result in the distributed system and/or transmitting the analysis result to the user analysis tool.

2. The method of claim 1, wherein converting the tensor data comprises:
calculating a total data amount required when the tensor data is expressed as a sparse type;
calculating a total data space required when the tensor data is expressed as a dense type;
calculating a ratio of the total data amount to the total data space; and
converting the tensor data into sparse-type data if the ratio is smaller than the threshold value, and converting the tensor data into dense-type data if the ratio is not smaller than the threshold value.

3. The method of claim 1, wherein the threshold value is designated by the user or designated automatically.

4. The method of claim 1, wherein the metadata includes at least one of a name of a tensor to be managed, an owner name, a creation date and time, an access right, a dimension, a density, a sparsity, and a location of the data stored in the distributed system.

5. The method of claim 1, further comprising listing or searching the data stored in the distributed system upon request from the user analysis tool.

6. The method of claim 1, further comprising converting part or all of the analysis algorithm into the Einstein summation method.

7. The method of claim 6, wherein, when the analysis algorithm is performed by the Einstein summation method, dense-type data and sparse-type data are selectively designated and processed.

8. An apparatus for analyzing data, comprising:
a user analysis unit having a user analysis tool;
a distributed system for distributing and storing data; and
a large-capacity tensor analysis engine that converts tensor data transmitted remotely from the user analysis unit into either a dense type or a sparse type based on a preset threshold value, distributes the converted data to the distributed system for storage, and analyzes the stored data by performing an analysis algorithm when an analysis command is transmitted from the user analysis unit.

9. The apparatus of claim 8, wherein the large-capacity tensor analysis engine stores the analysis result in the distributed system and/or transmits the analysis result to the user analysis unit.

10. The apparatus of claim 8, wherein the large-capacity tensor analysis engine comprises:
a data manager that records the metadata and, upon request from the user analysis unit, lists or searches the data stored in the distributed system;
a data conversion unit that converts the tensor data transmitted remotely from the user analysis unit into either a dense type or a sparse type based on a preset threshold value, and distributes the converted data to the distributed system for storage;
an analysis library that provides a library for performing the analysis algorithm; and
a tensor analysis engine that converts part or all of the analysis algorithm into the Einstein summation method to perform the data analysis.

11. The apparatus of claim 10, wherein the data conversion unit calculates a total data amount required when the tensor data is expressed as a sparse type, calculates a total data space required when the tensor data is expressed as a dense type, calculates a ratio of the total data amount to the total data space, and converts the tensor data into sparse-type data if the ratio is smaller than the threshold value and into dense-type data if the ratio is not smaller than the threshold value.

12. The apparatus of claim 10, wherein the threshold value is designated by the user or designated automatically.

13. The apparatus of claim 10, wherein the metadata includes at least one of a name of a tensor to be managed, an owner, a creation date and time, an access right, a dimension, a density, a sparsity, and a location of the data stored in the distributed system.

14. The apparatus of claim 10, wherein the tensor analysis engine, when the analysis algorithm is performed by the Einstein summation method, selectively designates and processes dense-type data and sparse-type data.

15. The apparatus of claim 10, wherein the analysis algorithm is called from the analysis library.

16. The apparatus of claim 10, wherein the library provided by the analysis library includes data clustering and pattern analysis.

17. The apparatus of claim 10, wherein the analysis algorithm is extended using a user-defined function (UDF) created by a user in the user analysis unit.

18. The apparatus of claim 8, wherein a plurality of the user analysis units are simultaneously connected to the large-capacity tensor analysis engine to share the data.

19. A storage medium storing a program for analyzing data, wherein the program, when tensor data is remotely transmitted from a user analysis tool, converts the tensor data into either a dense type or a sparse type based on a preset threshold value, distributes the converted data to a distributed system for storage, records metadata of the stored data, and analyzes the stored data by executing an analysis algorithm when an analysis command is transmitted from the user analysis tool.

20. The storage medium of claim 19, wherein converting the data comprises: calculating a total data amount required when the tensor data is expressed as a sparse type; calculating a total data space required when the tensor data is expressed as a dense type; and converting the tensor data into sparse-type data if the ratio of the total data amount to the total data space is smaller than the threshold value, and into dense-type data otherwise.