KR20170089067A

KR20170089067A - Bigdata processing system and method

Info

Publication number: KR20170089067A
Application number: KR1020160008634A
Authority: KR
Inventors: 박현; 김세한; 박대헌; 이은주
Original assignee: 한국전자통신연구원
Priority date: 2016-01-25
Filing date: 2016-01-25
Publication date: 2017-08-03

Abstract

According to one aspect of the present invention, there is provided a big data processing system comprising: a collecting unit for collecting data through various paths; a preprocessing unit for performing preprocessing on data transmitted from the collecting unit; a storage unit for storing input data in a distributed manner; an analysis unit for analyzing data transmitted from the preprocessing unit or data stored in the storage unit and generating an analysis result; a display unit for receiving and displaying the analysis result; and an interworking process executing unit for outputting an interworking process message so that the collecting unit, the preprocessing unit, the analysis unit, and the display unit operate in real time and process the data.

Description

Big data processing system and method

BACKGROUND OF THE INVENTION

The present invention relates to a big data processing technique, and more particularly, to a big data processing system and method capable of collecting, storing, and analyzing various types of big data, such as structured, semi-structured, and unstructured data, in real time.

Recently, interest has been growing in big data technology that extracts meaningful value from massive amounts of structured or unstructured data. Many application services are required to produce accurate and fast results through big data.

The term big data refers to data sets that exceed the ability of common software tools and computer systems to collect, manage, store, search, share, analyze, and visualize within a tolerable amount of time. The size of big data may range from terabytes to exabytes or zettabytes.

Big data exists in various fields, for example, web logs, radio frequency identification (RFID), social networks, social data, Internet text and documents, Internet search indexing, astronomy, meteorology, genomics, biogeochemistry, biology, military surveillance, medical records, photographic records, video recordings, and electronic commerce.

Big data processing is generally based on an ecosystem called Hadoop. Hadoop collects large amounts of structured or unstructured data, stores the data redundantly in a distributed manner, and processes the data in parallel on distributed network clusters.

Hadoop thereby gives big data its technical meaning: processing information in a short period of time and extracting valuable information. The Hadoop Distributed File System (HDFS) is an open-source technology for distributed storage of large amounts of data, and it reliably stores the collected data.

However, because Hadoop is a batch processing system, it cannot process collected data in real time. In other words, Hadoop stores the collected data for a certain period of time and then performs analysis on a large amount of data in response to an external request for data analysis.

Recent alternatives within the Hadoop ecosystem include Storm and Spark, which are in-memory data processing technologies.

Storm can process events in parallel without storing them, and it processes data in a manner similar to the MapReduce model. In Storm's mechanism, a spout generates data in units of tuples, and bolts process the data in units of tuples and store the processing results.
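The spout-and-bolt flow described above can be illustrated with a minimal Python sketch. It does not use the Apache Storm API; the names `sensor_spout`, `threshold_bolt`, and `run_topology`, as well as the sample readings, are hypothetical and only mimic the tuple-at-a-time model.

```python
from typing import Iterator, List, Tuple

Reading = Tuple[str, float]  # hypothetical tuple layout: (sensor id, value)


def sensor_spout() -> Iterator[Reading]:
    """Plays the role of a spout: emits data one tuple at a time."""
    for reading in [("s1", 6.8), ("s2", 9.4), ("s1", 7.1)]:
        yield reading


def threshold_bolt(stream: Iterator[Reading], limit: float) -> Iterator[Reading]:
    """Plays the role of a bolt: processes each tuple as it arrives."""
    for sensor_id, value in stream:
        if value > limit:
            yield (sensor_id, value)


def run_topology(limit: float = 7.0) -> List[Reading]:
    # Tuples flow from the spout to the bolt without being stored first;
    # only the processing results are collected at the end.
    return list(threshold_bolt(sensor_spout(), limit))
```

Because both stages are generators, each tuple passes through the bolt before the next one is produced, which is the property that distinguishes this style from batch processing.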

Spark performs data processing by introducing a dataset abstraction called the Resilient Distributed Dataset (RDD).

However, these conventional techniques require a Hadoop-based mechanism, and additional effort is needed to learn it. In addition, the prior art is mainly useful for applications that perform many repetitive tasks on large amounts of data, for example, scientific applications involving repetitive data operations.

However, many recent applications seek value through data analysis that integrates various kinds of data (sensor data, social data, system data, accumulated data, weather data, environmental public data, etc.) rather than through fast in-memory computation of repetitive numerical operations.

For example, there is a need for a real-time processing method for applications based on various kinds of big data, such as an application that collects and analyzes various data related to environmental disasters together.

SUMMARY OF THE INVENTION

Accordingly, the present invention has been made in view of the above problems, and it is an object of the present invention to provide a big data processing system and method capable of collecting, storing, and analyzing various types of big data in real time.

According to an aspect of the present invention, there is provided a big data processing system including: a collecting unit for collecting data through various paths; a preprocessing unit for performing preprocessing on data transmitted from the collecting unit; a storage unit for storing input data in a distributed manner; an analysis unit for analyzing data transmitted from the preprocessing unit or data stored in the storage unit and generating an analysis result; a display unit for receiving and displaying the analysis result; and an interworking process executing unit for outputting an interworking process message so that the collecting unit, the preprocessing unit, the analysis unit, and the display unit operate in real time and process the data.

The collecting unit collects structured, semi-structured, and unstructured data using a data collection tool.

The preprocessing unit includes a plurality of preprocessing modules for performing data preprocessing that differs depending on the application service.

The storage unit is a Hadoop Distributed File System (HDFS).

The collecting unit transmits data corresponding to the data type included in the interworking process message to the preprocessing unit in real time.

When the interworking process executing unit outputs the interworking process message, the collecting unit transfers the collected data to the preprocessing unit in real time, the preprocessing unit delivers the preprocessed data to the analysis unit in real time, and the analysis unit analyzes the delivered data in real time and transmits the analysis result to the display unit.

According to another aspect of the present invention, there is provided a method of operating a big data processing system for storing and analyzing input data, the method comprising: setting the system to operate in an interworking processing mode; transmitting data collected by the collecting unit to the preprocessing unit in real time; preprocessing, by the preprocessing unit, the transmitted data in real time and transferring the preprocessed data to the analysis unit; analyzing, by the analysis unit, the data transmitted from the preprocessing unit in real time and generating an analysis result; and displaying the analysis result generated by the analysis unit in real time.

The step of setting the system to operate in the interworking processing mode includes transmitting an interworking process message to the collecting unit, the preprocessing unit, the analysis unit, and the display unit.

The transmitting of the collected data to the preprocessing unit in real time may include transmitting data corresponding to the data type included in the interworking process message to the preprocessing unit in real time.

The method may further include storing the analysis result generated by the analysis unit in a storage unit.

With the big data processing system and method of the present invention, for real-time processing of various kinds of big data, data can be collected regardless of the type of the received data and stored in cooperation with the Hadoop system, and, in addition to the analysis being performed, the analysis result can be visualized automatically.

In addition, since the data transfer between the functional modules is performed in real time under the control of the streaming interworking adaptation module, it is possible to provide real-time services to applications based on various big data.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram showing an example of a configuration of a big data processing system according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of a configuration of a storage unit of a big data processing system according to an embodiment of the present invention.
FIG. 3 is a flowchart showing a procedure according to a first operation of a big data processing system according to an embodiment of the present invention.
FIG. 4 is a flowchart showing a procedure according to a second operation of the big data processing system according to the embodiment of the present invention.

The advantages and features of the present invention and the manner of achieving them will become apparent with reference to the embodiments described in detail below with reference to the accompanying drawings. The present invention may, however, be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the scope of the invention to those skilled in the art, and the invention is defined only by the scope of the claims. Like numbers refer to like elements throughout.

In the following description of the present invention, a detailed description of known functions and configurations will be omitted when it may obscure the subject matter of the present invention. The following terms are defined in consideration of their functions in the embodiments of the present invention, and they may vary depending on the intention or custom of the user or operator. Therefore, their definitions should be based on the contents throughout this specification.

Hereinafter, a big data processing system and a processing method according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating an example of a configuration of a big data processing system according to an embodiment of the present invention. FIG. 2 is a diagram illustrating an example of a configuration of a storage unit of a big data processing system according to an embodiment of the present invention.

Referring to FIG. 1, a big data processing system 100 according to an exemplary embodiment of the present invention collects, stores, and analyzes various types of data, such as structured, semi-structured, and unstructured data, and can be used in a system requiring analysis results of big data in real time, for example, an environmental disaster early detection system.

Specifically, the system 100 may include a collecting unit 110, a preprocessing unit 120, an analysis unit 130, a storage unit 140, a display unit 150, and an interworking process executing unit 160.

The collecting unit 110 collects various types of data through various paths. For example, the collecting unit 110 collects environmental information sensed by sensors (water quality sensors, air sensors, etc.) that sense the environment, social information on the Internet, satellite image information, and the like.

The collecting unit 110 collects data using various types of data collection tools. For example, the data collection tools may include Sqoop, Flume, a crawler, Scribe, and the like.

The collecting unit 110 collects data through various types of interface protocols. For example, the interface protocol may be the File Transfer Protocol (FTP), the Hypertext Transfer Protocol (HTTP), or the Transmission Control Protocol/Internet Protocol (TCP/IP).

The data collected by the collecting unit 110 may be stored in the storage unit 140 or transferred to the preprocessing unit 120 for pre-processing if necessary.

In particular, when the collecting unit 110 receives the interworking process message transmitted from the interworking process executing unit 160, it transmits, from among the collected data, the data corresponding to the data type included in the interworking process message to the preprocessing unit 120.
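The selection by data type described above can be sketched as follows. This is only an illustration under stated assumptions: the message is modeled as a plain dictionary with a hypothetical `data_type` key, each collected record carries a hypothetical `type` field, and the function name `select_for_preprocessing` is not taken from the specification.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def select_for_preprocessing(collected: List[Record],
                             message: Dict[str, str]) -> List[Record]:
    """Return only the records whose type matches the message's data type."""
    wanted = message["data_type"]  # hypothetical key name
    # Records of other types are not forwarded by this function.
    return [record for record in collected if record.get("type") == wanted]


collected = [
    {"type": "water_quality", "value": 7.2},
    {"type": "social", "value": "flood near the river"},
    {"type": "water_quality", "value": 6.9},
]
selected = select_for_preprocessing(collected, {"data_type": "water_quality"})
```

In this sketch the records that do not match simply remain in `collected`; what the collecting unit does with them (for example, storing them in the storage unit 140) is left open.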

The preprocessing unit 120 preprocesses the data transmitted from the collecting unit 110 to generate preprocessed data.

The preprocessing unit 120 may include a plurality of preprocessing modules for preprocessing data that varies according to an application service.
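One way to picture a plurality of preprocessing modules selected per application service is a registry that maps a service name to an ordered list of functions. The sketch below is an assumption for illustration: the service names, the two module functions, and the `PREPROCESSING_MODULES` table are hypothetical and are not described in the specification.

```python
from typing import Callable, Dict, List

Module = Callable[[List[float]], List[float]]


def drop_negative(values: List[float]) -> List[float]:
    """Hypothetical module: discard physically impossible negative readings."""
    return [v for v in values if v >= 0]


def normalize(values: List[float]) -> List[float]:
    """Hypothetical module: scale readings by the largest value."""
    peak = max(values, default=0.0)
    return [v / peak for v in values] if peak else list(values)


# Hypothetical registry: each application service gets its own module chain.
PREPROCESSING_MODULES: Dict[str, List[Module]] = {
    "water_quality": [drop_negative, normalize],
    "air_quality": [drop_negative],
}


def preprocess(service: str, values: List[float]) -> List[float]:
    """Apply the modules registered for the service, in order."""
    for module in PREPROCESSING_MODULES.get(service, []):
        values = module(values)
    return values
```

A service with no registered modules passes its data through unchanged, which keeps the dispatch logic separate from the individual preprocessing rules.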

The preprocessed data generated by the preprocessing unit 120 may be stored in the storage unit 140 or transmitted to the analysis unit 130 for analysis as needed.

In particular, upon receiving the interworking process message transmitted from the interworking process executing unit 160, the preprocessing unit 120 preprocesses the data transmitted from the collecting unit 110 in real time, transmits the preprocessed data to the analysis unit 130, and transmits the data to the storage unit 140.

At this time, the preprocessing unit 120 continuously checks its own state and the state of the analysis unit 130, and transmits data when the states of both the preprocessing unit 120 and the analysis unit 130 are normal.
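The state check before transmission can be sketched as a loop that re-evaluates both states before each send. The specification does not say what happens to data while a state is abnormal; holding the records in a queue, as done here, is an assumption, and the names `flush_when_normal`, `preprocessor_ok`, and `analyzer_ok` are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Dict


def flush_when_normal(pending: Deque[Dict],
                      preprocessor_ok: Callable[[], bool],
                      analyzer_ok: Callable[[], bool],
                      send: Callable[[Dict], None]) -> int:
    """Send queued records only while both units report a normal state.

    Returns the number of records sent; unsent records stay in the queue.
    """
    sent = 0
    # Both states are checked again before every single transmission.
    while pending and preprocessor_ok() and analyzer_ok():
        send(pending.popleft())
        sent += 1
    return sent


pending = deque([{"id": 1}, {"id": 2}, {"id": 3}])
received = []
analyzer_states = iter([True, True, False])  # analyzer becomes abnormal on the third check
sent_count = flush_when_normal(pending, lambda: True,
                               lambda: next(analyzer_states), received.append)
```

In this usage the third record is not transmitted because the analysis side reports an abnormal state, and it remains queued for a later attempt.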

The analysis unit 130 analyzes the data transmitted from the preprocessing unit 120 or the data provided from the storage unit 140 and generates analysis results.

Here, the analysis unit 130 may analyze the data using an analysis model appropriate to the application service. The analysis result generated by the analysis unit 130 may be stored in the storage unit 140 or displayed on the display unit 150.

The analysis unit 130 may analyze the data stored in the storage unit 140 in accordance with a user's command, or may analyze the data transmitted from the preprocessing unit 120 in real time in response to an instruction from the interworking process executing unit 160.

In particular, upon receiving the interworking process message transmitted from the interworking process executing unit 160, the analysis unit 130 analyzes the data transmitted from the preprocessing unit 120 in real time and transmits the analysis result to the display unit 150.

The storage unit 140 is configured to store input data in a distributed manner and may be, for example, a Hadoop Distributed File System (HDFS).

At this time, the storage unit 140 may store data transmitted from the collecting unit 110, data transmitted from the preprocessing unit 120, and analysis results transmitted from the analysis unit 130.

The Hadoop Distributed File System is well known in the art, so the structure of the storage unit 140 is described only briefly, with reference to FIG. 2, which shows an example of the storage unit 140.

Referring to FIG. 2, the storage unit 140 may include a client node 141, a name node 142, and a data node 143.

The client node 141 is responsible for input / output of data through the HDFS API. The name node 142 stores metadata related to data to be stored and is responsible for storing data in the data node 143.

The data node 143 serves to provide data requested from the name node 142 or to store data provided from the name node 142. At this time, a plurality of data nodes 143 are interlocked with each other, and the data node 143 manages data on a block basis.
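The division of roles between the name node and the data nodes can be illustrated with a toy in-memory model. This is not the HDFS API and omits replication and networking; the class name `ToyNameNode`, the round-robin placement, and the 4-byte block size used in the usage lines are assumptions chosen only to keep the example small.

```python
from typing import Dict, List, Tuple


class ToyNameNode:
    """Toy stand-in: keeps metadata (which block lives on which data node)."""

    def __init__(self, data_nodes: List[Dict[int, bytes]], block_size: int) -> None:
        self.data_nodes = data_nodes      # each data node maps block id -> block bytes
        self.block_size = block_size
        self.metadata: Dict[str, List[Tuple[int, int]]] = {}  # path -> [(node index, block id)]
        self._next_block_id = 0

    def write(self, path: str, payload: bytes) -> None:
        placements = []
        for offset in range(0, len(payload), self.block_size):
            block_id = self._next_block_id
            self._next_block_id += 1
            node_index = block_id % len(self.data_nodes)  # round-robin placement (assumption)
            self.data_nodes[node_index][block_id] = payload[offset:offset + self.block_size]
            placements.append((node_index, block_id))
        self.metadata[path] = placements

    def read(self, path: str) -> bytes:
        # Look up the block locations, then fetch each block from its data node.
        return b"".join(self.data_nodes[n][b] for n, b in self.metadata[path])


nodes: List[Dict[int, bytes]] = [{}, {}, {}]
name_node = ToyNameNode(nodes, block_size=4)
name_node.write("/sensors/a", b"0123456789")
```

The name node holds no file contents in this model, only the list of block placements, while each data node holds only the blocks assigned to it.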

The display unit 150 displays an analysis result provided from the analysis unit 130 so that it can be confirmed from the outside for the purpose of providing information to the user.

In particular, upon receiving the interworking process message from the interworking process executing unit 160, the display unit 150 displays the analysis result provided from the analyzing unit 130 in real time.

According to an external request, the interworking process executing unit 160 transmits an interworking process message to the collecting unit 110, the preprocessing unit 120, the analysis unit 130, and the display unit 150 so that data interworking among these units can be performed. The interworking process message includes a data type.
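The delivery of one message to all four units can be sketched as below. The specification states only that the message includes a data type; the `InterworkingMessage` class, the `Unit` class with its `mode` property, and the function `start_interworking` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class InterworkingMessage:
    data_type: str  # the only field the specification mentions


class Unit:
    """Hypothetical base for the collecting, preprocessing, analysis, and display units."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.message: Optional[InterworkingMessage] = None

    def receive(self, message: InterworkingMessage) -> None:
        self.message = message

    @property
    def mode(self) -> str:
        # A unit that has received the message works in the real-time mode.
        return "interworking" if self.message is not None else "normal"


def start_interworking(units: List[Unit], data_type: str) -> InterworkingMessage:
    """Send the same message to every unit, as the executing unit would."""
    message = InterworkingMessage(data_type)
    for unit in units:
        unit.receive(message)
    return message


units = [Unit(n) for n in ("collecting", "preprocessing", "analysis", "display")]
modes_before = [u.mode for u in units]
message = start_interworking(units, "water_quality")
```

Keeping the message immutable means every unit sees the same data type for the duration of the real-time run.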

The configuration of the big data processing system according to the embodiment of the present invention and the functions of the respective configurations have been described above. Hereinafter, an operation of a big data processing system according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 3 is a flowchart showing a procedure according to a first operation of a big data processing system according to an embodiment of the present invention.

FIG. 3 shows a procedure in which the big data processing system processes data in batch. When the interworking process executing unit 160 does not transmit an interworking process message to the collecting unit 110, the preprocessing unit 120, the analysis unit 130, and the display unit 150, the system 100 operates in the normal mode.

First, the collecting unit 110 collects data (S300). The data collected by the collecting unit 110 in step S300 is preprocessed by the preprocessing unit 120 (S310). When the preprocessing is completed, the preprocessing unit 120 stores the preprocessed data in the storage unit 140 (S320).

After step S320, the analysis unit 130 receives the data stored in the storage unit 140 and performs a batch analysis on the received data (S330).

The analyzing unit 130 provides the analysis result to the display unit 150 and the display unit 150 displays the analysis result provided from the analyzing unit 130 (S340).
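Steps S300 to S340 can be followed in a short sketch in which each line stands for one step. The filtering rule, the use of an average as the analysis, the dictionary standing in for the storage unit 140, and the function name `run_batch` are placeholders assumed for illustration.

```python
from statistics import mean
from typing import Dict, List


def run_batch(raw: List[float], storage: Dict[str, List[float]]) -> float:
    """Normal (batch) mode: analysis runs on stored data, not on the live stream."""
    collected = list(raw)                            # S300: collect
    preprocessed = [v for v in collected if v >= 0]  # S310: preprocess (placeholder rule)
    storage["preprocessed"] = preprocessed           # S320: store the preprocessed data
    result = mean(storage["preprocessed"])           # S330: batch analysis of stored data
    print(f"analysis result: {result}")              # S340: display
    return result


storage: Dict[str, List[float]] = {}
batch_result = run_batch([1.0, -5.0, 3.0], storage)
```

The point of the sketch is the ordering: the analysis step reads from storage only after the whole preprocessed set has been written.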

FIG. 4 is a flowchart showing a procedure according to a second operation of the big data processing system according to the embodiment of the present invention.

FIG. 4 shows a procedure in which the big data processing system processes data in real time. The interworking process executing unit 160 transmits the interworking process message to the collecting unit 110, the preprocessing unit 120, the analysis unit 130, and the display unit 150 so that the big data processing system 100 enters the interworking processing mode (S400).

That is, the big data processing system 100 is set to operate in the interworking processing mode according to step S400.

When the system 100 operates in the interworking processing mode according to step S400, the collecting unit 110 collects data (S410) and transmits the data corresponding to the data type included in the interworking process message to the preprocessing unit 120 (S420).

Thereafter, the preprocessing unit 120 preprocesses the data transmitted from the collecting unit 110 according to the step S420 in real time (S430).

When the preprocessing according to step S430 is completed, the preprocessing unit 120 transmits the preprocessed data to the analysis unit 130 in real time (S440). The analysis unit 130 analyzes the data transmitted from the preprocessing unit 120 in real time (S450).

When the real-time analysis according to step S450 is completed, the analysis unit 130 provides the analysis result to the display unit 150 in real time (S460). The analysis unit 130 also provides the analysis result to the storage unit 140, where it is stored.

If the analysis result is provided according to the step S460, the display unit 150 displays the analysis result in real time (S470).
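Steps S410 to S470 can be sketched as a loop that carries each record through every stage before the next record is read. The record layout, the absolute-value preprocessing, the running average used as the analysis, and the function name `stream_pipeline` are placeholders assumed for illustration; lists stand in for the display unit 150 and the storage unit 140.

```python
from typing import Dict, Iterable, List


def stream_pipeline(records: Iterable[Dict], data_type: str,
                    storage: List[float], display: List[float]) -> None:
    """Interworking (real-time) mode: each record is analyzed as soon as it arrives."""
    total, count = 0.0, 0
    for record in records:                   # S410: collect
        if record["type"] != data_type:      # S420: forward only the requested data type
            continue
        value = abs(record["value"])         # S430: preprocess (placeholder rule)
        total += value                       # S440/S450: hand over and analyze per record
        count += 1
        result = total / count               # running average as a placeholder analysis
        display.append(result)               # S460/S470: provide and display the result
        storage.append(result)               # the result is also stored


records = [
    {"type": "water", "value": 2.0},
    {"type": "air", "value": 9.0},
    {"type": "water", "value": -4.0},
]
storage: List[float] = []
display: List[float] = []
stream_pipeline(records, "water", storage, display)
```

Unlike the batch procedure, a result reaches the display after every matching record rather than once at the end.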

While the present invention has been described with reference to exemplary embodiments, it is to be understood that the invention is not limited to the disclosed exemplary embodiments; on the contrary, various modifications, changes, and alterations can be made within the scope of the present invention.

Therefore, the embodiments described in the present invention and the accompanying drawings are intended to illustrate rather than limit the technical spirit of the present invention, and the scope of the technical idea of the present invention is not limited by these embodiments and accompanying drawings. The scope of protection of the present invention should be construed according to the claims, and all technical ideas within the scope of equivalents should be interpreted as being included in the scope of the present invention.

100: Big data processing system
110: Collecting unit
120: Preprocessing unit
130: Analysis unit
140: Storage unit
150: Display unit
160: Interworking process executing unit

Claims

1. A big data processing system comprising:
a collecting unit for collecting data through various paths;
a preprocessing unit for performing preprocessing on data transmitted from the collecting unit;
a storage unit for storing input data in a distributed manner;
an analysis unit for analyzing data transmitted from the preprocessing unit or data stored in the storage unit and generating an analysis result;
a display unit for receiving and displaying the analysis result; and
an interworking process executing unit for outputting an interworking process message so that the collecting unit, the preprocessing unit, the analysis unit, and the display unit operate in real time and process data.

2. The big data processing system of claim 1,
wherein the collecting unit collects structured, semi-structured, and unstructured data using a data collection tool.

3. The big data processing system of claim 1,
wherein the preprocessing unit includes a plurality of preprocessing modules for performing data preprocessing that differs depending on an application service.

4. The big data processing system of claim 1,
wherein the storage unit is a Hadoop Distributed File System (HDFS).

5. The big data processing system of claim 1,
wherein the collecting unit transmits data corresponding to the data type included in the interworking process message to the preprocessing unit in real time.

6. The big data processing system of claim 1,
wherein, when the interworking process executing unit outputs the interworking process message, the collecting unit transfers the collected data to the preprocessing unit in real time, the preprocessing unit delivers the preprocessed data to the analysis unit in real time, and the analysis unit analyzes the delivered data in real time and transmits the analysis result to the display unit.

7. A method of operating a big data processing system for storing and analyzing input data, the method comprising:
setting the system to operate in an interworking processing mode;
transmitting data collected by a collecting unit to a preprocessing unit in real time;
preprocessing, by the preprocessing unit, the transmitted data in real time and transferring the preprocessed data to an analysis unit;
analyzing, by the analysis unit, the data transmitted from the preprocessing unit in real time and generating an analysis result; and
displaying the analysis result generated by the analysis unit in real time on a display unit.

8. The method of claim 7,
wherein the setting of the system to operate in the interworking processing mode includes transmitting an interworking process message to the collecting unit, the preprocessing unit, the analysis unit, and the display unit.

9. The method of claim 8,
wherein the transmitting of the collected data to the preprocessing unit in real time includes transmitting, from among the collected data, data corresponding to the data type included in the interworking process message to the preprocessing unit in real time.

10. The method of claim 7,
further comprising storing the analysis result generated by the analysis unit in a storage unit.