KR20140125312A - Method for processing big data for building data-centric computing environment and apparatus for performing the method - Google Patents
- Publication number
- KR20140125312A (Application KR1020140045604A)
- Authority
- KR
- South Korea
- Prior art keywords
- data
- format
- filtering
- size
- filtered
- Prior art date
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/08—Error detection or correction by redundancy in data representation, e.g. by using checking codes
- G06F11/10—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
- G06F11/1004—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's to protect a block of data words, e.g. CRC or checksum
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0638—Organizing or formatting or addressing of data
- G06F3/064—Management of blocks
- G06F3/0641—De-duplication techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/0671—In-line storage system
- G06F3/0683—Plurality of storage devices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Software Systems (AREA)
- Computer Security & Cryptography (AREA)
- Quality & Reliability (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
BACKGROUND OF THE INVENTION
Recently, as the use of information processing terminals such as computers and laptops has expanded and mobile communication terminals such as smartphones and tablet PCs have become common, data-centric computing applications that use big data, such as social network services (SNS), smart grids, intelligent appliances, real-time streaming, and real-time decision making, are rapidly increasing.
In general, big data has a volume of several hundred petabytes to several tens of exabytes or more, is composed of various complex types, and is generated in real time. Big data technology refers to the next-generation technologies and architectures designed to extract value from a wide variety of large-scale data at low cost and to support high-speed acquisition, discovery, and analysis of data.
Accordingly, big data processing techniques for building a data-centric computing environment utilizing big data are actively researched.
Conventional big data processing technology attempts to quickly process big data by distributing big data using a plurality of hardware platforms.
However, the conventional big data processing technique described above has the problem that a bottleneck may occur due to the limited input/output speed available for collecting big data. In addition, hardware and software are expensive to purchase, because processors and storage must be continuously extended and licenses must be purchased to construct a distributed environment.
As described above, hardware techniques that expand hardware in order to process big data have mainly been used. However, if a software technique capable of efficiently distributing big data to a plurality of storage devices is applied, big data can be processed more efficiently.
An object of the present invention, devised to solve the above problems, is to provide a big data processing method capable of improving data collection speed and processing performance by minimizing bottlenecks in data collection using a multicore processor.
Another object of the present invention is to provide a big data processing apparatus capable of efficiently storing a large amount of data by classifying a storage device in which data is stored according to the access frequency and size of the data.
According to an aspect of the present invention, there is provided a method for processing big data in a data server for establishing a data-centric computing environment, the method comprising the steps of: collecting at least one piece of data that differs in format or size; converting the format of the collected at least one data into a predetermined specific data format so as to give uniformity to the at least one data; comparing checksum values of the at least one data converted into the specific data format to filter redundant data; and sorting and storing the at least one data based on the number of times the at least one data is filtered as redundant data and on the size of the at least one data.
Here, the data server may include a multi-core processor, a plurality of solid-state drive (SSD) storage devices, and a plurality of hard disk drives (HDD).
Here, collecting at least one data may minimize the bottleneck that may occur when collecting at least one data by using a multicore processor mounted on the data server.
In the filtering of the redundant data, a checksum value is calculated by applying a checksum algorithm to each piece of data converted into the specific data format, the calculated checksum values are compared with each other, and data having the same checksum value can be discriminated as duplicate data and filtered.
Here, the step of classifying and storing the at least one data may include calculating the number of times each piece of data is discriminated and filtered as redundant data and sorting the data based on the calculated number of filterings; when the number of filterings is large, it can be determined that the frequency of access to the data is high, and the data can be stored in the plurality of SSD storage devices.
Here, the step of classifying and storing the at least one data may sort the data, from which the redundant data has been filtered, on a size basis, and store data in the plurality of hard disks when its size is large.
According to another aspect of the present invention, there is provided a big data processing apparatus, implemented in a data server for establishing a data-centric computing environment, comprising: a data collecting unit for collecting at least one piece of data that differs in format or size; a format converter for converting the format of the collected at least one data into a predetermined specific data format so as to give uniformity to the at least one data; a duplicate removal unit for comparing checksum values of the converted data to filter redundant data; and a data storage unit for classifying and storing the at least one data based on the number of times the at least one data is filtered as redundant data and on the size of the at least one data.
According to the method and apparatus for processing big data according to the embodiment of the present invention, the bottleneck occurring in data collection using the multicore processor can be minimized, thereby improving the data acquisition speed and processing performance.
Also, it is possible to efficiently store a large amount of data by classifying a storage device in which data is stored according to the access frequency and size of the data.
FIG. 1 is an exemplary diagram illustrating a data-centric computing environment in which a method and apparatus for processing big data according to an embodiment of the present invention operate.
FIG. 2 is a flowchart illustrating a method of processing big data according to an embodiment of the present invention.
FIG. 3 is an exemplary diagram illustrating a specific data format for giving uniformity to data according to an embodiment of the present invention.
FIG. 4 is an exemplary diagram illustrating classification and storage of data according to an embodiment of the present invention.
FIG. 5 is a block diagram showing a big data processing apparatus according to an embodiment of the present invention.
While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the invention is not intended to be limited to the particular embodiments, but includes all modifications, equivalents, and alternatives falling within the spirit and scope of the invention. Like reference numerals are used for like elements in describing each drawing.
The terms first, second, A, B, etc. may be used to describe various elements, but the elements should not be limited by these terms. The terms are used only for the purpose of distinguishing one component from another. For example, without departing from the scope of the present invention, a first component may be referred to as a second component, and similarly, a second component may also be referred to as a first component. The term "and/or" includes any combination of a plurality of related listed items, or any one of a plurality of related listed items.
It is to be understood that when an element is referred to as being "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intervening elements may be present. In contrast, when an element is referred to as being "directly connected" or "directly coupled" to another element, it should be understood that there are no intervening elements.
The terminology used in this application is used only to describe specific embodiments and is not intended to limit the invention. The singular expressions include plural expressions unless the context clearly dictates otherwise. In the present application, terms such as "comprises" or "having" are used to specify the presence of stated features, numbers, steps, operations, elements, components, or combinations thereof, but do not preclude the presence or addition of one or more other features, numbers, steps, operations, elements, components, or combinations thereof.
Unless defined otherwise, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Terms such as those defined in commonly used dictionaries are to be interpreted as having a meaning consistent with their meaning in the context of the relevant art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in the present application.
Hereinafter, preferred embodiments according to the present invention will be described in detail with reference to the accompanying drawings.
FIG. 1 is an exemplary diagram illustrating a data-centric computing environment in which a method and apparatus for processing big data according to an embodiment of the present invention operate.
Referring to FIG. 1, a data-centric computing environment can be established by connecting a plurality of user terminals 20 to a data server 10.
A data-centric computing environment is a system in which a plurality of user terminals 20 are connected to the data server 10, so that big data generated by the user terminals 20 is collected, processed, and stored in the data server 10.
Accordingly, the big data processing method and apparatus are implemented in the data server 10.
Here, the data server 10 may be equipped with a multicore processor, a plurality of SSD storage devices, and a plurality of hard disk drives.
The user terminal 20 may be an information processing terminal such as a computer or laptop, or a mobile communication terminal such as a smartphone or tablet PC.
Conventionally, big data generated by a plurality of user terminals 20 has been processed rapidly by distributing it across a plurality of hardware platforms.
Thus, the present invention can combine a hardware technique, mounting a plurality of processors and storage devices to distribute big data, with a software technique for classifying and processing the data collected from a plurality of user terminals 20.
FIG. 2 is a flowchart illustrating a method of processing big data according to an embodiment of the present invention. FIG. 3 is a diagram illustrating a specific data format for giving unity to data according to an embodiment of the present invention.
FIG. 4 is an exemplary diagram for explaining classification and storage of data according to an embodiment of the present invention.
Referring to FIG. 2, the big data processing method includes collecting at least one piece of data that differs in format or size (S100), converting the collected data into a predetermined specific data format (S200), comparing checksum values to filter duplicate data (S300), and sorting and storing the data (S400).
The big data processing method can be performed by the big data processing apparatus 100 implemented in the data server 10.
The data server 10 may be equipped with a multicore processor, a plurality of SSD storage devices, and a plurality of hard disk drives. In particular, the multicore processor can perform data collection in parallel, so that the bottleneck occurring in data collection can be minimized.
At least one piece of data having a different format or size may be collected (S100).
At this time, as described above, since the data server 10 is equipped with a multicore processor, the bottleneck that may occur when collecting the at least one data can be minimized.
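As an illustrative sketch only (not part of the disclosed embodiment), the parallel collection step might look as follows. Here `collect` is a hypothetical placeholder for fetching one record from a user terminal, and a thread pool handles the I/O-bound fetches concurrently; on a multicore server, a process pool could similarly parallelize any CPU-bound parsing.

```python
from concurrent.futures import ThreadPoolExecutor

def collect(source):
    # Hypothetical placeholder: fetch one raw record from a user terminal.
    return source["payload"]

def collect_all(sources, workers=4):
    # Run the collection tasks concurrently so that a single slow source
    # does not stall the whole ingest pipeline.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(collect, sources))
```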
Here, the reason the collected data differs in format or size is that the types of the user terminals 20 and of the applications generating the data differ from one another.
Accordingly, the format of the at least one data may be converted into a predetermined specific data format, so that data collected in different formats from the plurality of user terminals 20 is given uniformity (S200).
For example, as shown in FIG. 3, the specific data format 30 may include a start code for indicating the start address of the data, a data size indicating the size of the data, a data identification ID for identifying the data, a record type indicating the type of records constituting the data, and a checksum value calculated from the content of the data. However, the present invention is not limited to this, and various formats capable of giving uniformity to data of different formats generated by the user terminals 20 may be used.
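A minimal sketch of how such a unified record might be laid out follows; the field widths, the 0xCAFE start marker, and the use of an MD5 digest as the checksum are all illustrative assumptions, since the text does not fix a particular checksum algorithm or field size.

```python
import hashlib
import struct

START_CODE = 0xCAFE  # assumed 2-byte start marker (not specified in the text)

def to_unified_record(data_id: int, record_type: int, payload: bytes) -> bytes:
    # Header: start code (2 B), data size (4 B), data ID (4 B),
    # record type (1 B), followed by the payload and its checksum.
    header = struct.pack(">HIIB", START_CODE, len(payload), data_id, record_type)
    checksum = hashlib.md5(payload).digest()
    return header + payload + checksum
```

Records in this shape carry their own size and checksum, so the later filtering and sorting steps can operate on uniform fields regardless of the source format.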
In this manner, the redundant data can be filtered by comparing the checksum values of the at least one data converted into the specific data format 30 (S300).
Here, the checksum value for the at least one data can be obtained either by reusing the checksum value calculated in the process of converting into the specific data format 30, or by applying a checksum algorithm to each piece of data.
Data having the same checksum value can be discriminated as duplicate data and filtered. However, since the checksum is computed by converting the data into hexadecimal values and summing them, data with different contents may produce the same checksum value and be mistakenly recognized as identical. Therefore, duplicate data may also be discriminated by referring to the size of the data or to the record type in addition to the checksum value, but the present invention is not limited thereto.
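The checksum comparison with a size and record-type tiebreak could be sketched as follows; the tuple key and the MD5 choice are illustrative assumptions, and the duplicate count kept per ID feeds the sorting step described next.

```python
import hashlib

def dedup_key(payload: bytes, record_type: int) -> tuple:
    # Combine the checksum with the payload size and record type so that
    # two records whose checksums collide are still told apart.
    return (hashlib.md5(payload).hexdigest(), len(payload), record_type)

def filter_duplicates(records):
    # records: iterable of (data_id, record_type, payload) triples.
    seen = set()
    unique = []
    filter_counts = {}
    for data_id, record_type, payload in records:
        key = dedup_key(payload, record_type)
        if key in seen:
            # Count how often each ID arrives as a duplicate; this count
            # later serves as the access-frequency signal.
            filter_counts[data_id] = filter_counts.get(data_id, 0) + 1
        else:
            seen.add(key)
            unique.append((data_id, record_type, payload))
    return unique, filter_counts
```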
If the redundant data is filtered through the above process, the at least one data may be sorted and stored based on the number of times the data has been filtered as redundant data and on its size (S400).
More specifically, as shown in FIG. 4A, the number of times each piece of data is discriminated as duplicate data and filtered may be calculated and mapped to the corresponding data identification ID. When the filtering counts are sorted in descending order, it can be seen that Data_1 was filtered the most, 82,852 times, and Data_5 the least, two times.
At this time, a large filtering count means that the data is frequently generated by a plurality of user terminals 20, and therefore it can be determined that the frequency of access to the data is high.
Accordingly, the contents of the data identification ID Data_1 can be stored in the SSD, which is an expensive storage device, and can be processed at a high speed. In this case, a solid state drive (SSD) may mean a semiconductor memory capable of processing data at a high speed, such as a NAND flash or a DRAM, but is not limited thereto.
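The frequency-based placement described above could be sketched as follows, using the filtering counts of FIG. 4A; the number of SSD slots is an assumed parameter, not a value from the text.

```python
def place_by_access(filter_counts, ssd_slots=2):
    # Rank data IDs by duplicate-filter count (a proxy for access
    # frequency); the hottest entries are assigned to the SSD tier.
    ranked = sorted(filter_counts, key=filter_counts.get, reverse=True)
    return {data_id: ("SSD" if rank < ssd_slots else "HDD")
            for rank, data_id in enumerate(ranked)}
```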
Also, after the duplicated data is filtered as shown in FIG. 4B, the data may be sorted in descending order based on size and mapped to the corresponding data identification ID. Thus, Data_4 has the largest capacity, 5,632 MB, and Data_3 the smallest, 1 MB.
Therefore, the contents of Data_4 having the largest capacity can be stored in the hard disk as a low-cost storage device.
In this way, data can be classified and stored in an expensive SSD and an inexpensive HDD according to the access frequency and size, thereby making it possible to effectively utilize the storage space and to construct a hybrid storage capable of quickly processing data.
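Combining both criteria, a single tier-selection rule for this hybrid storage might look like the sketch below; the thresholds are illustrative assumptions rather than values given in the text.

```python
def choose_tier(filter_count, size_mb, hot_threshold=1000, large_threshold=1024):
    # Hot data (frequently generated, hence frequently accessed) is served
    # from SSD unless it is so large that SSD capacity would be wasted;
    # large or cold data goes to the cheap high-capacity HDD tier.
    if filter_count >= hot_threshold and size_mb < large_threshold:
        return "SSD"
    return "HDD"
```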
FIG. 5 is a block diagram showing a big data processing apparatus according to an embodiment of the present invention.
Referring to FIG. 5, the big data processing apparatus 100 may include a data collecting unit 110, a data converting unit 120, a duplicate removal unit 130, and a data storage unit 140.
The big data processing apparatus 100 can be implemented in the data server 10 for establishing a data-centric computing environment.
Here, the data server 10 may be equipped with a multicore processor, a plurality of SSD storage devices, and a plurality of hard disk drives. In particular, the data server 10 can use the multicore processor to minimize the bottleneck occurring in data collection.
The data collecting unit 110 can collect at least one piece of data that differs in format or size. Particularly, in collecting data, the data collecting unit 110 can minimize the bottleneck phenomenon by using the multicore processor mounted on the data server 10.
The data converting unit 120 can convert the format of the collected data into a predetermined specific data format so as to give uniformity to the data. For example, the specific data format may include a start code for indicating the start address of the data, a data size indicating the size of the data, a data identification ID for identifying the data, a record type indicating the type of records constituting the data, and a checksum value calculated from the content of the data, but is not limited thereto.
The duplicate removal unit 130 can filter duplicate data by comparing the checksum values of the at least one data converted into the specific data format. Here, the duplicate removal unit 130 may include a checksum calculation module 131 and a data filtering module 133. The checksum calculation module 131 can obtain a checksum value either by reusing the checksum value calculated in the process of converting into the specific data format, or by applying a checksum algorithm to each piece of data. The data filtering module 133 can compare the calculated checksum values and discriminate and filter data having the same checksum value as duplicate data.
However, since the checksum is computed by converting the data into hexadecimal values and summing them, data with different contents may produce the same checksum value and be mistakenly recognized as identical. Therefore, duplicate data may also be discriminated by referring to the size of the data or to the record type in addition to the checksum value, but the present invention is not limited thereto.
When the redundant data is filtered, the data storage unit 140 can classify and store the at least one data based on the number of times the data has been filtered as redundant data and on its size. More specifically, it is possible to calculate the number of times each piece of data is discriminated and filtered as redundant data, and to sort the data based on the calculated filtering count. Here, a large filtering count means that the data is frequently generated by a plurality of user terminals 20, so it can be determined that the frequency of access to the data is high.
Accordingly, it is possible to store data having a large number of times of filtering in an SSD, which is an expensive storage device, so that the data can be processed at a high speed. In this case, a solid state drive (SSD) may mean a semiconductor memory capable of processing data at a high speed, such as a NAND flash or a DRAM, but is not limited thereto.
Also, after redundant data is filtered, the data can be sorted in descending order by size. Thus, data having the largest capacity can be stored in a hard disk, which is a low-cost storage device.
By categorizing and storing the data in the Gen3 PCIe SSD and the HDD according to access frequency and size, it is possible to utilize the storage space effectively and to construct a hybrid storage capable of processing data quickly.
According to the method and apparatus for processing big data according to the embodiment of the present invention, the bottleneck occurring in data collection using the multicore processor can be minimized, thereby improving the data acquisition speed and processing performance.
Also, it is possible to efficiently store a large amount of data by classifying a storage device in which data is stored according to the access frequency and size of the data.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit or scope of the invention as defined by the following claims.
10: data server 20: user terminal
30: Specific data format 100: Big data processing device
110: Data collecting unit 120: Data converting unit
130: Duplicate removal unit 131: Checksum calculation module
133: Data filtering module 140: Data storage unit
Claims (12)
Collecting at least one piece of data that is different in format or size from each other;
Converting a format of the collected at least one data to a predetermined specific data format so as to give uniformity to the at least one data;
Filtering redundant data by comparing checksum values of at least one data converted into the specific data format; And
And sorting and storing the at least one data based on the number of times the at least one data is filtered as redundant data and the size of the at least one data.
The data server comprising:
A multi-core processor, a plurality of SSD (Solid State Drive) storage devices, and a plurality of hard disk drives (HDDs) are mounted on the data server.
Wherein collecting the at least one data comprises:
Wherein the bottleneck phenomenon that may occur when collecting the at least one data is minimized by using a multicore processor mounted on the data server.
Wherein filtering the redundant data comprises:
A checksum value is calculated by applying a checksum algorithm to each of the at least one data converted into the specific data format, and the calculated checksum values are compared so that the at least one data having the same checksum value is judged as redundant data and filtered.
Wherein the classifying and storing the at least one data comprises:
Calculating the number of times that the at least one data is determined to be redundant data and filtered; sorting the at least one data based on the calculated number of times of filtering; and, when the number of times of filtering is large, determining that the frequency of access to the at least one data is high and storing the data in the plurality of SSD storage devices.
Wherein the classifying and storing the at least one data comprises:
And arranging the at least one data, from which the redundant data has been filtered, on a size basis, and storing the at least one data in the plurality of hard disks when the size of the at least one data is large.
A data collecting unit for collecting at least one data having a different format or size from each other;
A format converter for converting a format of the collected at least one data into a predetermined specific data format so as to give uniformity to the at least one data;
A duplicate removal unit for filtering duplicate data by comparing checksum values of at least one piece of data converted into the specific data format; And
And a data storage unit for classifying and storing the at least one data based on the number of times that the at least one data is filtered as redundant data and the size of the at least one data.
The data server comprising:
A multi-core processor, a plurality of SSD (Solid State Drive) storage devices, and a plurality of hard disk drives (HDDs) are mounted on the data server.
Wherein the data collecting unit comprises:
Wherein a bottleneck phenomenon that may occur when collecting the at least one data is minimized by using a multicore processor mounted on the data server.
Wherein the duplicate removal unit comprises:
A checksum calculation module for calculating a checksum value by applying a checksum algorithm to each of the at least one data converted into the specific data format; And
And a data filtering module for comparing the calculated checksum values and discriminating and filtering the at least one data having the same checksum value as duplicate data.
Wherein the data storage unit calculates the number of times that the at least one data is determined to be redundant data and filtered, sorts the at least one data based on the calculated number of times of filtering, and, when the number of times of filtering is large, determines that the frequency of access to the at least one data is high and stores the data in the plurality of SSD storage devices.
Wherein the data storage unit arranges the at least one data, from which the redundant data has been filtered, on a size basis, and stores the at least one data in the plurality of hard disks when the size of the at least one data is large.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020130041982 | 2013-04-17 | ||
KR20130041982 | 2013-04-17 |
Publications (1)
Publication Number | Publication Date |
---|---|
KR20140125312A true KR20140125312A (en) | 2014-10-28 |
Family
ID=51995190
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020140045604A KR20140125312A (en) | 2013-04-17 | 2014-04-16 | Method for processing big data for building data-centric computing environment and apparatus for performing the method |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR20140125312A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101678480B1 (en) | 2015-07-08 | 2016-11-22 | 주식회사 유비콤테크놀로지 | Field programmable gate array system and method for processing big data based on r language |
KR102411912B1 (en) * | 2021-11-17 | 2022-06-22 | (주)인에이블 | Apparatus for Processing Vehicle Blackbox Video and Driving Method Thereof |
-
2014
- 2014-04-16 KR KR1020140045604A patent/KR20140125312A/en not_active Application Discontinuation
Similar Documents
Publication | Publication Date | Title |
---|---|---|
TWI743458B (en) | Method, device and system for parallel execution of blockchain transactions | |
CN106708016B (en) | fault monitoring method and device | |
CN102694868B (en) | A kind of group system realizes and task dynamic allocation method | |
CN111966649B (en) | Lightweight online file storage method and device capable of efficiently removing weight | |
US8898422B2 (en) | Workload-aware distributed data processing apparatus and method for processing large data based on hardware acceleration | |
US11025271B2 (en) | Compression of high dynamic ratio fields for machine learning | |
WO2022088632A1 (en) | User data monitoring and analysis method, apparatus, device, and medium | |
US11836067B2 (en) | Hyper-converged infrastructure (HCI) log system | |
CN113485792B (en) | Pod scheduling method in kubernetes cluster, terminal equipment and storage medium | |
CN111241177A (en) | Data acquisition method, system and network equipment | |
WO2022007596A1 (en) | Image retrieval system, method and apparatus | |
KR20140125312A (en) | Method for processing big data for building data-centric computing environment and apparatus for performing the method | |
TW201931118A (en) | Data processing system and operating method thereof | |
Cheptsov | HPC in big data age: An evaluation report for java-based data-intensive applications implemented with Hadoop and OpenMPI | |
CN114598731B (en) | Cluster log acquisition method, device, equipment and storage medium | |
CN113760856B (en) | Database management method and device, computer readable storage medium and electronic equipment | |
CN113516506A (en) | Data processing method and device and electronic equipment | |
TW202315360A (en) | Microservice allocation method, electronic equipment, and storage medium | |
US8495033B2 (en) | Data processing | |
US20240135287A1 (en) | System and method for automated workload identification, workload model generation and deployment | |
US20140365681A1 (en) | Data management method, data management system, and data management apparatus | |
Yan et al. | Analysis of energy consumption of deduplication in storage systems | |
US20240134779A1 (en) | System and method for automated test case generation based on queuing curve analysis | |
US20240232057A9 (en) | System and method for automated test case generation based on queuing curve analysis | |
CN115225515B (en) | Network survivability analysis method and related equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WITN | Withdrawal due to no request for examination |