KR20140125312A - Method for processing big data for building data-centric computing environment and apparatus for performing the method - Google Patents
- Publication number
- KR20140125312A (Application KR1020140045604A)
- Authority
- KR
- South Korea
- Prior art keywords
- data
- format
- filtering
- size
- filtered
- Prior art date
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/08—Error detection or correction by redundancy in data representation, e.g. by using checking codes
- G06F11/10—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
- G06F11/1004—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's to protect a block of data words, e.g. CRC or checksum
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0638—Organizing or formatting or addressing of data
- G06F3/064—Management of blocks
- G06F3/0641—De-duplication techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/0671—In-line storage system
- G06F3/0683—Plurality of storage devices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Software Systems (AREA)
- Computer Security & Cryptography (AREA)
- Quality & Reliability (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
BACKGROUND OF THE INVENTION
Recently, as the use of information processing terminals such as computers and laptops has expanded and mobile communication terminals such as smartphones and tablet PCs have become common, data-centric computing applications that use big data, such as social network services (SNS), smart grids, intelligent appliances, real-time streaming, and real-time decision making, are rapidly increasing.
In general, big data has a volume of several hundred petabytes to several tens of exabytes or more, is composed of various complex types, and is generated in real time. Big data technology refers to the next-generation technologies and architectures designed to extract value from a wide variety of large-scale data at low cost and to support high-speed acquisition, discovery, and analysis of data.
Accordingly, big data processing techniques for building a data-centric computing environment utilizing big data are actively researched.
Conventional big data processing technology attempts to quickly process big data by distributing big data using a plurality of hardware platforms.
However, the conventional big data processing technique described above has the problem that a bottleneck may occur due to the limited input/output speed available for collecting big data. In addition, hardware and software are expensive to purchase, because processors and storage must be continuously extended and licenses must be purchased to construct a distributed environment.
As described above, hardware techniques that expand hardware in order to process big data have mainly been used. However, if a software technique capable of efficiently distributing big data to a plurality of storage devices is applied, big data can be processed more efficiently.
An object of the present invention, devised to solve the above problems, is to provide a big data processing method capable of improving data collection speed and processing performance by minimizing bottlenecks in data collection using a multicore processor.
Another object of the present invention is to provide a big data processing apparatus capable of efficiently storing a large amount of data by classifying a storage device in which data is stored according to the access frequency and size of the data.
According to an aspect of the present invention, there is provided a method for processing big data in a data server for establishing a data-centric computing environment, the method comprising the steps of: collecting at least one piece of data that differs in format or size; converting the format of the collected at least one data into a predetermined specific data format so as to give uniformity to the at least one data; comparing checksum values of the at least one data converted into the specific data format to filter redundant data; and sorting and storing the at least one data based on the number of times the at least one data is filtered as redundant data and on the size of the at least one data.
Here, the data server may include a multi-core processor, a plurality of solid-state drive (SSD) storage devices, and a plurality of hard disk drives (HDD).
Here, collecting at least one data may minimize the bottleneck that may occur when collecting at least one data by using a multicore processor mounted on the data server.
In the filtering of the redundant data, a checksum value is calculated by applying a checksum algorithm to each piece of data converted into the specific data format, the calculated checksum values are compared with each other, and data having the same checksum value can be discriminated as duplicate data and filtered.
Here, the step of classifying and storing the at least one data may include calculating the number of times each piece of data is discriminated and filtered as redundant data and sorting the data based on the calculated number of filterings; when the number of filterings is large, it can be determined that the frequency of access to the data is high, and the data can be stored in the plurality of SSD storage devices.
Here, the step of classifying and storing the at least one data may sort the data, from which the redundant data has been filtered, on a size basis, and store data in the plurality of hard disks when its size is large.
According to another aspect of the present invention, there is provided a big data processing apparatus, implemented in a data server for establishing a data-centric computing environment, comprising: a data collecting unit for collecting at least one piece of data that differs in format or size; a format converter for converting the format of the collected at least one data into a predetermined specific data format so as to give uniformity to the at least one data; a duplicate removal unit for comparing checksum values of the converted data to filter redundant data; and a data storage unit for classifying and storing the at least one data based on the number of times the at least one data is filtered as redundant data and on the size of the at least one data.
According to the method and apparatus for processing big data according to the embodiment of the present invention, the bottleneck occurring in data collection using the multicore processor can be minimized, thereby improving the data acquisition speed and processing performance.
Also, it is possible to efficiently store a large amount of data by classifying a storage device in which data is stored according to the access frequency and size of the data.
FIG. 1 is an exemplary diagram illustrating a data-centric computing environment in which a method and apparatus for processing big data according to an embodiment of the present invention operate.
FIG. 2 is a flowchart illustrating a method of processing big data according to an embodiment of the present invention.
FIG. 3 is an exemplary diagram illustrating a specific data format for giving uniformity to data according to an embodiment of the present invention.
FIG. 4 is an exemplary diagram illustrating classification and storage of data according to an embodiment of the present invention.
FIG. 5 is a block diagram showing a big data processing apparatus according to an embodiment of the present invention.
While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the invention is not intended to be limited to the particular embodiments, but includes all modifications, equivalents, and alternatives falling within the spirit and scope of the invention. Like reference numerals are used for like elements in describing each drawing.
The terms first, second, A, B, etc. may be used to describe various elements, but the elements should not be limited by these terms. The terms are used only for the purpose of distinguishing one component from another. For example, without departing from the scope of the present invention, a first component may be referred to as a second component, and similarly, a second component may also be referred to as a first component. The term "and/or" includes any combination of a plurality of related listed items, or any one of a plurality of related listed items.
It is to be understood that when an element is referred to as being "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intervening elements may be present. In contrast, when an element is referred to as being "directly connected" or "directly coupled" to another element, it should be understood that there are no intervening elements.
The terminology used in this application is used only to describe specific embodiments and is not intended to limit the invention. The singular expressions include plural expressions unless the context clearly dictates otherwise. In the present application, terms such as "comprises" or "having" are used to specify the presence of stated features, numbers, steps, operations, elements, components, or combinations thereof, but do not preclude the presence or addition of one or more other features, numbers, steps, operations, elements, components, or combinations thereof.
Unless defined otherwise, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Terms such as those defined in commonly used dictionaries are to be interpreted as having a meaning consistent with their meaning in the context of the relevant art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in the present application.
Hereinafter, preferred embodiments according to the present invention will be described in detail with reference to the accompanying drawings.
FIG. 1 is an exemplary diagram illustrating a data-centric computing environment in which a method and apparatus for processing big data according to an embodiment of the present invention operate.
Referring to FIG. 1, a data-centric computing environment can be established by connecting a plurality of user terminals 20 to a data server 10.
A data-centric computing environment is a system in which a plurality of user terminals 20 are connected to the data server 10, so that big data generated by the user terminals 20 is collected, processed, and stored in the data server 10.
Accordingly, the big data processing method and apparatus are implemented in the data server 10.
Here, the data server 10 may be equipped with a multicore processor, a plurality of SSD storage devices, and a plurality of hard disk drives.
The user terminal 20 may be an information processing terminal such as a computer or laptop, or a mobile communication terminal such as a smartphone or tablet PC.
Conventionally, big data generated by a plurality of user terminals 20 has been processed rapidly by distributing it across a plurality of hardware platforms.
Thus, the present invention can combine a hardware technique, mounting a plurality of processors and storage devices to distribute big data, with a software technique for classifying and processing the data collected from a plurality of user terminals 20.
FIG. 2 is a flowchart illustrating a method of processing big data according to an embodiment of the present invention. FIG. 3 is a diagram illustrating a specific data format for giving unity to data according to an embodiment of the present invention.
FIG. 4 is an exemplary diagram for explaining classification and storage of data according to an embodiment of the present invention.
Referring to FIG. 2, the big data processing method includes collecting at least one piece of data that differs in format or size (S100), converting the collected data into a predetermined specific data format (S200), comparing checksum values to filter duplicate data (S300), and sorting and storing the data (S400).
The big data processing method can be performed by the big data processing apparatus 100 implemented in the data server 10.
The data server 10 may be equipped with a multicore processor, a plurality of SSD storage devices, and a plurality of hard disk drives. In particular, the multicore processor can perform data collection in parallel, so that the bottleneck occurring in data collection can be minimized.
At least one piece of data having a different format or size may be collected (S100).
At this time, as described above, since the data server 10 is equipped with a multicore processor, the bottleneck that may occur when collecting the at least one data can be minimized.
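As an illustrative sketch only (not part of the disclosed embodiment), the parallel collection step might look as follows. Here `collect` is a hypothetical placeholder for fetching one record from a user terminal, and a thread pool handles the I/O-bound fetches concurrently; on a multicore server, a process pool could similarly parallelize any CPU-bound parsing.

```python
from concurrent.futures import ThreadPoolExecutor

def collect(source):
    # Hypothetical placeholder: fetch one raw record from a user terminal.
    return source["payload"]

def collect_all(sources, workers=4):
    # Run the collection tasks concurrently so that a single slow source
    # does not stall the whole ingest pipeline.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(collect, sources))
```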
Here, the reason the collected data differs in format or size is that the types of the user terminals 20 and of the applications generating the data differ from one another.
Accordingly, the format of the at least one data may be converted into a predetermined specific data format, so that data collected in different formats from the plurality of user terminals 20 is given uniformity (S200).
For example, as shown in FIG. 3, the specific data format 30 may include a start code for indicating the start address of the data, a data size indicating the size of the data, a data identification ID for identifying the data, a record type indicating the type of records constituting the data, and a checksum value calculated from the content of the data. However, the present invention is not limited to this, and various formats capable of giving uniformity to data of different formats generated by the user terminals 20 may be used.
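A minimal sketch of how such a unified record might be laid out follows; the field widths, the 0xCAFE start marker, and the use of an MD5 digest as the checksum are all illustrative assumptions, since the text does not fix a particular checksum algorithm or field size.

```python
import hashlib
import struct

START_CODE = 0xCAFE  # assumed 2-byte start marker (not specified in the text)

def to_unified_record(data_id: int, record_type: int, payload: bytes) -> bytes:
    # Header: start code (2 B), data size (4 B), data ID (4 B),
    # record type (1 B), followed by the payload and its checksum.
    header = struct.pack(">HIIB", START_CODE, len(payload), data_id, record_type)
    checksum = hashlib.md5(payload).digest()
    return header + payload + checksum
```

Records in this shape carry their own size and checksum, so the later filtering and sorting steps can operate on uniform fields regardless of the source format.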
In this manner, the redundant data can be filtered by comparing the checksum values of the at least one data converted into the specific data format 30 (S300).
Here, the checksum value for the at least one data can be obtained either by reusing the checksum value calculated in the process of converting into the specific data format 30, or by applying a checksum algorithm to each piece of data.
Data having the same checksum value can be discriminated as duplicate data and filtered. However, since the checksum is computed by converting the data into hexadecimal values and summing them, data with different contents may produce the same checksum value and be mistakenly recognized as identical. Therefore, duplicate data may also be discriminated by referring to the size of the data or to the record type in addition to the checksum value, but the present invention is not limited thereto.
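The checksum comparison with a size and record-type tiebreak could be sketched as follows; the tuple key and the MD5 choice are illustrative assumptions, and the duplicate count kept per ID feeds the sorting step described next.

```python
import hashlib

def dedup_key(payload: bytes, record_type: int) -> tuple:
    # Combine the checksum with the payload size and record type so that
    # two records whose checksums collide are still told apart.
    return (hashlib.md5(payload).hexdigest(), len(payload), record_type)

def filter_duplicates(records):
    # records: iterable of (data_id, record_type, payload) triples.
    seen = set()
    unique = []
    filter_counts = {}
    for data_id, record_type, payload in records:
        key = dedup_key(payload, record_type)
        if key in seen:
            # Count how often each ID arrives as a duplicate; this count
            # later serves as the access-frequency signal.
            filter_counts[data_id] = filter_counts.get(data_id, 0) + 1
        else:
            seen.add(key)
            unique.append((data_id, record_type, payload))
    return unique, filter_counts
```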
If the redundant data is filtered through the above process, the at least one data may be sorted and stored based on the number of times the data has been filtered as redundant data and on its size (S400).
More specifically, as shown in FIG. 4A, the number of times each piece of data is discriminated as duplicate data and filtered may be calculated and mapped to the corresponding data identification ID. When the filtering counts are sorted in descending order, it can be seen that Data_1 was filtered the most, 82,852 times, and Data_5 the least, two times.
At this time, a large filtering count means that the data is frequently generated by a plurality of user terminals 20, and therefore it can be determined that the frequency of access to the data is high.
Accordingly, the contents of the data identification ID Data_1 can be stored in the SSD, which is an expensive storage device, and can be processed at a high speed. In this case, a solid state drive (SSD) may mean a semiconductor memory capable of processing data at a high speed, such as a NAND flash or a DRAM, but is not limited thereto.
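The frequency-based placement described above could be sketched as follows, using the filtering counts of FIG. 4A; the number of SSD slots is an assumed parameter, not a value from the text.

```python
def place_by_access(filter_counts, ssd_slots=2):
    # Rank data IDs by duplicate-filter count (a proxy for access
    # frequency); the hottest entries are assigned to the SSD tier.
    ranked = sorted(filter_counts, key=filter_counts.get, reverse=True)
    return {data_id: ("SSD" if rank < ssd_slots else "HDD")
            for rank, data_id in enumerate(ranked)}
```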
Also, after the duplicated data is filtered as shown in FIG. 4B, the data may be sorted in descending order based on size and mapped to the corresponding data identification ID. Thus, Data_4 has the largest capacity, 5,632 MB, and Data_3 the smallest, 1 MB.
Therefore, the contents of Data_4 having the largest capacity can be stored in the hard disk as a low-cost storage device.
In this way, data can be classified and stored in an expensive SSD and an inexpensive HDD according to the access frequency and size, thereby making it possible to effectively utilize the storage space and to construct a hybrid storage capable of quickly processing data.
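Combining both criteria, a single tier-selection rule for this hybrid storage might look like the sketch below; the thresholds are illustrative assumptions rather than values given in the text.

```python
def choose_tier(filter_count, size_mb, hot_threshold=1000, large_threshold=1024):
    # Hot data (frequently generated, hence frequently accessed) is served
    # from SSD unless it is so large that SSD capacity would be wasted;
    # large or cold data goes to the cheap high-capacity HDD tier.
    if filter_count >= hot_threshold and size_mb < large_threshold:
        return "SSD"
    return "HDD"
```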
FIG. 5 is a block diagram showing a big data processing apparatus according to an embodiment of the present invention.
Referring to FIG. 5, the big data processing apparatus 100 may include a data collecting unit 110, a data converting unit 120, a duplicate removal unit 130, and a data storage unit 140.
The big data processing apparatus 100 can be implemented in the data server 10 for establishing a data-centric computing environment.
Here, the data server 10 may be equipped with a multicore processor, a plurality of SSD storage devices, and a plurality of hard disk drives. In particular, the data server 10 can use the multicore processor to minimize the bottleneck occurring in data collection.
The data collecting unit 110 can collect at least one piece of data that differs in format or size. Particularly, in collecting data, the data collecting unit 110 can minimize the bottleneck phenomenon by using the multicore processor mounted on the data server 10.
The data converting unit 120 can convert the format of the collected data into a predetermined specific data format so as to give uniformity to the data. For example, the specific data format may include a start code for indicating the start address of the data, a data size indicating the size of the data, a data identification ID for identifying the data, a record type indicating the type of records constituting the data, and a checksum value calculated from the content of the data, but is not limited thereto.
The duplicate removal unit 130 can filter duplicate data by comparing the checksum values of the at least one data converted into the specific data format. Here, the duplicate removal unit 130 may include a checksum calculation module 131 and a data filtering module 133. The checksum calculation module 131 can obtain a checksum value either by reusing the checksum value calculated in the process of converting into the specific data format, or by applying a checksum algorithm to each piece of data. The data filtering module 133 can compare the calculated checksum values and discriminate and filter data having the same checksum value as duplicate data.
However, since the checksum is computed by converting the data into hexadecimal values and summing them, data with different contents may produce the same checksum value and be mistakenly recognized as identical. Therefore, duplicate data may also be discriminated by referring to the size of the data or to the record type in addition to the checksum value, but the present invention is not limited thereto.
When the redundant data is filtered, the data storage unit 140 can classify and store the at least one data based on the number of times the data has been filtered as redundant data and on its size. More specifically, it is possible to calculate the number of times each piece of data is discriminated and filtered as redundant data, and to sort the data based on the calculated filtering count. Here, a large filtering count means that the data is frequently generated by a plurality of user terminals 20, so it can be determined that the frequency of access to the data is high.
Accordingly, it is possible to store data having a large number of times of filtering in an SSD, which is an expensive storage device, so that the data can be processed at a high speed. In this case, a solid state drive (SSD) may mean a semiconductor memory capable of processing data at a high speed, such as a NAND flash or a DRAM, but is not limited thereto.
Also, after redundant data is filtered, the data can be sorted in descending order by size. Thus, data having the largest capacity can be stored in a hard disk, which is a low-cost storage device.
By categorizing and storing the data in the Gen3 PCIe SSD and the HDD according to access frequency and size, it is possible to utilize the storage space effectively and to construct a hybrid storage capable of processing data quickly.
According to the method and apparatus for processing big data according to the embodiment of the present invention, the bottleneck occurring in data collection using the multicore processor can be minimized, thereby improving the data acquisition speed and processing performance.
Also, it is possible to efficiently store a large amount of data by classifying a storage device in which data is stored according to the access frequency and size of the data.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit or scope of the invention as defined by the following claims.
10: data server 20: user terminal
30: Specific data format 100: Big data processing device
110: Data collecting unit 120: Data converting unit
130: Duplicate removal unit 131: Checksum calculation module
133: Data filtering module 140: Data storage unit
Claims (12)
Collecting at least one piece of data that is different in format or size from each other;
Converting a format of the collected at least one data to a predetermined specific data format so as to give uniformity to the at least one data;
Filtering redundant data by comparing checksum values of at least one data converted into the specific data format; And
And sorting and storing the at least one data based on the number of times the at least one data is filtered as redundant data and the size of the at least one data.
The data server comprising:
A multi-core processor, a plurality of SSD (Solid State Drive) storage devices, and a plurality of hard disk drives (HDDs) are mounted on the data server.
Wherein collecting the at least one data comprises:
Wherein the bottleneck phenomenon that may occur when collecting the at least one data is minimized by using a multicore processor mounted on the data server.
Wherein filtering the redundant data comprises:
A checksum value is calculated by applying a checksum algorithm to each of the at least one data converted into the specific data format, and the calculated checksum values are compared so that the at least one data having the same checksum value is judged as redundant data and filtered.
Wherein the classifying and storing the at least one data comprises:
Calculating the number of times that the at least one data is determined to be redundant data and filtered; sorting the at least one data based on the calculated number of times of filtering; and, when the number of times of filtering is large, determining that the frequency of access to the at least one data is high and storing the data in the plurality of SSD storage devices.
Wherein the classifying and storing the at least one data comprises:
And arranging the at least one data, from which the redundant data has been filtered, on a size basis, and storing the at least one data in the plurality of hard disks when the size of the at least one data is large.
A data collecting unit for collecting at least one data having a different format or size from each other;
A format converter for converting a format of the collected at least one data into a predetermined specific data format so as to give uniformity to the at least one data;
A duplicate removal unit for filtering duplicate data by comparing checksum values of at least one piece of data converted into the specific data format; And
And a data storage unit for classifying and storing the at least one data based on the number of times that the at least one data is filtered as redundant data and the size of the at least one data.
The data server comprising:
A multi-core processor, a plurality of SSD (Solid State Drive) storage devices, and a plurality of hard disk drives (HDDs) are mounted on the data server.
Wherein the data collecting unit comprises:
Wherein a bottleneck phenomenon that may occur when collecting the at least one data is minimized by using a multicore processor mounted on the data server.
Wherein the duplicate removal unit comprises:
A checksum calculation module for calculating a checksum value by applying a checksum algorithm to each of the at least one data converted into the specific data format; And
And a data filtering module for comparing the calculated checksum values and discriminating and filtering the at least one data having the same checksum value as duplicate data.
Wherein the data storage unit calculates the number of times that the at least one data is determined to be redundant data and filtered, sorts the at least one data based on the calculated number of times of filtering, and, when the number of times of filtering is large, determines that the frequency of access to the at least one data is high and stores the data in the plurality of SSD storage devices.
Wherein the data storage unit arranges the at least one data, from which the redundant data has been filtered, on a size basis, and stores the at least one data in the plurality of hard disks when the size of the at least one data is large.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020130041982 | 2013-04-17 | ||
KR20130041982 | 2013-04-17 |
Publications (1)
Publication Number | Publication Date |
---|---|
KR20140125312A true KR20140125312A (en) | 2014-10-28 |
Family
ID=51995190
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020140045604A KR20140125312A (en) | 2013-04-17 | 2014-04-16 | Method for processing big data for building data-centric computing environment and apparatus for performing the method |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR20140125312A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR101678480B1 (en) | 2015-07-08 | 2016-11-22 | 주식회사 유비콤테크놀로지 | Field programmable gate array system and method for processing big data based on r language |
KR102411912B1 (en) * | 2021-11-17 | 2022-06-22 | (주)인에이블 | Apparatus for Processing Vehicle Blackbox Video and Driving Method Thereof |
-
2014
- 2014-04-16 KR KR1020140045604A patent/KR20140125312A/en not_active Application Discontinuation
Similar Documents
Publication | Publication Date | Title |
---|---|---|
TWI743458B (en) | Method, device and system for parallel execution of blockchain transactions | |
CN106708016B (en) | fault monitoring method and device | |
CN102694868B (en) | A kind of group system realizes and task dynamic allocation method | |
CN111966649B (en) | Lightweight online file storage method and device capable of efficiently removing weight | |
US8898422B2 (en) | Workload-aware distributed data processing apparatus and method for processing large data based on hardware acceleration | |
US11025271B2 (en) | Compression of high dynamic ratio fields for machine learning | |
WO2022088632A1 (en) | User data monitoring and analysis method, apparatus, device, and medium | |
US11836067B2 (en) | Hyper-converged infrastructure (HCI) log system | |
CN113485792B (en) | Pod scheduling method in kubernetes cluster, terminal equipment and storage medium | |
CN111241177A (en) | Data acquisition method, system and network equipment | |
WO2022007596A1 (en) | Image retrieval system, method and apparatus | |
KR20140125312A (en) | Method for processing big data for building data-centric computing environment and apparatus for performing the method | |
TW201931118A (en) | Data processing system and operating method thereof | |
Cheptsov | HPC in big data age: An evaluation report for java-based data-intensive applications implemented with Hadoop and OpenMPI | |
CN114598731B (en) | Cluster log acquisition method, device, equipment and storage medium | |
CN113760856B (en) | Database management method and device, computer readable storage medium and electronic equipment | |
CN113516506A (en) | Data processing method and device and electronic equipment | |
TW202315360A (en) | Microservice allocation method, electronic equipment, and storage medium | |
US8495033B2 (en) | Data processing | |
US20240135287A1 (en) | System and method for automated workload identification, workload model generation and deployment | |
US20140365681A1 (en) | Data management method, data management system, and data management apparatus | |
Yan et al. | Analysis of energy consumption of deduplication in storage systems | |
US20240134779A1 (en) | System and method for automated test case generation based on queuing curve analysis | |
US20240232057A9 (en) | System and method for automated test case generation based on queuing curve analysis | |
CN115225515B (en) | Network survivability analysis method and related equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WITN | Withdrawal due to no request for examination |