KR20190011353A - System for Retrieving, Processing, Converting, and Saving Data for Use As Big Data - Google Patents
- Publication number
- KR20190011353A (application KR1020170093312A)
- Authority
- KR
- South Korea
- Prior art keywords
- data
- log
- database
- unit
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2219—Large Object storage; Management thereof
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/258—Data format conversion from or to a database
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
Abstract
Description
The present invention relates to a system and method for collecting and processing various types of data and converting and storing them as big data, and more particularly, to a system and method for collecting various types of structured data (e.g., databases) and unstructured data (e.g., mail and logs), processing personal information or other information requiring protection in the data, and converting and storing the processed data as big data.
A variety of techniques have been developed for collecting, decomposing, and transmitting e-mail. For example, Patent No. 913288, Patent Application Publication No. 2011-0016219, No. 496771, and Patent Application Publication No. 2003-0028936 describe collecting e-mails and decomposing them into headers, body text, and attachments, but they do not mention a technique for converting and storing the decomposed e-mail data as big data. In addition, although Patent Application Publication No. 2013-0048087 discloses a technique for processing user information according to stored rules, it does not mention a technique for converting and storing the processed data as big data.
Meanwhile, a variety of techniques for collecting and utilizing big data have been developed, but an integrated technology for collecting and processing data from mail, logs, and databases within an enterprise has not yet been developed. Therefore, we developed a technology that collects various types of structured and unstructured data using one integrated module, processes information that needs to be protected, such as personal information, in this data, and then converts and stores the processed data as big data.
An object of the present invention is to provide a system and method for collecting various types of structured and unstructured data using one integrated module, processing information requiring protection, such as personal information, in the data, and converting and storing the processed data as big data.
In accordance with this object, an embodiment of the present invention provides a system for collecting, processing, converting, and storing data for use as big data, the system comprising: (1) a data collection processing unit that accesses a mail server, a log folder, and a database server to collect mail, log, and database data, and processes information contained in each collected data item according to a policy; (2) a message queue (MessageQueue) that transmits the processed mail, log, and database data to a data conversion unit; (3) the data conversion unit, which converts the mail, log, and database data transmitted through the message queue; (4) Kafka, which transmits the converted data to a data loading unit; and (5) the data loading unit, which stores the data in Hadoop according to the type of data transmitted through Kafka.
It is preferable that the data collection processing unit includes a mail collection processing unit, a log collection processing unit, and a database collection processing unit in parallel.
The mail collection processing unit extracts the header, attribute, and body of the collected mail data, and extracts the contents of the attached file when the mail data has the attached file.
If the body of the extracted mail data or the attached file contains information requiring protection, the mail collection processing unit processes the mail according to the policy inputted by the user.
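The extraction and policy-based processing described above can be sketched as follows. This is a minimal illustration using Python's standard `email` module, not the implementation of the invention: the `POLICY` table, its two regular expressions, the replacement tokens, and the function names `mask` and `extract_mail` are all assumptions introduced for the example, and attachments are assumed to be text.

```python
import re
from email import message_from_string

# Hypothetical user policy: name -> (pattern, replacement). Not taken from the patent.
POLICY = {
    "email": (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
    "phone": (re.compile(r"\b\d{2,3}-\d{3,4}-\d{4}\b"), "[PHONE]"),
}


def mask(text, policy=POLICY):
    """Replace every match of each policy pattern with its replacement token."""
    for pattern, replacement in policy.values():
        text = pattern.sub(replacement, text)
    return text


def extract_mail(raw):
    """Split a raw message into header, body, and attachments, masking body and attachment text."""
    msg = message_from_string(raw)
    headers = dict(msg.items())
    body_parts, attachments = [], []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        text = payload.decode(charset, errors="replace")
        filename = part.get_filename()
        if filename:
            attachments.append({"FileName": filename, "content": mask(text)})
        else:
            body_parts.append(mask(text))
    return {"header": headers, "body": "\n".join(body_parts), "attachments": attachments}
```

Headers are left untouched here because the text only names the body and the attached file as targets of the policy; a real policy might well cover more.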
The log collection processing unit checks the collected log file list on a line-by-line basis, and processes lines requiring exception processing.
The database collection processing unit checks and processes data of the collected database objects on a row-by-row basis.
The data conversion unit may include a mail conversion unit, a log conversion unit, and a database conversion unit in parallel.
The mail conversion unit performs a primary conversion of the mail data into a format for data storage, one mail message at a time, and then a secondary conversion into JSON format.
The log conversion unit performs a primary conversion of the log data into a format for data storage on a line-by-line basis, and then a secondary conversion into JSON format.
The database conversion unit performs a primary conversion of the database data into a format for data storage on a row-by-row basis, and then a secondary conversion into JSON format.
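The two-stage conversion can be illustrated for a single database row. The sketch below is only one possible reading: the intermediate `StorageRecord` class, its field names, and the functions `primary_convert` and `secondary_convert` are hypothetical, since the text does not specify the layout of the intermediate storage format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class StorageRecord:
    """Hypothetical intermediate storage-format record; field names are assumed."""
    type: str
    table: str
    row_key: str
    columns: list
    value: dict


def primary_convert(row, table, key_column):
    """Primary conversion: one database row -> one storage-format record."""
    return StorageRecord(
        type="database",
        table=table,
        row_key=str(row[key_column]),
        columns=list(row.keys()),
        value={name: str(cell) for name, cell in row.items()},
    )


def secondary_convert(record):
    """Secondary conversion: storage-format record -> JSON text."""
    return json.dumps(asdict(record), ensure_ascii=False)
```

For example, `secondary_convert(primary_convert({"id": 7, "name": "Kim"}, "customers", "id"))` yields one JSON document per row, which matches the row-by-row granularity described above.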
The data loader classifies the transmitted message according to the type of mail, log, or database, and stores the classified message in the mail, log, or database table of Hadoop, respectively.
Another embodiment of the present invention provides a method for collecting, processing, converting, and storing data for use as big data, the method comprising: (1) collecting mail, log, and database data by accessing a mail server, a log folder, and a database server, and processing information included in each collected data item according to a policy; (2) transmitting the processed mail, log, and database data to a data conversion unit through a message queue (MessageQueue); (3) converting the mail, log, and database data transmitted through the message queue; (4) transmitting the converted data to a data loading unit via Kafka; and (5) storing the data in Hadoop according to the type of data transmitted via Kafka.
Preferably, the data collection processing step includes a mail collection processing step, a log collection processing step, and a database collection processing step in parallel.
In the mail collection processing step, the header, attribute, and body of the collected mail data are extracted. When the mail data has the attached file, the contents of the attached file are also extracted.
If the body of the extracted mail data or the attached file contains information requiring protection, the mail collection processing step processes that information according to the policy input by the user.
In the log collection processing step, the collected log file list is checked on a line-by-line basis, and a line requiring exception processing is processed.
In the database collection processing step, the data of the collected database object is checked and processed in a row unit.
Preferably, the data conversion step includes a mail conversion step, a log conversion step, and a database conversion step in parallel.
In the mail conversion step, the mail data first undergoes a primary conversion into a format for data storage, one mail message at a time, and then a secondary conversion into JSON format.
In the log conversion step, the log data first undergoes a primary conversion into a format for data storage on a line-by-line basis, and then a secondary conversion into JSON format.
In the database conversion step, the database data first undergoes a primary conversion into a format for data storage on a row-by-row basis, and then a secondary conversion into JSON format.
In the data storing step, the transmitted messages are classified according to the type of mail, log, or database, and are stored in the mail, log, or database table of Hadoop, respectively.
The system and method according to embodiments of the present invention are efficient because various types of data, such as mail, logs, and databases, can be collected, processed, converted, and stored as big data without developing separate modules. In addition, since information requiring protection, such as personal information, is securely processed in attached files as well as in the body of the mail data, important corporate and personal information can be prevented from leaking to the outside.
FIG. 1 is a schematic block diagram of a system according to an embodiment of the present invention.
FIG. 2 is a configuration diagram of a mail collection processing unit in a system according to an embodiment of the present invention.
FIG. 3 is a flowchart of a mail collection processing unit in a system according to an embodiment of the present invention.
FIG. 4 is a block diagram of a log collection processing unit in a system according to an embodiment of the present invention.
FIG. 5 is a configuration diagram of a database collection processing unit in a system according to an embodiment of the present invention.
FIG. 6 is a configuration diagram of a mail conversion unit of the system according to an embodiment of the present invention.
FIG. 7 is a configuration diagram of a log conversion unit in a system according to an embodiment of the present invention.
FIG. 8 is a configuration diagram of a database conversion unit in a system according to an embodiment of the present invention.
FIG. 9 is a configuration diagram of a data loading unit of a system according to an embodiment of the present invention.
A system for collecting, processing, converting, and storing data for use as big data according to an embodiment of the present invention includes: (1) a data collection processing unit that accesses a mail server, a log folder, and a database server to collect mail, log, and database data, and processes information included in each collected data item according to a policy; (2) a message queue (MessageQueue) that transmits the processed mail, log, and database data to a data conversion unit; (3) the data conversion unit, which converts the mail, log, and database data transmitted through the message queue; (4) Kafka, which transmits the converted data to a data loading unit; and (5) the data loading unit, which stores the data in Hadoop according to the type of data transmitted through Kafka.
Hereinafter, a system for collecting, processing, converting, and storing data for use as big data according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings.
FIG. 1 is a schematic block diagram of a system according to an embodiment of the present invention.
The data
The mail
Data processed in the mail
The
The
The data converted by the
The
FIG. 2 is a configuration diagram of a mail collection processing unit in a system according to an embodiment of the present invention.
The mail
FIG. 3 is a flowchart of a mail collection processing unit in a system according to an embodiment of the present invention.
The mail
Then, when information requiring protection such as personal information is detected in the extracted body text or the contents of the attached file, the mail
The mail data that has undergone such processing is collected again and sent in parallel to the message queue.
FIG. 4 is a block diagram of a log collection processing unit in a system according to an embodiment of the present invention.
The log
In addition, the log
The log data that has undergone such a process is transmitted in parallel to the message queue on a line-by-line basis.
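Line-by-line log collection of this kind might look like the sketch below. The standard-library `queue.Queue` stands in for the message queue, and the rule for lines requiring exception processing (blank lines and `#` comments are skipped) is purely an assumption, as is every name in the code; the sketch also sends lines sequentially rather than in parallel.

```python
import queue
import re

# Hypothetical rule: blank lines and '#' comment lines are treated as exceptions and skipped.
EXCEPTION_PATTERN = re.compile(r"^\s*(#|$)")


def collect_log_lines(lines, out_queue, source="app.log"):
    """Check each log line and put the non-exception lines on the queue, one message per line."""
    sent = 0
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if EXCEPTION_PATTERN.match(line):
            continue  # exception handling is reduced to skipping in this sketch
        out_queue.put({"type": "log", "source": source, "line_no": line_no, "text": line})
        sent += 1
    return sent
```

Keeping the original line number in each message lets a later stage derive a row key or trace a record back to its source file.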
FIG. 5 is a configuration diagram of a database collection processing unit in a system according to an embodiment of the present invention.
The database
The collected database data is then transmitted to the message queue row by row.
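Row-by-row transmission can be sketched with an in-memory SQLite database standing in for the database server and `queue.Queue` standing in for the message queue. Both substitutions, along with the function name `collect_rows` and the message layout, are assumptions made for illustration.

```python
import queue
import sqlite3


def collect_rows(conn, table, out_queue):
    """Read every row of a table and put one message per row on the queue."""
    # The table name is interpolated directly, so it is assumed to come from trusted configuration.
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [description[0] for description in cursor.description]
    count = 0
    for row in cursor:
        out_queue.put({"type": "database", "table": table, "row": dict(zip(columns, row))})
        count += 1
    return count
```

Iterating over the cursor instead of calling `fetchall()` keeps memory use flat for large tables, which fits the row-unit processing described in the text.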
FIG. 6 is a configuration diagram of a mail conversion unit of the system according to an embodiment of the present invention.
The
The data converted into the JSON format is transferred to the
FIG. 7 is a configuration diagram of a log conversion unit in a system according to an embodiment of the present invention.
The
The data converted into the JSON format is transmitted to the
FIG. 8 is a configuration diagram of a database conversion unit in a system according to an embodiment of the present invention.
The
The data converted into the JSON format is transmitted to the
FIG. 9 is a configuration diagram of a data loading unit of a system according to an embodiment of the present invention.
The
The format of the
More specifically, the type indicates the type of message to be stored in the HBase table, and is one of mail, log, or database.
The table is the name of the table to load when loading data into HBase. In the case of mail, the data is decomposed into header, attribute, and body tables. In the case of logs and databases, each log collection index and database collection index are added and stored in the table.
The row key is the row key (index value) of HBase.
The columns are the list of columns in the table that will be stored in HBase. For example, the columns of the mail body table consist of a mail key (MailKey), an attachment file name (FileName), and the body content (including the attachment contents in the case of an attachment).
The start date time is the data analysis start time.
The transmission date and time is the time when data was transmitted from the message queue.
The value is a data object to be stored in HBase and consists of key-value pairs. The HBase column name is the key, and the data to be entered into the column is the value. For example, FileName, the attachment file column of the mail body table, is the key, and the actual attachment file name is the value.
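A message carrying the fields listed above (type, table, row key, columns, start date and time, transmission date and time, value) could be assembled as in this sketch for the mail body table. Only `MailKey` and `FileName` appear in the text; the dictionary key names, the table name `mail_body`, the `Content` column name, and the function name are hypothetical.

```python
import json
from datetime import datetime, timezone


def build_mail_body_message(mail_key, file_name, content, start_time):
    """Assemble one loader message for the mail body table (key names are assumed)."""
    return {
        "type": "mail",
        "table": "mail_body",  # hypothetical HBase table name
        "row_key": mail_key,
        "columns": ["MailKey", "FileName", "Content"],
        "start_datetime": start_time.isoformat(),
        "send_datetime": datetime.now(timezone.utc).isoformat(),
        # Each HBase column name is a key; the data for that column is the value.
        "value": {"MailKey": mail_key, "FileName": file_name, "Content": content},
    }
```

Serializing the result with `json.dumps` gives the JSON text that would travel through Kafka to the data loading unit.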
The
If the classified message type is log, it is stored in the HBase table in the form of a log message (HBase table name, row key, column information (log fields), value (log field - log field value)).
If the classified message type is a database, it is stored in the HBase table in the form of a database message (HBase table name, row key, column information (column of the table or view to be collected), value (column-column value)).
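The classify-then-store behavior of the data loading unit can be sketched as below. An in-memory dictionary stands in for HBase so that the example runs without a cluster; the `InMemoryStore` class, the message key names, and `load_message` are assumptions, not the HBase client API.

```python
import json
from collections import defaultdict

VALID_TYPES = {"mail", "log", "database"}


class InMemoryStore:
    """Stand-in for HBase: table name -> row key -> {column: value}."""

    def __init__(self):
        self.tables = defaultdict(dict)

    def put(self, table, row_key, values):
        self.tables[table].setdefault(row_key, {}).update(values)


def load_message(raw_json, store):
    """Classify one JSON message by type and store its column values under its row key."""
    msg = json.loads(raw_json)
    msg_type = msg.get("type")
    if msg_type not in VALID_TYPES:
        raise ValueError(f"unknown message type: {msg_type!r}")
    # Keep only the columns the message declares, in declared order.
    values = {column: msg["value"].get(column) for column in msg["columns"]}
    store.put(msg["table"], msg["row_key"], values)
    return msg_type
```

Rejecting unknown types up front keeps malformed messages out of the mail, log, and database tables instead of silently writing them somewhere unexpected.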
10: Mail server
11: IMAP
12: POP3
13: EML file
20: Log folder
30: Database server
100: Data collection processing unit
110: mail collection processing unit
120: log collection processing unit
130: Database collection processing unit
200: Message Queue
300: Data conversion unit
310: mail conversion unit
320: log conversion unit
330: Database conversion unit
400: Kafka
500: Data loading unit
600: Hadoop
Claims (11)
(2) a message queue (MessageQueue) for transmitting the processed mail, log, and database data to the data conversion unit;
(3) a data conversion unit for converting mail, log, and database data transmitted through the message queue;
(4) Kafka transmitting the converted data to the data loading unit; and
(5) a data loading unit for storing the data in Hadoop according to the type of data transmitted via the Kafka;
A system for collecting, processing, converting and storing data for use as Big Data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020170093312A KR20190011353A (en) | 2017-07-24 | 2017-07-24 | System for Retrieving, Processing, Converting, and Saving Data for Use As Big Data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020170093312A KR20190011353A (en) | 2017-07-24 | 2017-07-24 | System for Retrieving, Processing, Converting, and Saving Data for Use As Big Data |
Publications (1)
Publication Number | Publication Date |
---|---|
KR20190011353A (en) | 2019-02-07 |
Family
ID=65367040
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020170093312A KR20190011353A (en) | 2017-07-24 | 2017-07-24 | System for Retrieving, Processing, Converting, and Saving Data for Use As Big Data |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR20190011353A (en) |
-
2017
- 2017-07-24 KR KR1020170093312A patent/KR20190011353A/en not_active Application Discontinuation
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111338821A (en) * | 2020-02-25 | 2020-06-26 | 北京思特奇信息技术股份有限公司 | Method, system and electronic equipment for realizing data load balance |
US11360985B1 (en) | 2020-12-28 | 2022-06-14 | Coupang Corp. | Method for loading data and electronic apparatus therefor |
US11734284B2 (en) | 2020-12-28 | 2023-08-22 | Coupang Corp. | Method for loading data and electronic apparatus therefor |
KR20220168420A (en) * | 2021-06-16 | 2022-12-23 | 인터리젠 주식회사 | Real-time abnormal symptoms detection system and method by in-memory |
KR102640115B1 (en) * | 2023-05-19 | 2024-02-23 | 쿠팡 주식회사 | Operating method for electronic apparatus for providing information and electronic apparatus supporting thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
A201 | Request for examination | ||
E902 | Notification of reason for refusal | ||
E601 | Decision to refuse application |