KR20190011353A - System for Retrieving, Processing, Converting, and Saving Data for Use As Big Data - Google Patents

System for Retrieving, Processing, Converting, and Saving Data for Use As Big Data

Info

Publication number
KR20190011353A
Authority
KR
South Korea
Prior art keywords
data
mail
log
database
unit
Prior art date
Application number
KR1020170093312A
Other languages
Korean (ko)
Inventor
최동효
견지현
이정주
Original Assignee
주식회사 닷넷소프트
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 주식회사 닷넷소프트 filed Critical 주식회사 닷넷소프트
Priority to KR1020170093312A priority Critical patent/KR20190011353A/en
Publication of KR20190011353A publication Critical patent/KR20190011353A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2219 Large Object storage; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes

Abstract

Provided is a system for collecting, processing, converting, and storing data to use as big data, including: a data collection process unit collecting a mail, a log, and database data by accessing to a mail server, a log folder, and a database server, and processing information included in collected data in accordance with a policy; a MessageQueue transmitting the processed mail, log, and database data to a data conversion unit; the data conversion unit converting the transmitted mail, log, and database data through the MessageQueue; a Kafka transmitting the converted data to a data load unit; and the data load unit storing the transmitted data to a Hadoop in accordance with a type of the data transmitted through the Kafka.

Description

System for Retrieving, Processing, Converting, and Saving Data for Use as Big Data

The present invention relates to a system and method for collecting and processing various types of data and converting and storing them as big data, and more particularly, to a system and method for collecting various types of structured data (e.g., databases) and unstructured data (e.g., mail and logs), processing personal information or other information requiring protection in the data, and converting and storing the processed data as big data.

A variety of techniques have been developed for collecting, decomposing, and transmitting e-mail. For example, Korean Patent No. 913288, Korean Patent Application Publication No. 2011-0016219, Korean Patent No. 496771, and Korean Patent Application Publication No. 2003-0028936 collect e-mails and decompose them into headers, body text, and attachments, but they do not mention a technique for converting and storing the decomposed e-mail data as big data. In addition, although Korean Patent Application Publication No. 2013-0048087 discloses a technique for processing user information according to stored rules, it likewise does not mention a technique for converting and storing the processed data as big data.

Meanwhile, a variety of techniques for collecting and utilizing big data have been developed, but an integrated technology for collecting and processing the mail, log, and database data of an enterprise has not yet been developed. The inventors therefore developed a technology that collects various types of structured and unstructured data using one integrated module, processes information that needs to be protected, such as personal information, in this data, and then converts and stores the processed data as big data.

Korean Patent No. 913288 (issued on August 21, 2009)
Korean Patent Application Publication No. 2011-0016219 (published on February 17, 2011)
Korean Patent No. 496771 (issued on June 23, 2005)
Korean Patent Application Publication No. 2003-0028936 (published on April 11, 2003)
Korean Patent Application Publication No. 2013-0048087 (published on May 9, 2013)

An object of the present invention is to provide a system and method for collecting various types of structured and unstructured data using one integrated module, processing information requiring protection, such as personal information, in the data, and converting and storing the processed data as big data.

In accordance with this object, an embodiment of the present invention provides a system for collecting, processing, converting, and storing data for use as big data, including: (1) a data collection processing unit that accesses a mail server, a log folder, and a database server to collect mail, log, and database data, and processes the information contained in each collected data according to a policy; (2) a message queue (MessageQueue) that transmits the processed mail, log, and database data to a data conversion unit; (3) the data conversion unit, which converts the mail, log, and database data transmitted through the message queue; (4) Kafka, which transmits the converted data to a data loading unit; and (5) the data loading unit, which stores the data in Hadoop according to the type of the data transmitted through the Kafka.

It is preferable that the data collection processing unit includes a mail collection processing unit, a log collection processing unit, and a database collection processing unit in parallel.

The mail collection processing unit extracts the header, attribute, and body of the collected mail data, and extracts the contents of the attached file when the mail data has the attached file.

If the body of the extracted mail data or the attached file contains information requiring protection, the mail collection processing unit processes that information according to the policy entered by the user.

The log collection processing unit checks the collected log file list on a line-by-line basis, and processes lines requiring exception processing.

The database collection processing unit checks and processes data of the collected database objects on a row-by-row basis.

The data conversion unit may include a mail conversion unit, a log conversion unit, and a database conversion unit in parallel.

The mail conversion unit performs a primary conversion of the mail data into a format for the data loading unit in units of a mail message, and then performs a secondary conversion into JSON format.

The log conversion unit performs a primary conversion of the log data into a format for the data loading unit on a line-by-line basis, and then performs a secondary conversion into JSON format.

The database conversion unit performs a primary conversion of the database data into a format for the data loading unit on a row-by-row basis, and then performs a secondary conversion into JSON format.

The data loading unit classifies each transmitted message according to its type (mail, log, or database), and stores the classified message in the mail, log, or database table of Hadoop, respectively.

Another embodiment of the present invention provides a method for collecting, processing, converting, and storing data for use as big data, including: (1) a data collection processing step of collecting mail, log, and database data by accessing a mail server, a log folder, and a database server, and processing the information included in each collected data according to a policy; (2) transmitting the processed mail, log, and database data to a data conversion unit through a message queue (MessageQueue); (3) a data conversion step of converting the mail, log, and database data transmitted through the message queue; (4) transmitting the converted data to a data loading unit via Kafka; and (5) a data storing step of storing the data in Hadoop according to the type of the data transmitted via the Kafka.

Preferably, the data collection processing step includes a mail collection processing step, a log collection processing step, and a database collection processing step in parallel.

In the mail collection processing step, the header, attribute, and body of the collected mail data are extracted. When the mail data has the attached file, the contents of the attached file are also extracted.

In the mail collection processing step, if the body of the extracted mail data or the attached file contains information requiring protection, that information is processed according to the policy entered by the user.

In the log collection processing step, the collected log file list is checked on a line-by-line basis, and lines requiring exception processing are processed.

In the database collection processing step, the data of the collected database objects is checked and processed row by row.

Preferably, the data conversion step includes a mail conversion step, a log conversion step, and a database conversion step in parallel.

In the mail conversion step, the mail data first undergoes a primary conversion into a format for the data storing step in units of mail messages, and then a secondary conversion into JSON format.

In the log conversion step, the log data first undergoes a primary conversion into a format for the data storing step line by line, and then a secondary conversion into JSON format.

In the database conversion step, the database data first undergoes a primary conversion into a format for the data storing step row by row, and then a secondary conversion into JSON format.

In the data storing step, the transmitted messages are classified according to their type (mail, log, or database), and are stored in the mail, log, or database table of Hadoop, respectively.

The system and method according to embodiments of the present invention are efficient because various types of data such as mail, log, and database data can be collected, processed, converted, and stored as big data without developing a separate module for each type. In addition, since information requiring protection, such as personal information, is securely processed not only in the body of the mail data but also in the contents of attached files, important information of enterprises and individuals can be prevented from leaking to the outside.

FIG. 1 is a schematic block diagram of a system according to an embodiment of the present invention.
FIG. 2 is a configuration diagram of a mail collection processing unit in a system according to an embodiment of the present invention.
FIG. 3 is a flowchart of a mail collection processing unit in a system according to an embodiment of the present invention.
FIG. 4 is a block diagram of a log collection processing unit in a system according to an embodiment of the present invention.
FIG. 5 is a configuration diagram of a database collection processing unit in a system according to an embodiment of the present invention.
FIG. 6 is a configuration diagram of a mail conversion unit of the system according to an embodiment of the present invention.
FIG. 7 is a configuration diagram of a log conversion unit in a system according to an embodiment of the present invention.
FIG. 8 is a configuration diagram of a database conversion unit in a system according to an embodiment of the present invention.
FIG. 9 is a configuration diagram of a data loading unit of a system according to an embodiment of the present invention.

A system for collecting, processing, converting, and storing data for use as big data according to an embodiment of the present invention includes (1) a data collection processing unit that accesses a mail server, a log folder, and a database server to collect mail, log, and database data, and processes the information included in each collected data according to a policy; (2) a message queue (MessageQueue) that transmits the processed mail, log, and database data to a data conversion unit; (3) the data conversion unit, which converts the mail, log, and database data transmitted through the message queue; (4) Kafka, which transmits the converted data to a data loading unit; and (5) the data loading unit, which stores the data in Hadoop according to the type of the data transmitted through the Kafka.

Hereinafter, a system for collecting, processing, converting, and storing data for use as big data according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a schematic block diagram of a system according to an embodiment of the present invention.

The data collection processing unit 100 preferably includes a mail collection processing unit 110, a log collection processing unit 120, and a database collection processing unit 130 in parallel.

The mail collection processing unit 110 accesses the mail server 10 to collect mail data, and then processes information requiring protection such as personal information included in the collected data according to a registered policy. The log collection processing unit 120 accesses the log folder 20, collects a log file list, and processes information that requires exception processing. The database collection processing unit 130 accesses the database server 30 to collect and process data of the database objects.

Data processed in the mail collection processing unit 110, the log collection processing unit 120 and the database collection processing unit 130 are transmitted to the data conversion unit 300 through a message queue 200.

The data conversion unit 300 preferably includes the mail conversion unit 310, the log conversion unit 320, and the database conversion unit 330 in parallel.

The mail conversion unit 310 converts the mail data transmitted from the mail collection processing unit 110 through the message queue 200 into data for the data loading unit 500. The log conversion unit 320 converts the log data transmitted from the log collection processing unit 120 through the message queue 200 into data for the data loading unit 500. The database conversion unit 330 converts the database data transmitted from the database collection processing unit 130 through the message queue 200 into data for the data loading unit 500.

The data converted by the mail conversion unit 310, the log conversion unit 320, and the database conversion unit 330 is transmitted to the data loading unit 500 through the Kafka 400.

The data loading unit 500 classifies each message transmitted through the Kafka 400 by type (mail, log, or database), and stores the classified messages in Hadoop 600 accordingly.
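The flow of FIG. 1 can be illustrated with a minimal Python sketch. This is not the patented implementation: the in-memory queues below are hypothetical stand-ins for the message queue 200 and the Kafka 400, a plain dictionary stands in for Hadoop 600, and the function name run_pipeline is an assumption made for illustration.

```python
import json
import queue


def run_pipeline(items):
    """Push (type, payload) pairs through collect -> convert -> load stages."""
    # Hypothetical stand-ins: queue.Queue for the message queue (200) and for
    # Kafka (400); a dict of lists for the Hadoop tables (600).
    message_queue = queue.Queue()
    kafka_topic = queue.Queue()
    tables = {"mail": [], "log": [], "database": []}

    # Collection processing unit (100): hand processed data to the message queue.
    for kind, payload in items:
        message_queue.put({"type": kind, "payload": payload})

    # Conversion unit (300): convert each item to JSON and forward it.
    while not message_queue.empty():
        kafka_topic.put(json.dumps(message_queue.get()))

    # Loading unit (500): classify each message by type and store it.
    while not kafka_topic.empty():
        message = json.loads(kafka_topic.get())
        tables[message["type"]].append(message["payload"])

    return tables
```

In a real deployment each stage would run as a separate process and the two queues would be network services; the sketch only shows the order of the hand-offs.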

FIG. 2 is a configuration diagram of a mail collection processing unit in a system according to an embodiment of the present invention.

The mail collection processing unit 110 may collect mail in real time or collect stored mail files. Real-time collection uses the Internet Message Access Protocol (IMAP; 11) or Post Office Protocol 3 (POP3; 12). When the user enters the host, port, and journaling mailbox information, mail can be collected in real time. Stored mail, on the other hand, is collected from stored EML files 13 (text files in MIME format that include the e-mail header and body text). Stored EML files can be collected by setting their path on the mail collection page.
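As a rough illustration of the stored-file path, the following Python sketch parses the text of one EML file with the standard email package. The sample message, the function name parse_eml, and the shape of the returned dictionary are assumptions made for illustration; the patent does not specify them.

```python
from email import message_from_string
from email.policy import default

# Hypothetical sample standing in for the contents of a stored EML file.
SAMPLE_EML = """\
From: alice@example.com
To: bob@example.com
Subject: Quarterly report
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"

The report is attached.
"""


def parse_eml(raw):
    """Split raw EML text into header fields, body text, and attachment names."""
    msg = message_from_string(raw, policy=default)
    body_part = msg.get_body(preferencelist=("plain",))
    body = body_part.get_content() if body_part is not None else ""
    return {
        "header": {name: str(msg[name]) for name in ("From", "To", "Subject")},
        "body": body,
        "attachments": [part.get_filename() for part in msg.iter_attachments()],
    }
```

Real-time collection over IMAP or POP3 would fetch the same raw message text from the server (for example with the standard imaplib or poplib modules) before handing it to a parser like this one.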

FIG. 3 is a flowchart of a mail collection processing unit in a system according to an embodiment of the present invention.

The mail collection processing unit 110 extracts the header, attributes, and body content of the mail from the collected mail data. If the mail has an attached file in a supported format (for example, a Microsoft Office 2000 or later file, a PDF file, or a text file), the contents of the attached file are also extracted. If the attached file is a compressed file, it is decompressed and its contents are extracted.
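The handling of compressed attachments can be sketched in Python as follows. The sketch assumes a ZIP archive and extracts only plain-text members; the patent does not name a compression format, and the function names and the ".txt" filter are hypothetical.

```python
import io
import zipfile


def extract_zip_text(data: bytes) -> dict:
    """Decompress a ZIP attachment in memory and return its text members."""
    contents = {}
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for name in archive.namelist():
            # Assumption: only .txt members are treated as a supported format here.
            if name.lower().endswith(".txt"):
                contents[name] = archive.read(name).decode("utf-8", errors="replace")
    return contents


def make_sample_zip() -> bytes:
    """Build a small archive standing in for a compressed mail attachment."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("notes.txt", "account review")
        archive.writestr("image.png", b"\x89PNG")
    return buffer.getvalue()
```

Office and PDF members would need format-specific text extractors, which are outside the scope of this sketch.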

Then, when information requiring protection, such as personal information, is detected in the extracted body text or in the contents of the attached file, the mail collection processing unit 110 replaces it with the replacement characters entered by the user in advance. Here, personal information may include a resident registration number, an account number, a credit card number, and the like.
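A policy of this kind can be sketched as a list of pattern and replacement pairs. The regular expressions below (six digits, a hyphen, and seven digits for a resident registration number; four hyphen-separated groups of four digits for a card number) and the replacement strings are assumptions made for illustration, not patterns taken from the patent.

```python
import re

# Hypothetical policy: (pattern, replacement characters entered by the user).
POLICY = [
    (re.compile(r"\b\d{6}-\d{7}\b"), "[RRN]"),
    (re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{4}\b"), "[CARD]"),
]


def mask(text, policy=POLICY):
    """Replace every match of each policy pattern with its replacement."""
    for pattern, replacement in policy:
        text = pattern.sub(replacement, text)
    return text
```

The same function can be applied to the mail body and to the extracted attachment text, so that both are masked before the data is sent to the message queue.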

The mail data processed in this way is reassembled and sent in parallel to the message queue.

FIG. 4 is a block diagram of a log collection processing unit in a system according to an embodiment of the present invention.

The log collection processing unit 120 receives from the user a log collection location and either a log separator (tab, colon, semicolon, blank, or the like) or a previously registered system log format, and collects log files from the specified location.

In addition, the log collection processing unit 120 checks the collected log files line by line against the log format registered by the user, and excludes from collection any line that begins with a prefix registered for exclusion, such as a comment line.

The log data that has undergone such a process is transmitted in parallel to the message queue on a line-by-line basis.
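The line-by-line check can be illustrated with a short Python sketch. The field names, the default tab separator, and the "#" exclusion prefix are assumptions chosen for the example; in the described system they would come from the format registered by the user.

```python
def parse_log_lines(lines, separator="\t",
                    fields=("time", "level", "message"),
                    excluded_prefixes=("#",)):
    """Split each log line into named fields, skipping excluded lines."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and lines that start with a registered exclusion prefix.
        if not line or line.startswith(excluded_prefixes):
            continue
        # Split at most len(fields) - 1 times so the last field keeps any separators.
        values = line.split(separator, len(fields) - 1)
        records.append(dict(zip(fields, values)))
    return records
```

Each returned record corresponds to one line and could be sent to the message queue individually, matching the line-by-line transmission described above.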

FIG. 5 is a configuration diagram of a database collection processing unit in a system according to an embodiment of the present invention.

The database collection processing unit 130 receives database information from the user (host, port, database type, database account information, schema information, and the like). Based on this information, it checks whether a connection to the corresponding database (Oracle, MySQL, MSSQL, PostgreSQL, MariaDB, etc.) is possible. When the database can be connected, the data of the object to be collected (a specific table or view) is read row by row.

The collected database data is then transmitted to the message queue row by row.
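Row-by-row reading can be sketched with Python's built-in sqlite3 module. SQLite is used only as a self-contained stand-in for the databases named above, and the table name customers, its columns, and the function names are hypothetical.

```python
import sqlite3


def read_rows(connection, table):
    """Yield each row of a table or view as a column-name to value mapping."""
    # Table names cannot be bound as SQL parameters, so validate before formatting.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    cursor = connection.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    for row in cursor:
        yield dict(zip(columns, row))


def demo():
    """Create an in-memory database and read it back row by row."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    connection.executemany("INSERT INTO customers VALUES (?, ?)",
                           [(1, "Kim"), (2, "Lee")])
    rows = list(read_rows(connection, "customers"))
    connection.close()
    return rows
```

Because read_rows is a generator, each row can be forwarded to the message queue as soon as it is read instead of loading the whole table into memory.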

FIG. 6 is a configuration diagram of a mail conversion unit of the system according to an embodiment of the present invention.

The mail conversion unit 310 converts the mail data transmitted through the message queue into a format for the data loading unit 500 in units of a mail message (header, attribute, body). The format includes the data type to be loaded (mail), the HBase table name (mail header table, mail attribute table, mail body table), the column list (attribute value list of the mail header, mail attribute value list, and so on), the HBase row key (RowKey), and the value to be stored in HBase (mail attribute information and mail attribute value pairs). The converted data is then gathered and converted into JSON format.

The data converted into JSON format is transferred to the data loading unit 500 via the Kafka.
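The two-stage mail conversion can be sketched in Python as follows: one loader-format message is built per HBase table and then serialized to JSON. The JSON key names (type, table, rowKey, column, value) and the table names mail_header, mail_attribute, and mail_body are assumptions made for illustration; the patent lists the fields but not their exact spelling.

```python
import json

# Hypothetical mapping from mail part to HBase table name.
MAIL_TABLES = (
    ("header", "mail_header"),
    ("attribute", "mail_attribute"),
    ("body", "mail_body"),
)


def convert_mail(mail_key, mail):
    """Return one JSON message per mail part (header, attribute, body)."""
    messages = []
    for part, table in MAIL_TABLES:
        value = mail[part]
        # Primary conversion: build the loader-format record.
        record = {
            "type": "mail",
            "table": table,
            "rowKey": mail_key,
            "column": list(value),
            "value": value,
        }
        # Secondary conversion: serialize the record to JSON.
        messages.append(json.dumps(record, ensure_ascii=False))
    return messages
```

Using the same row key for all three messages keeps the header, attribute, and body rows of one mail linked across the three tables.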

FIG. 7 is a configuration diagram of a log conversion unit in a system according to an embodiment of the present invention.

The log conversion unit 320 converts the log data transmitted through the message queue into a format for the data loading unit 500 line by line. The format includes the data type to be loaded (log), the HBase table name, the column list (list of log fields), the HBase row key, and the value to be stored in HBase (pairs of a log field and its corresponding value). The converted data is then gathered and converted into JSON format.

The data converted into JSON format is transmitted to the data loading unit 500 via the Kafka.
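For a single log line, the conversion might look like the Python sketch below. The row-key scheme (collection index followed by a zero-padded line number) and the table-name scheme are hypothetical; the patent states only that a row key and a table name are part of the message.

```python
import json


def convert_log_line(record, collection_index, line_number):
    """Convert one parsed log line into a JSON loader message."""
    message = {
        "type": "log",
        # Assumed naming: one table per log collection index.
        "table": f"log_{collection_index}",
        # Assumed row key: zero-padded so keys sort in line order.
        "rowKey": f"{collection_index}-{line_number:08d}",
        "column": list(record),
        "value": record,
    }
    return json.dumps(message, ensure_ascii=False)
```

Zero-padding matters because HBase orders rows by the byte order of the row key, so unpadded line numbers such as 10 and 9 would sort out of sequence.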

FIG. 8 is a configuration diagram of a database conversion unit in a system according to an embodiment of the present invention.

The database conversion unit 330 converts the database data transmitted through the message queue into a format for the data loading unit 500 row by row. The format includes the data type to be loaded (database), the HBase table name, the column list (list of columns of the database object to be collected), the HBase row key, and the value to be stored in HBase (pairs of a database column and its value). The converted data is then gathered and converted into JSON format.

The data converted into JSON format is transmitted to the data loading unit 500 via the Kafka.

FIG. 9 is a configuration diagram of a data loading unit of a system according to an embodiment of the present invention.

The data loading unit 500 receives various types of data (mail, log, and database) converted and transmitted by the data converting unit 300 in the form of a message.

The format of the messages received by the data loading unit 500 is shown in Table 1 below.

Table 1
Field               Type    Description
type                String  Distinguishes mail, log, and database
table               String  HBase storage table name
row key             String  HBase storage key
column              String  List of columns of the table to be stored in HBase
start date time     String  Data analysis start time (yyyy-MM-dd HH:mm:ss fff)
transfer date time  String  Data message queue transfer time (yyyy-MM-dd HH:mm:ss fff)
value               Object  Mail, log, and database data to be stored in HBase

More specifically, the type indicates the type of message to be stored in the HBase table, and is one of mail, log, or database.

The table is the name of the table to load when loading data into HBase. In the case of mail, the table is divided into header, attribute, and body tables. In the case of logs and databases, the log collection index or the database collection index is added to the table name.

The row key is the row key (index value) of HBase.

The column is the list of columns of the table that will be stored in HBase. For example, the columns of the mail body table consist of a mail key (MailKey), an attachment file name (FileName), and the body content (including the attachment contents when an attachment exists).

The start date time is the data analysis start time.

The transmission date and time is the time when data was transmitted from the message queue.

The value is a data object to be stored in HBase and consists of a pair of key and value. The HBase column name is the key, and the data to be entered into the column is the value. For example, the file name, which is an attachment file column in the mail body, becomes a key, and the attachment file name becomes a value.
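The fields described above can be pictured as a small Python data class. The attribute names and the helper format_timestamp are hypothetical; only the field list and the timestamp pattern (yyyy-MM-dd HH:mm:ss fff, that is, seconds followed by three millisecond digits) come from Table 1.

```python
from dataclasses import dataclass, asdict
from datetime import datetime


def format_timestamp(moment):
    """Format a datetime as 'yyyy-MM-dd HH:mm:ss fff' (fff = milliseconds)."""
    return moment.strftime("%Y-%m-%d %H:%M:%S ") + f"{moment.microsecond // 1000:03d}"


@dataclass
class LoadMessage:
    type: str               # "mail", "log", or "database"
    table: str              # HBase storage table name
    row_key: str            # HBase row key
    column: list            # columns of the table to be stored in HBase
    start_datetime: str     # data analysis start time
    transfer_datetime: str  # message queue transfer time
    value: dict             # HBase column name -> data to store
```

For example, a mail body message whose value is {"FileName": "report.pdf"} pairs the FileName column with the attachment file name, as in the attachment example above.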

The data loading unit 500 analyzes the transmitted messages and classifies them according to data type (i.e., mail, log, or database). When the classified message type is mail, the mail data decomposed into the header, the attributes, and the body (including the attachment contents if an attachment exists) is stored in the mail header, mail attribute, and mail body tables of HBase, respectively.

If the classified message type is log, the log message (HBase table name, row key, column information (log fields), and value (pairs of a log field and its value)) is stored in the HBase table.

If the classified message type is database, the database message (HBase table name, row key, column information (columns of the table or view to be collected), and value (pairs of a column and its value)) is stored in the HBase table.
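The classification and storage step can be sketched in Python with nested dictionaries standing in for HBase tables. The JSON key names (type, table, rowKey, value) are assumptions, and no real HBase client is used here.

```python
import json

VALID_TYPES = ("mail", "log", "database")


def load_messages(json_messages):
    """Classify JSON messages by type and store each value under table and row key."""
    # Hypothetical in-memory store: type -> table name -> row key -> value.
    store = {kind: {} for kind in VALID_TYPES}
    for raw in json_messages:
        message = json.loads(raw)
        kind = message["type"]
        if kind not in store:
            raise ValueError(f"unknown message type: {kind!r}")
        rows = store[kind].setdefault(message["table"], {})
        rows[message["rowKey"]] = message["value"]
    return store
```

Rejecting unknown types early keeps malformed messages from being written to a table that no downstream analysis would read.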

10: Mail server
11: IMAP
12: POP3
13: EML file
20: Log folder
30: Database server
100: Data collection processing unit
110: Mail collection processing unit
120: Log collection processing unit
130: Database collection processing unit
200: Message queue
300: Data conversion unit
310: Mail conversion unit
320: Log conversion unit
330: Database conversion unit
400: Kafka
500: Data loading unit
600: Hadoop

Claims (11)

1. A system for collecting, processing, converting, and storing data for use as big data, comprising:
(1) a data collection processing unit for accessing a mail server, a log folder, and a database server to collect mail, log, and database data, and for processing the information contained in each collected data according to a policy;
(2) a message queue (MessageQueue) for transmitting the processed mail, log, and database data to a data conversion unit;
(3) the data conversion unit for converting the mail, log, and database data transmitted through the message queue;
(4) Kafka for transmitting the converted data to a data loading unit; and
(5) the data loading unit for storing the data in Hadoop according to the type of the data transmitted via the Kafka.
2. The system according to claim 1, wherein the data collection processing unit includes a mail collection processing unit, a log collection processing unit, and a database collection processing unit in parallel.
3. The system according to claim 2, wherein the mail collection processing unit extracts a header, an attribute, and a body of the collected mail data, and extracts the contents of an attached file when the mail data has the attached file.
4. The system according to claim 3, wherein the mail collection processing unit processes information requiring protection according to a policy entered by a user when the body of the extracted mail data or the attached file contains such information.
5. The system according to claim 2, wherein the log collection processing unit checks the collected log file list on a line-by-line basis and processes lines requiring exception processing.
6. The system according to claim 2, wherein the database collection processing unit checks and processes the data of the collected database objects row by row.
7. The system according to claim 1, wherein the data conversion unit includes a mail conversion unit, a log conversion unit, and a database conversion unit in parallel.
8. The system according to claim 7, wherein the mail conversion unit performs a primary conversion of the mail data into a format for the data loading unit in units of mail messages, and then a secondary conversion into JSON format.
9. The system according to claim 7, wherein the log conversion unit performs a primary conversion of the log data into a format for the data loading unit line by line, and then a secondary conversion into JSON format.
10. The system according to claim 7, wherein the database conversion unit performs a primary conversion of the database data into a format for the data loading unit row by row, and then a secondary conversion into JSON format.
11. The system according to claim 1, wherein the data loading unit classifies each transmitted message according to its type (mail, log, or database), and stores the message in the mail, log, or database table of Hadoop, respectively.
KR1020170093312A 2017-07-24 2017-07-24 System for Retrieving, Processing, Converting, and Saving Data for Use As Big Data KR20190011353A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020170093312A KR20190011353A (en) 2017-07-24 2017-07-24 System for Retrieving, Processing, Converting, and Saving Data for Use As Big Data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020170093312A KR20190011353A (en) 2017-07-24 2017-07-24 System for Retrieving, Processing, Converting, and Saving Data for Use As Big Data

Publications (1)

Publication Number Publication Date
KR20190011353A true KR20190011353A (en) 2019-02-07

Family

ID=65367040

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020170093312A KR20190011353A (en) 2017-07-24 2017-07-24 System for Retrieving, Processing, Converting, and Saving Data for Use As Big Data

Country Status (1)

Country Link
KR (1) KR20190011353A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111338821A (en) * 2020-02-25 2020-06-26 北京思特奇信息技术股份有限公司 Method, system and electronic equipment for realizing data load balance
US11360985B1 (en) 2020-12-28 2022-06-14 Coupang Corp. Method for loading data and electronic apparatus therefor
US11734284B2 (en) 2020-12-28 2023-08-22 Coupang Corp. Method for loading data and electronic apparatus therefor
KR20220168420A (en) * 2021-06-16 2022-12-23 인터리젠 주식회사 Real-time abnormal symptoms detection system and method by in-memory
KR102640115B1 (en) * 2023-05-19 2024-02-23 쿠팡 주식회사 Operating method for electronic apparatus for providing information and electronic apparatus supporting thereof

Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal
E601 Decision to refuse application