KR20130011207A

KR20130011207A - Method and system for automatic classification of data

Info

Publication number: KR20130011207A
Application number: KR1020110072191A
Authority: KR
Inventors: 오태영; 한학희; 김준수; 허성희; 백승희; 인소영
Original assignee: (주) 케이씨넷
Priority date: 2011-07-21
Filing date: 2011-07-21
Publication date: 2013-01-30
Also published as: KR101242141B1

Abstract

The present invention relates to a method for automatically classifying data. The method comprises: a data input step; a metadata database construction step of extracting keywords from the input data that has accumulated in a main database over a certain period, and generating and storing metadata according to the attributes of those keywords; a step of splitting the input data into word or character units to generate unit data; and a step of comparing the unit data with the metadata and sorting and grouping the keywords extracted as a result. According to the invention, metadata is automatically extracted from data and managed, and the data is grouped and output by meaning or by a specific purpose, which can improve the accuracy of data classification and reduce the time and manpower required for data analysis.

Description

Method and system for automatic classification of data

The present invention relates to a method and system that, for the purpose of data quality management, analyzes and standardizes input data and automatically classifies it on the basis of the standardized data.

In general, data quality management refers collectively to the data management and improvement activities performed to meet the expectations of users of information systems and databases inside and outside an institution or organization. Because a database is connected to other databases within the same organization, and sometimes to databases of other organizations, it is important to maintain consistency among the data of the linked organizations.

As all of an organization's work is computerized, a vast amount of data is generated. Data duplication between organizations grows, terminology becomes inconsistent, and the meaning of terms becomes unclear, all of which degrade data quality. Degraded data quality leads to data loss and rework, and the resulting losses translate directly into reduced organizational competitiveness.

As a concrete example, consider the manifest consolidation system (MFCS, ManiFestation Consolidation System), which, on behalf of operating shipping lines and/or airlines, supports collecting and submitting the cargo lists of charterers and/or forwarders loaded on every vessel or aircraft entering or leaving the country. The data quality problem in this system is as follows.

When manifests are entered into the manifest consolidation system, it is very common for a shipping line and a forwarder to enter different names for the same cargo, or to use non-standard words, abbreviations, dialect, and the like. In some cases a name different from the actual goods is entered deliberately to avoid high tariffs. Such entries do not match the standard cargo names or cargo codes stored in the system's database, which delays the submission and approval of the manifest.

In addition, shipping lines and forwarders each enter manifests for the same item in their own way, and differences in notation within the manifests, such as typos, similar words, spacing, and abbreviations, prevent identical items from being grouped. These items must then be handled manually, which wastes time and manpower.

To solve these problems, data grouping methods for manifests have been proposed. As shown in Table 1, the conventional data grouping technique groups input data (company names) when they are identical to the standard name stored in the database, and automatically classifies the grouped input data by assigning it a group ID.

Table 1
Group ID | Company name | Address
0000000006 | AGRICOLA NACIONAL S A C E L | ALMIRANTE PASTENE 300 SANTIAGO CHILE
0000000006 | AGRICOLA NACIONAL S A C E L | ALMIRANTE PASTENE 300, SANTIAGO, CHILE

However, the conventional data grouping technique cannot analyze addresses when the same company name is entered with different addresses, and it groups entries based only on the physical form (the literal string) of the company name.

For example, under the conventional grouping method, company names whose last word partly differs are given the same group ID, as with group ID 0000000030 in Table 2. As with ID 0000000073, when the company names are identical but the corresponding addresses differ, the entries should be treated as separate companies, yet both receive the same group ID, which is an error.

Table 2
Group ID | Company name | Address
0000000030 | CELULOSA ARAU Y CONTITUCION | R.U.T 93.458.000-1 FABRICA DE CELULOSA
0000000030 | CELULOSA ARAU Y CONST | ITUCION S.A. SANTIAGO CHILE
0000000073 | DISTRIBUIDORA AUTOMOTRIZ SANTIAGO S A | COQUIMBO NRO.345 SANTIAGO/CHILE
0000000073 | DISTRIBUIDORA AUTOMOTRIZ SANTIAGO S A | SANTA ROSA 455 SANTIAGO

If such problems keep occurring, company names that differ in both physical form and address end up with the same group ID, as in Table 3, which ultimately lowers the reliability of the grouping technique.

Table 3
Group ID | Company name | Address
0000000055 | CONSORCIO MADERERO S A | MONEDA 920 OF. 708 SANTIAGO
0000000055 | CONSORCIO MADERERO S A | AVENIDA ANDRES BELLO 2777 OR 2004 LAS CONDES SANTIAGO CHILE
0000000055 | CONSORCIO MADERERO SA | AVENIDA ANDRES BELLO 2777 OR 2004 LAS CONDES SANTIAGO

Also, as in Table 4, identical items are classified and grouped as different items depending on how shipping lines and forwarders write item names. When the words CUT, CUTTED, or SPOT are added to the same declared item 'CRAB FROZEN SWIMMING', or the order of the words is changed, the declared items are classified as different items according to the physical form of the declared item name and are grouped into five groups, as in Table 4.

Table 4
Group ID | Count | Declared item name
A0800001 | 1 | FROZEN SWIMMING CRAB
A0800002 | 1 | FROZEN CUT SWIMMING CRAB
A0800003 | 1 | FROZEN CUTTED SWIMMING CRAB
A0800004 | 1 | FROZEN SPOT SWIMMING CRAB CUT
A0800005 | 1 | CUT CARB SPOT FROZEN SWIMMING

With the grouping result of Table 4, separate manual processing is needed to group identical items among entries with differing declared item names, taking into account typos, similar words, dialect, abbreviations, and spacing errors. This wastes manpower and/or time.
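
The conventional behaviour described above can be illustrated with a minimal sketch. It assumes that grouping is nothing more than assigning one ID per distinct literal string; the function name `group_by_exact_string` and the `A08` ID format (modelled on Table 4) are illustrative assumptions, not taken from any real system.

```python
from itertools import count

def group_by_exact_string(names):
    """Assign one group ID per distinct literal string (assumed conventional behaviour)."""
    ids = {}
    counter = count(1)
    for name in names:
        if name not in ids:
            # Hypothetical ID format imitating the IDs shown in Table 4.
            ids[name] = f"A08{next(counter):05d}"
    return ids

declared = [
    "FROZEN SWIMMING CRAB",
    "FROZEN CUT SWIMMING CRAB",
    "FROZEN CUTTED SWIMMING CRAB",
    "FROZEN SPOT SWIMMING CRAB CUT",
    "CUT CARB SPOT FROZEN SWIMMING",
]
groups = group_by_exact_string(declared)
print(len(set(groups.values())))  # five distinct groups for what is one item
```

Because every variation in wording or word order produces a new string, each declared name lands in its own group, which is the manual-cleanup burden the text describes.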

The problem to be solved by an embodiment of the present invention is to provide a method that extracts, from input data, keywords that are meaningful as grouping criteria, refines and standardizes them according to a predetermined policy, and then performs grouping using the standardized data.

The problem to be solved by another embodiment of the present invention is to provide a method for continuously upgrading the standardization data used to refine input data.

To solve the above problems, the present invention proposes, as one embodiment, a method for automatically classifying data that includes: a data input step; a metadata database construction step of extracting keywords from the input data that has accumulated in a main database over a certain period, and generating and storing metadata according to the attributes of those keywords; a step of splitting the input data into word or character units to generate unit data; and a step of comparing the unit data with the metadata and sorting and grouping the keywords extracted as a result.

Here, the main database may be implemented as at least one of a database management system and a flat file database containing the data accumulated over a certain period.

In the metadata database construction step, keywords may be generated by splitting the accumulated data into words and using the part of speech and attributes of the valid data, that is, the words whose meaning can be determined.

The metadata database may include an attribute database, which stores metadata with typo-correct spelling, abbreviation/slang/dialect/non-standard word-standard word, and plural-singular structure information according to the attributes of the keywords, and a delimiter database, which stores the invalid data.

The grouping step may compare the unit data with the metadata to extract the keyword, that is, the refined form of the unit data.

The keywords may be sorted in ascending or descending alphabetical order.

In the grouping step, if no keyword identical to the unit data is extracted, the unit data may be automatically classified as a candidate for registration in the metadata database and registered there upon an administrator's approval.

The grouping step may also output, depending on the keyword search result, the input-error, data-duplication, and related-data-mismatch status of the input data.

According to an embodiment of the present invention, metadata is automatically extracted from data and managed, and the data is grouped and output by meaning or by a specific purpose, which can improve the accuracy of data classification and reduce the time and manpower required for data analysis.

FIG. 1 shows a system for automatic classification of data according to an embodiment of the present invention.
FIG. 2 illustrates a cargo list being automatically classified and registered in the system according to an embodiment of the present invention.
FIG. 3 shows a metadata generation process according to an embodiment of the present invention.
FIG. 4 is a flowchart of a data grouping process according to an embodiment of the present invention.

Embodiments of the present invention are described in detail below with reference to the accompanying drawings so that a person of ordinary skill in the art to which the invention pertains can readily practice it. The invention may be implemented in many different forms and is not limited to the embodiments described here. Parts unrelated to the description are omitted from the drawings for clarity, and the same reference numerals are used for the same parts throughout the specification. Throughout the specification, when a part is said to "include" a component, this means that it may further include other components, not that other components are excluded, unless specifically stated otherwise.

Throughout the specification, a "keyword" means input data that has been refined according to the standards and/or criteria stored in the main database (a standard word), and "metadata" means information about the attributes, structure, and storage location of such a "keyword".

FIG. 1 shows a system for automatic classification of data according to an embodiment of the present invention.

The automatic classification system of FIG. 1 includes an input unit (10); a data management unit (20) that extracts keywords from the data, among that entered through the input unit, that has accumulated over a certain period, and assigns and stores metadata; and a data processing unit (30) that refines, groups, and outputs the data entered through the input unit.

The input unit (10) is a means for receiving data in the form of words and/or sentences from an administrator or user of the automatic classification system.

The data management unit (20) includes a main database (21) that stores both the work data, that is, input data accumulated over a certain period, and data newly entered through the input unit (10); a metadata generator (22) that extracts keywords from the work data and generates metadata; and a metadata database (23) that stores the keywords and metadata.

The main database (hereinafter, main DB) (21) stores data arriving through the input unit (10) in real time, and separately classifies and stores as work data the input data for which a certain period has elapsed (for example, data more than one year old).
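
A minimal sketch of this separation, assuming each record carries its entry date and that "a certain period" is expressed in days (365 here, following the one-year example). The function name `split_work_data` and the record layout are illustrative assumptions.

```python
from datetime import date, timedelta

def split_work_data(records, today, min_age_days=365):
    """Partition (entry_date, text) records into work data and recent data."""
    cutoff = today - timedelta(days=min_age_days)
    # Records entered on or before the cutoff have aged enough to become work data.
    work = [text for entered, text in records if entered <= cutoff]
    recent = [text for entered, text in records if entered > cutoff]
    return work, recent

records = [
    (date(2010, 3, 1), "FROZEN SWIMMING CRAB"),
    (date(2011, 6, 30), "DW CANNED TUNA"),
]
work, recent = split_work_data(records, today=date(2011, 7, 21))
```

Only `work` would be handed to metadata generation; `recent` stays in the main DB until it ages past the cutoff.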

The main DB may be implemented as a DBMS (database management system) such as Oracle, Sybase, or SQL-Server, and/or as a flat file database, containing the work data.

The metadata generator (22) extracts keywords from the work data and generates metadata, which stores the attributes and structure of each keyword and serves as the basis for grouping.

Specifically, when work data is delivered from the main DB (21), the metadata generator (22) splits it into words and examines how frequently each word is used.
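
The word split and frequency count can be sketched as follows. This assumes whitespace-delimited words and case-insensitive counting; neither detail is specified in the text, and `word_frequencies` is a hypothetical name.

```python
from collections import Counter

def word_frequencies(work_data):
    """Split each record on whitespace and count how often each word occurs."""
    counts = Counter()
    for record in work_data:
        # Upper-casing is an assumption so that 'Crab' and 'CRAB' count together.
        counts.update(record.upper().split())
    return counts

work_data = ["FROZEN SWIMMING CRAB", "FROZEN CUT SWIMMING CRAB", "DW CANNED TUNA"]
freq = word_frequencies(work_data)
print(freq.most_common(3))
```

High-frequency words are natural candidates for review as keywords, while rare strings are more likely to be typos or codes.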

Once the work data delivered from the main DB (21) has been split into words, the metadata generator (22) classifies each word as valid data, which has a dictionary meaning, or invalid data, whose dictionary meaning cannot be determined. The valid data is extracted as the keywords of the work data.

Valid data primarily means standard words whose meaning can be determined, but non-standard forms such as typos, abbreviations, slang, and broad terms can also be accepted as valid data as long as their meaning can be determined. In the example of Table 5, TUNA is a standard word and is therefore valid data; if 'TUAN' is entered by mistake, TUAN is also accepted as valid data, provided that the pair TUNA-TUAN has been registered in advance as metadata in the main database.

Likewise, in 'DW CANNED TUNA', the first word DW is accepted as valid data because it is registered in advance in the main database as an abbreviation of the brand 'Dongwon' (동원). The second word, CANNED, is a broad term with several dictionary meanings and is accepted as valid data, and the last word, TUNA, also has a dictionary meaning and is accepted as valid data.

In the case of CRAB EXTRACT-MS1197, however, among the results of splitting into word units, '-' and '1197' are classified as a special character and a number with no dictionary meaning and are treated as invalid data; the remaining words are treated as valid data.

Table 5
Form of work data | Example
Typo | TUAN -> TUNA
Abbreviation | DW CANNED TUNA (DW -> abbreviation of the 'Dongwon' brand)
Slang | TAREFS
Broad term | CAN (ambiguous: auxiliary verb, noun, proper noun, etc.)
Data combination error | ABALONE EXTRACT & LAVER EXTRACT
Konglish | DAKGALBI (Konglish spelling of 닭갈비)
Brand | MOKWOOCHON (the 'Mokwoochon' brand)
Spacing error | SOYSAUCE -> SOY SOURCE (two words written as one)
Meaningless words | FROZEN BOILED RED
Model name | CRAB EXTRACT-MS1197 (product model name)

Following the forms of valid data shown in Table 5, the metadata generator (22) creates keywords with pair structures: a typo is paired with its correct spelling (typo-correct spelling), an abbreviation with the standard word it stands for (abbreviation-standard word), and slang with the properly written standard word (slang-standard word). It also classifies the valid data in the work data by part of speech, such as noun, adjective, and adverb.

The keywords generated by the metadata generator (22) are divided, as in Table 6, into word-pair keywords and part-of-speech keywords, and metadata holding information about each keyword's attributes and structure is generated.

Table 6
Keyword type | Metadata
W keyword | Noun forms among valid data
WA keyword | Adjective forms among valid data
WK keyword | Main keywords among the noun forms of valid data
PS keyword | Singular/plural pairs
PM keyword | Typo-correct spelling pairs
PP keyword | Synonym/dialect/slang-standard word pairs
T keyword | Compound word/abbreviation-standard word pairs
PT keyword | Data to be changed-replacement data pairs
AB keyword | Abbreviation-standard word pairs
CH keyword | Common chemical term-standard chemical term pairs
FL keyword | Foreign language information

The keyword types and their metadata are not limited to those in Table 6; more may be added according to the form and/or attributes of the keywords.
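
One way to picture a metadata record is as a small typed structure keyed by the type codes of Table 6. The sketch below is illustrative only: the `Metadata` class, its field names, and the validation rule are assumptions, while the type codes and descriptions come from Table 6.

```python
from dataclasses import dataclass
from typing import Optional

# Type codes and descriptions taken from Table 6; the dict itself is illustrative.
KEYWORD_TYPES = {
    "W": "noun form among valid data",
    "WA": "adjective form among valid data",
    "WK": "main keyword among noun forms",
    "PS": "singular/plural pair",
    "PM": "typo / correct spelling pair",
    "PP": "synonym, dialect, or slang / standard word pair",
    "T": "compound word or abbreviation / standard word pair",
    "PT": "data to be changed / replacement data pair",
    "AB": "abbreviation / standard word pair",
    "CH": "common / standard chemical term pair",
    "FL": "foreign language information",
}

@dataclass(frozen=True)
class Metadata:
    type_code: str                   # one of KEYWORD_TYPES (hypothetical field name)
    surface: str                     # the form as it appears in the data
    standard: Optional[str] = None   # standard form, for pair-type keywords only

    def __post_init__(self):
        if self.type_code not in KEYWORD_TYPES:
            raise ValueError(f"unknown keyword type: {self.type_code}")

typo = Metadata("PM", "TUAN", "TUNA")   # typo paired with its correct spelling
noun = Metadata("W", "CRAB")            # plain noun keyword, no pair
```

Since the text notes that more types may be added, keeping the codes in a lookup table rather than hard-coding them in the class makes extension a one-line change.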

Invalid data, on the other hand, takes the form of special characters, numbers, strings not found in a dictionary, dialect words, and the like whose meaning cannot be determined, as in Table 7.

Table 7
Form of invalid data | Example
Special characters | : ; ! @ # $ % ^ & * ( )
Numbers | 1 2 3 4 5 6 7 8 9
Words not in a dictionary | onitsumo
Dialect | tater (dialect word for 'potato')
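
The valid/invalid split can be sketched as below. This is a simplified illustration under stated assumptions: `DICTIONARY` and `REGISTERED_FORMS` are tiny hypothetical stand-ins for a real dictionary and for non-standard forms already registered as metadata, and the tokenizer separates letters, digits, and special characters so that 'EXTRACT-MS1197' breaks apart as in the example above.

```python
import re

# Hypothetical vocabulary for illustration only.
DICTIONARY = {"CANNED", "TUNA", "CRAB", "FROZEN"}
REGISTERED_FORMS = {"TUAN", "DW"}   # e.g. a typo and a brand abbreviation

# Runs of letters, runs of digits, or a single special character.
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]")

def split_tokens(text):
    """Split text into letter runs, digit runs, and individual special characters."""
    return TOKEN_RE.findall(text.upper())

def is_valid(token):
    """Valid data: alphabetic and either a dictionary word or a registered form."""
    return token.isalpha() and (token in DICTIONARY or token in REGISTERED_FORMS)

def classify(text):
    """Return (valid, invalid) token lists, preserving input order."""
    valid, invalid = [], []
    for tok in split_tokens(text):
        (valid if is_valid(tok) else invalid).append(tok)
    return valid, invalid

valid, invalid = classify("DW CANNED TUAN 3 ONITSUMO!")
print(valid)    # ['DW', 'CANNED', 'TUAN']
print(invalid)  # ['3', 'ONITSUMO', '!']
```

Numbers and special characters fail the alphabetic test, while an unknown string such as ONITSUMO fails the vocabulary test, matching the categories of Table 7.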

The metadata database (hereinafter, metadata DB) (23) stores the keywords generated by the metadata generator (22) and the metadata assigned to those keywords. It includes an attribute database (hereinafter, attribute DB; not shown in FIG. 1) that stores metadata about keyword attributes and a delimiter database (hereinafter, delimiter DB; not shown in FIG. 1) that stores the invalid data.

The attribute DB stores metadata with word-pair structures such as typo-correct spelling, abbreviation/slang/dialect/non-standard word-standard word, plural-singular, and compound word-standard word, according to the attributes of the keyword.

The delimiter DB stores the letters, numbers, special characters, and spaces that belong to the invalid data.
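
The two-part layout of the metadata DB could be modelled in memory as follows. This is only a sketch: the class `MetadataDB`, its method names, and the relation labels are hypothetical, and a real system would presumably back these stores with database tables.

```python
from dataclasses import dataclass, field

@dataclass
class MetadataDB:
    # Attribute DB: non-standard form -> (relation, standard form).
    attribute_db: dict = field(default_factory=dict)
    # Delimiter DB: tokens treated as invalid data.
    delimiter_db: set = field(default_factory=set)

    def add_pair(self, relation, form, standard):
        """Register a word pair such as typo-correct spelling or plural-singular."""
        self.attribute_db[form] = (relation, standard)

    def standard_form(self, token):
        """Return the standard form for a registered token, else None."""
        entry = self.attribute_db.get(token)
        return entry[1] if entry else None

    def is_delimiter(self, token):
        return token in self.delimiter_db

db = MetadataDB()
db.add_pair("typo", "SPOTED", "SPOTTED")
db.add_pair("plural", "CRABS", "CRAB")
db.delimiter_db.update({"-", "&", "3"})
```

Keeping the delimiter store separate lets the grouping stage discard invalid tokens with a cheap membership test before any pair lookup is attempted.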

Returning to FIG. 1, the data processing unit (30) generates unit data from the data entered through the input unit (10), compares the unit data with the metadata, and extracts and groups keywords according to the result.

To this end, the data processing unit (30) includes a data converter (31) that generates unit data from the input data, and a grouping processor (32) that compares the unit data with the metadata to extract keywords and classifies the input data according to the combination of those keywords.

The data converter (31) splits the data entered through the input unit (10) into word and/or character units to generate unit data and sends the unit data to the grouping processor (32).

The grouping processor (32) searches the metadata using the unit data and extracts the keyword assigned to the matching metadata.

Specifically, the grouping processor (32) examines the related information of the unit data and, when a unit consists only of digits, judges it to be invalid data. When a unit is a combination of digits and letters, the digits are judged invalid and removed, and the metadata is searched using the remaining valid data.
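
The digit handling described here can be sketched in a few lines. The helper name `strip_numeric` is hypothetical, and the sketch assumes that "digits" means ASCII decimal digits.

```python
import re

def strip_numeric(unit):
    """Return the non-numeric remainder of a unit, or None if nothing remains."""
    if re.fullmatch(r"[0-9]+", unit):
        return None                      # digits only: invalid data
    letters = re.sub(r"[0-9]+", "", unit)  # mixed unit: drop the digits
    return letters or None

units = ["FROZEN", "3", "MS1197", "CRAB"]
remaining = [u for u in map(strip_numeric, units) if u is not None]
print(remaining)  # ['FROZEN', 'MS', 'CRAB']
```

Only the units in `remaining` would then be looked up in the metadata.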

The keywords corresponding to the unit data are then extracted from the metadata and sorted in ascending and/or descending alphabetical order. A group ID is assigned to the sorted keyword combination.
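
Sorting before assigning the ID is what makes word order irrelevant: two inputs with the same keywords in a different order yield the same sorted combination and therefore the same group. A minimal sketch, assuming ascending order and an `A08`-prefixed sequential ID format modelled on the examples in this document (the class `GroupRegistry` is hypothetical):

```python
from itertools import count

class GroupRegistry:
    """Maps a sorted keyword combination to a stable group ID."""

    def __init__(self, prefix="A08", width=5):
        self._ids = {}
        self._next = count(1)
        self._prefix, self._width = prefix, width

    def group_id(self, keywords):
        combo = " ".join(sorted(keywords))   # ascending alphabetical order
        if combo not in self._ids:
            # Hypothetical ID format: prefix plus zero-padded sequence number.
            self._ids[combo] = f"{self._prefix}{next(self._next):0{self._width}d}"
        return combo, self._ids[combo]

registry = GroupRegistry()
first = registry.group_id(["FROZEN", "SPOTTED", "CRAB"])
second = registry.group_id(["CRAB", "SPOTTED", "FROZEN"])
print(first)   # ('CRAB FROZEN SPOTTED', 'A0800001')
print(second)  # same combination, same ID
```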

When no keyword corresponding to the unit data is found, the grouping processor (32) outputs this as an error or warning message. The unit data is automatically classified as a candidate for registration in the metadata DB (23) and, after confirmation and/or approval by a system administrator, can be registered there, thereby upgrading the metadata DB (23).
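
The feedback loop described above (warn, queue, approve, register) might look like the following sketch. The class `RegistrationQueue` and its methods are hypothetical, and the metadata is reduced to a plain dict from unit data to standard keyword for brevity.

```python
class RegistrationQueue:
    """Looks up keywords and queues unmatched unit data for administrator review."""

    def __init__(self, metadata):
        self.metadata = metadata   # unit data -> standard keyword
        self.pending = []          # registration candidates awaiting approval

    def lookup(self, unit):
        keyword = self.metadata.get(unit)
        if keyword is None and unit not in self.pending:
            self.pending.append(unit)
            print(f"warning: no keyword for {unit!r}; queued for registration")
        return keyword

    def approve(self, unit, standard):
        """Administrator approval: register the unit and remove it from the queue."""
        self.pending.remove(unit)
        self.metadata[unit] = standard

queue = RegistrationQueue({"CRAB": "CRAB"})
queue.lookup("KRAB")            # unmatched: warned and queued
queue.approve("KRAB", "CRAB")   # administrator maps it to the standard word
```

After approval the same input resolves immediately, which is how the metadata DB improves over time without retraining or code changes.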

FIG. 2 illustrates how, when a cargo list is entered into the manifest consolidation system (MFCS), the cargo list is automatically classified and registered in the system according to an embodiment of the present invention.

In FIG. 2, the declared item name is the data entered through the input unit (10) of the system illustrated in FIG. 1, the automatically extracted keywords are the combination of keywords matching the declared item name as found by the grouping processor (32), and the group ID is the output resulting from classifying the declared item name.

When 'FROZEN 3 SPOTED CUT CRAB' is entered as the declared item name through the input unit (10), the data converter (31) splits it into five units of data: FROZEN, 3, SPOTED, CUT, and CRAB. The grouping processor (32) classifies these five units into the valid data FROZEN, SPOTED, and CRAB and the invalid data 3 and CUT.

The valid data is then compared with the metadata in the metadata DB (23); the keywords matching the valid data are extracted, sorted in ascending alphabetical order as 'CRAB FROZEN SPOTTED', and grouped. The grouped keyword combination is given a group ID (A0800001).

FIG. 3 shows a metadata generation process according to an embodiment of the present invention.

The metadata generation process of FIG. 3 is described in detail with reference to the example of FIG. 2.

First, the declared item name 'FROZEN 3 SPOTED CUT CRAB' is input for metadata generation (S301). The declared item names used here are those entered through the input unit (10) that have been stored in the main DB (21) for at least a certain period.

The declared item name is split into words: FROZEN, 3, SPOTED, CUT, and CRAB (S302). Of these, FROZEN, SPOTED, CUT, and CRAB, whose meaning can be determined, are classified as valid data, and the number 3, whose meaning cannot be determined, is classified as invalid data.

Next, for the valid data FROZEN, SPOTED, CUT, and CRAB, each item is combined with its corresponding standard word to create word-pair keywords (S303), and the attributes and structure of the keywords are analyzed (S304). When that analysis is complete, metadata is generated and stored according to the result (S305).

For example, FROZEN and CRAB generate keywords but, being standard words, no separate metadata. SPOTED is a typo of the adjective SPOTTED, so a keyword with the typo-correct spelling pair structure SPOTED-SPOTTED is generated, along with metadata containing its part of speech and structure information.

CUT, although valid data, is a broad term with an unclear meaning; it is excluded from the keywords, and no metadata is generated for it.
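
Steps S301 to S305 can be traced with a small sketch for this one example. The lookup tables `STANDARD_WORDS`, `TYPO_CORRECTIONS`, and `BROAD_TERMS` are hypothetical stand-ins for the dictionary and the curated reference data a real system would use, and the output layout is an assumption.

```python
# Hypothetical reference data, sized for this single example.
STANDARD_WORDS = {"FROZEN": "adjective", "CRAB": "noun", "SPOTTED": "adjective"}
TYPO_CORRECTIONS = {"SPOTED": "SPOTTED"}
BROAD_TERMS = {"CUT"}

def generate_metadata(declared_name):
    """Return (keywords, metadata) for one declared item name."""
    keywords, metadata = [], []
    for word in declared_name.upper().split():        # S302: split into words
        if word.isdigit() or word in BROAD_TERMS:
            continue                                  # invalid or excluded broad term
        if word in STANDARD_WORDS:
            keywords.append(word)                     # standard word: keyword only
        elif word in TYPO_CORRECTIONS:
            standard = TYPO_CORRECTIONS[word]
            pair = f"{word}-{standard}"               # S303: word-pair keyword
            keywords.append(pair)
            metadata.append({                         # S304-S305: analyze and store
                "keyword": pair,
                "structure": "typo-correct spelling",
                "pos": STANDARD_WORDS.get(standard),
            })
    return keywords, metadata

keywords, metadata = generate_metadata("FROZEN 3 SPOTED CUT CRAB")
print(keywords)  # ['FROZEN', 'SPOTED-SPOTTED', 'CRAB']
```

As in the text, only the typo pair produces a metadata record; the standard words FROZEN and CRAB become keywords without one, and 3 and CUT produce nothing.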

FIG. 4 is a flowchart of a data grouping process according to an embodiment of the present invention.

The data grouping process of FIG. 4 is likewise described in detail with reference to the example of FIG. 2.

First, when a declared item name such as 'FROZEN 3 SPOTED CUT CRAB' is entered through the input unit (10) (S401), it is split into words to generate the unit data FROZEN, 3, SPOTED, CUT, and CRAB (S402).

The unit data (FROZEN, 3, SPOTED, CUT, CRAB) is then compared with the metadata in the metadata DB 23 (S403), and the matching keywords (standardized data) are extracted (S404).

Here, 3 is a meaningless number, that is, invalid data, so no keyword is extracted for it. CUT was excluded from keyword generation as a broad term with an unclear meaning, so no matching keyword is found for it either.

If no matching keyword can be extracted, the unit data is classified as a candidate for registration in the metadata DB 23 (S405) and is registered in the metadata DB 23 upon the administrator's approval (S406).
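The comparison and registration steps S403 to S406 might look like the following sketch. The lookup table `KEYWORD_LOOKUP`, the set `EXCLUDED`, and the function `match_units` are hypothetical. The sketch also assumes that invalid data and excluded broad terms are skipped rather than queued for registration, which the text does not state explicitly, and the word LOBSTER is an invented input used only to show the registration path.

```python
# Hypothetical contents of the metadata DB: unit data -> standardized keyword.
KEYWORD_LOOKUP = {
    "FROZEN": "FROZEN",
    "SPOTED": "SPOTTED",
    "SPOTTED": "SPOTTED",
    "CRAB": "CRAB",
}
EXCLUDED = {"CUT"}  # broad terms removed from keyword generation


def match_units(units):
    """Compare unit data with the lookup table.

    Returns (keywords, pending): the standardized keywords that matched,
    and the units with no match, which await administrator approval.
    """
    keywords = []
    pending = []
    for unit in units:
        if unit.isdigit() or unit in EXCLUDED:
            continue  # assumption: skipped, not queued for registration
        if unit in KEYWORD_LOOKUP:
            keywords.append(KEYWORD_LOOKUP[unit])
        else:
            pending.append(unit)
    return keywords, pending


print(match_units(["FROZEN", "3", "SPOTED", "CUT", "CRAB"]))
# (['FROZEN', 'SPOTTED', 'CRAB'], [])
print(match_units(["FROZEN", "LOBSTER"]))
# (['FROZEN'], ['LOBSTER'])
```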

The extracted keywords are sorted alphabetically in the order CRAB, FROZEN, SPOTTED (S407). The keywords are classified as the combination CRAB FROZEN SPOTTED, and this combination is assigned the group ID 'A0800001' and output (S408).
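The sorting and grouping of steps S407 and S408 can be illustrated with a short sketch. Sorting makes the keyword combination independent of the word order in the declared item name, so differently ordered inputs fall into the same group. The registry `GROUP_IDS` and the functions `sort_keywords` and `find_group` are hypothetical; the ID 'A0800001' is taken from the example, but the text does not describe how IDs are allocated.

```python
# Hypothetical registry of known keyword combinations and their group IDs.
GROUP_IDS = {("CRAB", "FROZEN", "SPOTTED"): "A0800001"}


def sort_keywords(keywords, descending=False):
    """Sort keywords alphabetically, ascending by default."""
    return sorted(keywords, reverse=descending)


def find_group(keywords):
    """Return the sorted keyword combination and its group ID (or None)."""
    key = tuple(sort_keywords(keywords))
    return " ".join(key), GROUP_IDS.get(key)


combination, group_id = find_group(["FROZEN", "SPOTTED", "CRAB"])
print(combination, group_id)  # CRAB FROZEN SPOTTED A0800001
```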

Although embodiments of the present invention have been described in detail above, the scope of the invention is not limited to them. Various modifications and improvements made by those skilled in the art using the basic concept of the invention defined in the following claims also fall within its scope.

10: input unit 20: data management unit 21: main database
22: metadata generating unit 23: metadata database
30: data processing unit 31: data conversion unit
32: grouping processing unit

Claims

A method for automatic classification of data, comprising:
a data input step;
a metadata database construction step of extracting keywords from data accumulated in a main database among the input data, and generating and storing metadata according to the attributes of the keywords;
a step of generating unit data by dividing the input data into word or character units; and
a grouping step of comparing the unit data with the metadata, and sorting and grouping the keywords extracted according to the result.

The method of claim 1,
wherein the main database is implemented as at least one of a database management system and a flat file database that contain data accumulated over a certain period.

The method of claim 1,
wherein, in the metadata database construction step, the accumulated data is divided into word units, and keywords are generated according to the part of speech and attributes of the valid data, that is, the words whose meaning can be identified.

The method of claim 1,
wherein the metadata database includes an attribute database that stores metadata having typo-correction, abbreviation/slang/colloquial/nonstandard-word, and plural-singular structure information according to the attributes of the keywords, and a separator database that stores the invalid data.

The method of claim 1,
wherein, in the grouping step, the unit data is compared with the metadata to extract keywords into which the unit data has been refined.

The method of claim 1,
wherein the keywords are sorted in ascending or descending alphabetical order.

The method of claim 1,
wherein, in the grouping step, if no keyword matching the unit data is extracted, the unit data is automatically classified as a candidate for registration in the metadata database and is registered in the metadata database upon approval by an administrator.

The method of claim 1,
wherein the grouping step outputs input errors, data duplication, and mismatches between related data in the input data according to the keyword search result.