KR102511139B1

KR102511139B1 - Reclassification of Electronic Records Disclosure system and method thereof

Info

Publication number: KR102511139B1
Application number: KR1020210017730A
Authority: KR
Inventors: 김진아; 정기성; 최명일
Original assignee: 대한민국
Priority date: 2021-02-08
Filing date: 2021-02-08
Publication date: 2023-03-17
Also published as: KR20220114339A

Abstract

The present invention relates to a system and method for the disclosure reclassification of electronic records, which allow a public institution to identify information subject to non-disclosure, such as personal information, in the electronic records it produces, mask that information, and disclose the records.
The disclosed system may include: a document database that stores at least one electronic record; a document conversion unit that fetches an electronic record to be disclosed from among the stored electronic records and converts it into text data; an analysis model unit that loads the converted text data as training data, performs machine learning through a KoBERT model while simultaneously performing pattern analysis, address DB lookup, and the like, and thereby identifies and reclassifies the non-disclosure portions; and a control unit that controls the reclassification and disclosure process for the electronic records.

Description

Reclassification of Electronic Records Disclosure system and method thereof

The present invention relates to a system and method for the disclosure reclassification of electronic records and, more specifically, to a system and method that classify information subject to non-disclosure, such as personal information, in electronic records managed by public institutions so that whether to disclose each record can be determined.

Public institutions produce, receive, preserve, and manage documents in the course of their work. They register these documents in an electronic records production system and electronically produce and manage the registration information specified by the head of the central records management institution. They also build and operate electronic records management systems for the safe and systematic management and use of electronic documents.

So that records can be disclosed to and used by the public, the heads of public institutions and records management institutions manage the disclosure status of records in accordance with the 『Public Records Management Act』 (hereinafter the "Public Records Act") and the 『Act on Information Disclosure by Public Institutions』 (hereinafter the "Information Disclosure Act").

Disclosure of records means that a public institution, in accordance with the Information Disclosure Act, allows information to be inspected, issues copies or reproductions of it, or provides information or records through an information and communications network as defined in Article 2, subparagraph 10 of the 『Electronic Government Act』 (hereinafter "information and communications network").

The disclosure status of a record is registered as disclosed, partially disclosed, or non-disclosed, and public institutions classify and register this status for each record item when the record is produced. When a public institution transfers records to the competent records management institution, it reclassifies their disclosure status before the transfer. Records reclassified as non-disclosed must have their disclosure status reclassified every five years, starting from the year after the year of reclassification. In principle, all non-disclosed records are disclosed once 30 years have passed since the end of the year of production.

In addition, to satisfy the public's right to know and to lay the groundwork for various records information services, public institutions periodically reclassify records under Article 35 of the Public Records Act and convert a record to disclosed status when the reason for non-disclosure no longer applies. Whether to disclose a record is decided through a comprehensive analysis that includes whether the record contains information subject to non-disclosure under any subparagraph of Article 9(1) of the Information Disclosure Act. Among these, detecting personal information under subparagraph 6 is particularly important, because it accounts for the largest share of cases and its protection is becoming increasingly important.

As the volume of electronic records produced, managed, and transferred by public institutions grows, the number of records subject to periodic disclosure reclassification is also increasing, so the disclosure reclassification of large volumes of electronic records must be managed efficiently. Manually determining whether electronic records contain information subject to non-disclosure, such as personal information, requires a great deal of time and budget. Commercially available personal information detection products, however, can detect little beyond a limited set of numeric patterns and are limited in detecting the various types of personal information contained in records, including names and addresses.

Korean Registered Patent Publication No. 10-1504577 (registration date: 2015.03.16) describes a long-term verification method and system for electronic records.

An object of the present invention, devised to solve the problems described above, is to provide a system and method that enable public institutions to analyze and extract information subject to non-disclosure, such as personal information, from electronic records.

To achieve the above object, a system for disclosure reclassification of electronic records according to an embodiment of the present invention may include: a document database that stores at least one electronic record; a document conversion unit that fetches a target electronic record from among the stored electronic records and converts it into text data; an analysis model unit that performs, on the converted text data, pattern analysis matched to each type of personal information using a hybrid method combining one or more of a machine-learning pre-trained pattern, a regular expression pattern, a regular expression pattern + DB, a word dictionary + character-count pattern, and a word dictionary, and that identifies and reclassifies the non-disclosure portions; and a control unit that controls the reclassification and disclosure process for the electronic records.

The system may further include a mask processing unit that covers and masks the non-disclosure portions when the reclassified electronic record is disclosed.
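The masking step can be illustrated with a minimal sketch. It assumes the non-disclosure portions are already available as character offsets into the converted text; the function name `mask_spans`, the `(start, end)` span format, and the `*` mask character are hypothetical choices for illustration and are not taken from the patent.

```python
def mask_spans(text, spans, mask_char="*"):
    """Replace every character inside the given (start, end) spans with mask_char.

    Spans use Python slice semantics (end is exclusive). Offsets outside the
    text are clipped, so an over-long span does not raise an error.
    """
    chars = list(text)
    for start, end in spans:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)
```

Keeping the masked text the same length as the original means that offsets computed by other detectors remain valid after masking.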

The regular expression pattern includes addresses, and the analysis model unit may treat as an exception any string that, among the addresses matching the regular expression pattern, contains a word registered in the database as subject to disclosure.
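As a rough sketch of this exception handling, the snippet below first collects address candidates with a regular expression and then drops any candidate containing a word cleared for disclosure. The address pattern is a deliberately crude assumption, the set `PUBLIC_WORDS` merely stands in for the database lookup, and the sample addresses in the usage note are made up.

```python
import re

# Hypothetical, simplified pattern: "<city/province> <district> <road name> <number>".
ADDRESS_PATTERN = re.compile(
    r"[가-힣]+(?:시|도)\s[가-힣]+(?:구|군|시)\s[가-힣0-9]+(?:로|길)\s?\d+"
)

# Stand-in for the database of words subject to disclosure (assumed contents).
PUBLIC_WORDS = {"청사로"}

def private_addresses(text, public_words=PUBLIC_WORDS):
    """Return address-like strings that contain no word cleared for disclosure."""
    result = []
    for match in ADDRESS_PATTERN.finditer(text):
        candidate = match.group()
        if any(word in candidate for word in public_words):
            continue  # exception: the address contains a disclosable word
        result.append(candidate)
    return result
```

For example, given a text containing both "대전시 서구 청사로 189" and "서울시 종로구 가나다로 12", only the second string would be returned as a non-disclosure candidate under these assumptions.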

In addition, the regular expression patterns may cover e-mail addresses, passport numbers, resident registration numbers, alien registration numbers, business registration numbers, military service numbers, driver's license numbers, card numbers, addresses, and court case numbers; the machine-learning pre-trained pattern may cover subparagraph 6 (personal information) non-disclosure words and person names; the word dictionary + character-count pattern may cover subparagraph 6 non-disclosure words and subparagraph 2 (national security, national defense, unification, and diplomatic relations) non-disclosure words; and the word dictionary pattern may cover other non-disclosure words.
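To make the regular expression category concrete, here is a minimal sketch covering three of the listed items. The patterns are simplified assumptions (no checksum validation, and the resident registration number pattern only accepts the gender digits 1 to 4); the patent does not give its actual expressions, and the names `PATTERNS` and `find_pattern_hits` are hypothetical.

```python
import re

# Simplified, assumed patterns; production rules would need stricter validation.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # 6-digit birth date, hyphen, 7 digits starting with 1-4.
    "resident_registration_number": re.compile(r"(?<!\d)\d{6}-[1-4]\d{6}(?!\d)"),
    # 3-2-5 digit grouping.
    "business_registration_number": re.compile(r"(?<!\d)\d{3}-\d{2}-\d{5}(?!\d)"),
}

def find_pattern_hits(text):
    """Return (label, start, end, matched_text) tuples ordered by position."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.start(), match.end(), match.group()))
    return sorted(hits, key=lambda hit: hit[1])
```

The digit lookarounds keep a pattern from firing inside a longer digit run, which matters when several numeric formats appear in the same record.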

The KoBERT model is a PyTorch model that uses a Korean NER (Named Entity Recognition) tagger based on BERT + CRF (Conditional Random Field), and may use the ten basic tags of the NER tagger: PER (person name), LOC (place name), ORG (organization name), POH (other), DAT (date), TIM (time), DUR (duration), MNY (currency), PNT (ratio), and NOH (other quantity expressions).

In addition, the KoBERT model may process the converted text data as natural language by using a sentence-level embedding technique that converts the text into a numerical form a computer can process and by applying training data suited to the purpose on top of a pre-trained model. Through fine-tuning on that training data, it may extract person names corresponding to non-disclosure information and, where necessary, place names, organization names, and the like, for use in determining non-disclosure information.

The hybrid model of the present invention may classify the subparagraph 6 (personal information) non-disclosure words in the converted text data by two criteria, personal information type and business classification system, may add variants of the base non-disclosure words with spaces inserted between characters, and may thereby learn to detect even words in which spacing errors occurred during text conversion.
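One way to picture the spacing augmentation is to enumerate every variant of a dictionary word with a space optionally inserted at each character boundary. This is only a sketch under that assumption; the patent does not specify how the spaced variants are generated, and `spaced_variants` is a hypothetical helper.

```python
from itertools import product

def spaced_variants(word):
    """Return all variants of word with an optional space at each character gap.

    A word of n characters has n - 1 gaps and therefore 2 ** (n - 1) variants,
    including the original unspaced word.
    """
    if not word:
        return set()
    gaps = len(word) - 1
    variants = set()
    for separators in product(("", " "), repeat=gaps):
        # Pair each character with the separator that follows it; the last
        # character is followed by nothing.
        pieces = [ch + sep for ch, sep in zip(word, separators + ("",))]
        variants.add("".join(pieces))
    return variants
```

Because the number of variants doubles with each extra character, this exhaustive form is only practical for short dictionary words; longer entries would need a cap or a spacing-tolerant matcher instead.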

In addition, exception handling may be performed when the non-disclosure portion falls under any of the following: information that may be inspected as prescribed by statute; information that does not unduly infringe on the secrecy or freedom of private life; information whose disclosure is deemed necessary for the public interest or for remedying an individual's rights; the name and position of a public official who performed the duty; or the name and occupation of an individual to whom the state or a local government has entrusted or commissioned part of its work.

Meanwhile, a method for disclosure reclassification of electronic records according to an embodiment of the present invention for achieving the above object is a method of storing at least one electronic record in a DB and processing a target electronic record among the stored records, and may include: (a) a step in which a document conversion unit fetches the target electronic record from among the one or more electronic records and converts it into text data; (b) a step in which an analysis model unit loads the converted text data as training data and performs machine learning while also performing pattern analysis in a hybrid manner that includes one or more of a regular expression pattern, a machine-learning pre-trained pattern, a word dictionary + character-count pattern, and a word dictionary pattern; and (c) a step in which a control unit identifies and reclassifies the non-disclosure portions of the target electronic record based on the results of the machine learning and pattern analysis.

In addition, in step (b), the machine learning performed by the analysis model unit may include: (b-1) building a library of the non-disclosure words to be identified in the one or more electronic records; (b-2) building, for those non-disclosure words, a per-content-file analysis framework covering subparagraph 6 (personal information) non-disclosure words, subparagraph 2 (national security, national defense, unification, and diplomatic relations) non-disclosure words, and other non-disclosure words; (b-3) receiving the non-disclosure words and performing B (begin)-I (inside)-O (outside) tagging using a CRF (Conditional Random Field) based Korean NER (Named Entity Recognition) tagger; and (b-4) recognizing and extracting, through fine-tuning on the training data based on the B-I-O tagging, named entities including not only PER (person name) but also LOC (place name) and ORG (organization name). Where necessary, POH (other), DAT (date), TIM (time), DUR (duration), MNY (currency), PNT (ratio), NOH (other quantity expressions), and the like may additionally be extracted.
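The way B-I-O tags turn into extracted entities can be sketched as follows. The sketch assumes tags are written as `B-PER`, `I-PER`, `O`, and so on, and that tokens are syllables that can be concatenated without separators; the tagger's real tag spelling and tokenization may differ, and `decode_bio` is a hypothetical helper rather than part of any library.

```python
def decode_bio(tokens, tags):
    """Group B-I-O tagged tokens into (entity_type, text) pairs.

    A "B-X" tag opens an entity of type X, following "I-X" tags extend it,
    and any other tag (including "O" or a mismatched "I-") closes it.
    """
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_tokens:
                entities.append((current_type, "".join(current_tokens)))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            if current_tokens:
                entities.append((current_type, "".join(current_tokens)))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((current_type, "".join(current_tokens)))
    return entities
```

In a reclassification setting, the `PER` entities returned here would be the primary non-disclosure candidates, with `LOC` and `ORG` consulted only where necessary.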

In addition, in step (c), when disclosing the reclassified electronic record, the control unit may have the mask processing unit cover and mask the non-disclosure portions before the record is disclosed.

In addition, in step (b), the analysis model unit may classify the subparagraph 6 (personal information) non-disclosure words in the converted text data by two criteria, personal information type and business classification system, classify the subparagraph 2 (national security, national defense, unification, and diplomatic relations) non-disclosure words in the converted text data, add variants of the base non-disclosure words with spaces inserted between characters, and thereby learn to detect even words in which spacing errors occurred during text conversion.

According to the present invention, information subject to non-disclosure can be readily identified for the disclosure reclassification of electronic records by using a "hybrid method" that implements efficient personal information identification through a combination of various methods.

In addition, reclassification for the disclosure of electronic records can be carried out using a hybrid method that applies machine learning + regular expressions + DB + word dictionary and character patterns as appropriate to each type of personal information.

In addition, when electronic records are reclassified through machine learning with the KoBERT model, a higher recognition rate can be obtained for person names.

In addition, according to the present invention, the hybrid method allows highly reliable reclassification by using the KoBERT model for person names, regular expressions for some other items, and a word dictionary for still others.

In addition, non-disclosure information criteria for the disclosure reclassification of electronic records can be derived and turned into patterns by referring to sources such as the disclosure reclassification standards of the National Archives of Korea.

In addition, according to the present invention, automation and intelligent technologies for automatically filtering non-disclosure information can be applied to various types of electronic records, such as Hangul word processor, MS Office, and PDF files.

In addition, according to the present invention, information for improving the accuracy and suitability of non-disclosure information detection can be acquired, and training data for automatic filtering can be built from the derived patterns.

In addition, according to the present invention, a prototype system can be built for future integration with records management systems (CAMS, RMS).

In addition, according to the present invention, an automatic classification test bed can be built on an artificial intelligence platform, and the text of electronic records can be classified automatically.

In addition, according to the present invention, machine learning can be performed by tagging syllable-level parts of speech based on sequence labeling, and the classification algorithm can be completed using structural SVMs.

In addition, according to the present invention, top-performing results can be produced for various natural language processing (text analysis) tasks (named entity recognition models, document classification, dialogue models, and the like) through pre-training and fine-tuning.

In addition, according to the present invention, all natural language processing (text analysis) tasks can be solved with minimal effort through a large-scale pre-trained model.

In addition, according to the present invention, the record files used for disclosure reclassification machine learning are original-text files that produce few errors when converted to text for use as training data, and a hybrid method combining machine learning, regular expressions (patterns) and DB, word dictionaries, character patterns, and the like can be applied to the non-disclosure filtering of records. When applied, this method can achieve high accuracy in extracting non-disclosure information from record items.

In addition, according to the present invention, data obtained by converting the original record directly into text data, without first converting it to PDF, is used as the basic input data, which minimizes text loss in the process of converting original records into text.

FIG. 1 is a configuration diagram schematically showing the overall configuration of a system for disclosure reclassification of electronic records according to an embodiment of the present invention.
FIG. 2 is a diagram showing the process of analyzing non-disclosure information in electronic records according to an embodiment of the present invention.
FIG. 3 is a diagram showing an example of the process by which the system for disclosure reclassification of electronic records according to an embodiment of the present invention retrieves records from the National Archives of Korea and performs disclosure reclassification.
FIG. 4 is a diagram showing the test bed construction and training process for the system for disclosure reclassification of electronic records according to an embodiment of the present invention.
FIG. 5 is a diagram showing the processing flow of a masking server when the mask processing unit according to an embodiment of the present invention is implemented as a masking server, which is a separate device.
FIG. 6 is a diagram showing an example of the masking editing screen of the masking server according to an embodiment of the present invention.
FIG. 7 is a diagram showing examples of records used as training targets for disclosure reclassification according to an embodiment of the present invention.
FIG. 8 is an operation flowchart for explaining a method for disclosure reclassification of electronic records according to an embodiment of the present invention.
FIG. 9 is a flowchart showing the process by which the analysis model unit according to an embodiment of the present invention performs machine learning and pattern analysis on text data.
FIG. 10 is a diagram showing an example of converting a PDF file created with the Hangul word processor (아래한글) into a text file.
FIG. 11 is a diagram showing an example of converting an Excel (Microsoft Excel) file into a text file.
FIG. 12 is a diagram showing an example of converting a PowerPoint (Microsoft PowerPoint) file into a text file.
FIG. 13 is a diagram showing the machine learning procedure of the analysis model unit according to an embodiment of the present invention.
FIG. 14 is a diagram showing the iterative process of upgrading the analysis model according to an embodiment of the present invention.
FIG. 15 is a diagram schematically showing the configuration of the BERT model applied to the KoBERT model of the analysis model unit according to an embodiment of the present invention.
FIG. 16 is a diagram showing the hybrid machine learning process of the analysis model unit according to an embodiment of the present invention.
FIG. 17 is a diagram schematically showing a configuration in which the system for disclosure reclassification of electronic records according to an embodiment of the present invention is linked with CAMS (RMS) and a masking server.
FIG. 18 is a diagram showing the exception handling process of the system for disclosure reclassification of electronic records according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that a person of ordinary skill in the art to which the invention pertains can readily practice them. The present invention may be implemented in many different forms and is not limited to the embodiments described here.

To describe the present invention clearly, parts unrelated to the description are omitted, and the same reference numerals are assigned to the same or similar components throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element in between. In addition, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

When a part is referred to as being "on" another part, it may be directly on the other part or there may be other parts in between. In contrast, when a part is referred to as being "directly on" another part, there are no other parts in between.

Terms such as first, second, and third are used to describe various parts, components, regions, layers, and/or sections, but these are not limited by the terms. The terms are used only to distinguish one part, component, region, layer, or section from another. Accordingly, a first part, component, region, layer, or section described below may be referred to as a second part, component, region, layer, or section without departing from the scope of the present invention.

The terminology used here is only for describing specific embodiments and is not intended to limit the present invention. The singular forms used here also include the plural forms unless the wording clearly indicates otherwise. The term "including" as used in the specification specifies particular characteristics, regions, integers, steps, operations, elements, and/or components, and does not exclude the presence or addition of other characteristics, regions, integers, steps, operations, elements, and/or components.

Terms indicating relative space, such as "below" and "above," may be used to describe more easily the relationship of one part to another part shown in the drawings. These terms are intended to cover other orientations or operations of the device in use in addition to the meaning intended in the drawings. For example, if the device in a drawing is turned over, parts described as being "below" other parts would then be described as being "above" them. The exemplary term "below" therefore covers both the upward and downward directions. The device may be rotated by 90° or by other angles, and terms indicating relative space are interpreted accordingly.

Unless otherwise defined, all terms used here, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention pertains. Terms defined in commonly used dictionaries are further interpreted as having meanings consistent with the related technical literature and the present disclosure, and are not interpreted in an idealized or overly formal sense unless so defined.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that a person of ordinary skill in the art to which the invention pertains can readily practice them. However, the present invention may be implemented in many different forms and is not limited to the embodiments described here.

FIG. 1 is a configuration diagram schematically showing the overall configuration of a system for disclosure reclassification of electronic records according to an embodiment of the present invention.

Referring to FIG. 1, a system (100) for disclosure reclassification of electronic records according to an embodiment of the present invention may include a communication unit (110), a document conversion unit (120), a control unit (130), an analysis model unit (140), a mask processing unit (150), and a database (DB) (160).

The communication unit (110) may communicate over a communication network with an external CAMS (Central Archive Management System), an electronic records server (device), or the electronic document systems of institutions at each level, and may retrieve and collect at least one electronic record from them.

The document conversion unit 120 may retrieve a target electronic record from among the one or more electronic records and convert it into text data. Although the document conversion unit 120 is illustrated here as an internal component, it is not limited thereto and may be implemented as a separate device, for example a document conversion server that is installed externally and communicates remotely over a communication network.

The control unit 130 may control the reclassification and disclosure of the one or more electronic records. The control unit 130 may also inspect the electronic records that have been reclassified with their non-disclosable portions identified, and store them in the database 160.

The analysis model unit 140 may load the converted text data as training data, perform machine learning through a Korean Bidirectional Encoder Representations from Transformers (KOBERT) model, and perform pattern analysis at the same time, thereby identifying and reclassifying the non-disclosable portions.

For the converted text data, the analysis model unit 140 may perform pattern analysis corresponding to each kind of personal information in a hybrid manner that includes at least one of a machine-learning pre-trained pattern, a regular-expression pattern, a regular-expression pattern + DB, a word dictionary + character-count pattern, and a word dictionary pattern, and may identify and reclassify the non-disclosable portions.

The regular-expression pattern covers addresses, and a detected pattern may be verified against an up-to-date address DB to confirm that it is in fact an address. Among the addresses matching the regular-expression pattern, the analysis model unit 140 may treat as exceptions any strings that contain a word registered in the database 160 as disclosable.
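The address check described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `verify_addresses`, the substring-based lookup, and the sample data are hypothetical, and the patent does not specify how the address DB is queried.

```python
def verify_addresses(candidates, address_db, public_words):
    """Filter regex-matched address candidates (illustrative sketch).

    candidates   -- strings that matched an address-like regular expression
    address_db   -- known address fragments (assumed to be a plain set here)
    public_words -- words registered as disclosable (assumed to be a plain set)
    """
    confirmed = []
    for cand in candidates:
        # Exception handling: skip strings containing a disclosable word.
        if any(word in cand for word in public_words):
            continue
        # Verification: keep only candidates backed by the address DB.
        if any(known in cand for known in address_db):
            confirmed.append(cand)
    return confirmed
```

A real deployment would presumably query a maintained road-name address database rather than doing substring matching against an in-memory set.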

The machine-learning pre-trained pattern covers personal names, and the analysis model unit 140 may use the KOBERT model to extract desired information such as personal names.

The regular-expression patterns may include e-mail address, passport number, resident registration number, alien registration number, business registration number, military service number, driver's license number, card number, address, and court case number. The machine-learning pre-trained pattern covers Item 6 (personal information) non-disclosure words (personal names); the word dictionary + character-count pattern covers Item 6 non-disclosure words (others, such as gender, certificates, and date of birth) and Item 2 (national security, national defense, unification, and diplomatic relations) non-disclosure words; and the word dictionary pattern may cover other non-disclosure words.
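As a rough illustration of the regular-expression patterns listed above, the following sketch covers three of the ten types. The concrete expressions and the name `find_patterns` are assumptions made for this example; the patent does not disclose the actual expressions it uses, and production patterns would need checksum and format validation.

```python
import re

# Hypothetical, simplified patterns; not the expressions used in the patent.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # 6 digits, hyphen, then 7 digits whose first digit is 1-4
    "resident_registration_number": re.compile(r"(?<!\d)\d{6}-[1-4]\d{6}(?!\d)"),
    # 3-2-5 digit grouping
    "business_registration_number": re.compile(r"(?<!\d)\d{3}-\d{2}-\d{5}(?!\d)"),
}

def find_patterns(text):
    """Return (label, start, end, matched_text) tuples ordered by position."""
    hits = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((label, m.start(), m.end(), m.group()))
    return sorted(hits, key=lambda h: h[1])
```

Returning character offsets rather than only the matched strings keeps the result usable by a later masking step.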

The word dictionary contains Item 6 (personal information) non-disclosure words found in the converted text data, classifies the non-disclosure words by two criteria, namely personal information type and business classification system, and may also contain variants of each base non-disclosure word with spaces inserted between its characters. As a result, the word dictionary pattern or the word dictionary + character-count pattern can detect even words in which spacing errors were introduced during text conversion.

The word dictionary also contains Item 2 (national security, national defense, unification, and diplomatic relations) non-disclosure words, and may contain variants of each base non-disclosure word with spaces inserted between its characters. As a result, the word dictionary pattern or the word dictionary + character-count pattern can detect even words in which spacing errors were introduced during text conversion.
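The idea of storing spaced variants of each base word can be sketched as below. The helper `spacing_variants` is hypothetical, and enumerating every possible spacing is an assumption; the patent only says that words with added inter-character spacing are included in the dictionary.

```python
from itertools import product

def spacing_variants(word):
    """Return every variant of `word` with a space optionally inserted
    between each pair of adjacent characters (2**(n-1) variants)."""
    if not word:
        return set()
    variants = set()
    # Each gap between characters is either closed ("") or a space (" ").
    for choice in product(("", " "), repeat=len(word) - 1):
        parts = [word[0]]
        for sep, ch in zip(choice, word[1:]):
            parts.append(sep + ch)
        variants.add("".join(parts))
    return variants
```

Because the number of variants doubles with each character, this is practical only for short dictionary words; an alternative would be a single regular expression that allows optional whitespace between characters.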

When a reclassified electronic record is disclosed, the mask processing unit 150 may mask the record so that its non-disclosable portions are hidden.
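On plain text, the masking step can be pictured with a minimal sketch such as the following. The function `mask_spans`, its character-offset interface, and the `*` mask character are assumptions for illustration; the embodiment described later masks keywords inside PDF files.

```python
def mask_spans(text, spans, mask_char="*"):
    """Replace every character inside the given (start, end) spans.

    Spans are half-open character offsets; out-of-range parts are ignored.
    """
    chars = list(text)
    for start, end in spans:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)
```

Masking character by character preserves the length of the text, so offsets computed before masking remain valid afterwards.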

The DB 160 stores at least one electronic record, the text data converted by the document conversion unit 120, and the training data or analysis results processed by the analysis model unit 140.

Meanwhile, the KOBERT model according to the present invention is a PyTorch model that uses a Korean Named Entity Recognition (NER) tagger based on BERT + CRF (Conditional Random Field), and may use the ten basic tags of the NER tagger: PER (person), LOC (location), ORG (organization), POH (other), DAT (date), TIM (time), DUR (duration), MNY (currency), PNT (ratio), and NOH (other quantity expressions).

In addition, the KOBERT model converts the text data into numerical form that a computer can process using a sentence-level embedding technique, performs natural language processing by applying task-specific training data on top of the pre-trained model, and, through fine-tuning on that training data, can recognize and extract named entities including personal names, place names, organization names, and dates.

The present invention may apply exception handling when a non-disclosable portion falls under any one of the following: information that may be inspected as prescribed by statute; information whose disclosure does not unduly infringe on the secrecy or freedom of private life; information whose disclosure is deemed necessary for the public interest or for the protection of an individual's rights; the name and position of a public official who performed official duties; or the name and occupation of an individual to whom the State or a local government has entrusted or commissioned part of its work.

Meanwhile, although the mask processing unit 150 is illustrated as an internal component of the system 100 for disclosure reclassification of electronic records according to the present invention, it is not limited thereto and may be implemented as a separate external device. For example, the mask processing unit 150 may be implemented as a masking server capable of communicating remotely with the system 100 over a communication network. Because the masking server performs the same function as the mask processing unit 150, it is given the same reference numeral and is referred to as the masking server 150.

FIG. 2 is a diagram showing the process of analyzing the non-disclosable information in electronic records according to an embodiment of the present invention.

Referring to FIG. 2, the system 100 for disclosure reclassification of electronic records according to an embodiment of the present invention may analyze the non-disclosable information in electronic records by type of non-disclosable information, with reference to the National Archives' disclosure reclassification standards and similar materials.

For example, the system 100 may analyze the data by producing institution, record title, record type, target record, non-disclosable information, disclosure value, and item number.

The non-disclosable information filtering and masking system 100 for disclosure reclassification of electronic records according to the present invention may carry out disclosure reclassification of electronic records through the following steps.

1) Standards preparation: the types of information to be extracted, the relevant regulations, the target information, and so on may be compiled into a list.

2) Basic analysis: basic analyses may be performed on the extracted information, such as frequency analysis by type and by relevant regulation, or morphological analysis of the target information for each relevant regulation.

3) Automation and intelligence review: based on the basic analysis, patterns may be explored to determine whether a case can be handled by general rules or requires the application of artificial intelligence.

4) Labeling: labeling may be performed to create the training data set.

Meanwhile, the system 100 for disclosure reclassification of electronic records according to an embodiment of the present invention may derive and pattern the criteria for non-disclosable information (centered on Article 9(1)6 of the Information Disclosure Act) for disclosure reclassification of electronic records.

That is, the system 100 presents and patterns applicable criteria by analyzing the National Archives' standards, related materials, and cases, and may learn the forms and patterns of the criteria for non-disclosable information as well as the scope, rules, and patterns of non-disclosable information in electronic documents and their attachments.

In addition, the system 100 may perform bidirectional pre-training through unsupervised learning on a large volume of text, and may carry out analyses for various tasks through a fine-tuning process. That is, various natural language processing (text analysis) tasks, such as named entity recognition, document classification, and dialogue modeling, can achieve high-performing results through pre-training followed by fine-tuning, and the resulting large pre-trained model allows natural language processing (text analysis) tasks to be addressed with minimal effort.

In addition, the system 100 may implement, through a Masked Language Model (MLM), a language model that selects 15% of the words in the text and predicts them (that is, a model that predicts the best-fitting word). Of the selected 15%, 80% are replaced with a mask token, 10% are replaced with a random different word, and 10% are left unchanged.
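The 15% selection and the 80/10/10 split can be illustrated with a small sketch. The function name `mlm_corrupt`, the `[MASK]` token string, and the toy vocabulary are assumptions for the example; real pre-training operates on subword token IDs inside a training framework.

```python
import random

def mlm_corrupt(tokens, vocab, rng, select_prob=0.15, mask_token="[MASK]"):
    """Apply MLM-style corruption and return (corrupted_tokens, labels).

    labels[i] holds the original token where a prediction is required,
    and None elsewhere.
    """
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                           # not selected (about 85%)
        labels[i] = tok                        # model must predict this token
        r = rng.random()
        if r < 0.8:
            corrupted[i] = mask_token          # 80%: replace with mask token
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)   # 10%: replace with random word
        # remaining 10%: keep the original token unchanged
    return corrupted, labels
```

For example, `mlm_corrupt(tokens, vocab, random.Random(0))` gives a reproducible corruption; the random replacement may occasionally pick the original word, which this sketch does not prevent.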

In addition, the system 100 may implement a model that predicts the next sentence through Next Sentence Prediction (NSP), in which 50% of the training pairs use the sentence that actually follows and the remaining 50% use a sentence that does not follow.
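Building the 50/50 training pairs for NSP might look like the following sketch. The name `make_nsp_pairs` and the way negative sentences are drawn from the same document are assumptions for illustration; BERT-style pre-training usually draws negatives from a different document.

```python
import random

def make_nsp_pairs(sentences, rng):
    """Return (sentence_a, sentence_b, is_next) triples.

    About half of the pairs use the true next sentence; the rest use a
    sentence that is neither sentence_a nor its true successor.
    """
    if len(sentences) < 3:
        raise ValueError("need at least three sentences to draw negatives")
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), False))
    return pairs
```

The 50% ratio holds only in expectation; a fixed seed such as `random.Random(1)` makes a given split reproducible.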

FIG. 3 is a diagram showing an example of the process by which the system for disclosure reclassification of electronic records according to an embodiment of the present invention retrieves records from the National Archives and performs disclosure reclassification.

Referring to FIG. 3, the system 100 may provide a disclosure reclassification function through a process of selecting the records to be checked for personal information, designating a processing department, and reflecting the review result once the processing department's opinion has been registered.

First, ① when records are selected and the 'Search' button is pressed, a list of the records in that range is shown on screen; ② a screen is provided for selecting the range of records in which personal information is to be detected; and once the detection conditions are selected there, ③ a screen showing the personal information detection results may be provided.

In addition, ④ clicking a record displays it on screen, and ⑤ clicking the title of a record displays the list of attached electronic files, together with the number of personal information items of each type that they contain.

In addition, ⑥ clicking an electronic file name displays the contents of that document on screen; ⑦ a record in which personal information was detected can be selected as a target for disclosure reclassification and the reason registered; and ⑧ to detect non-disclosable records under grounds other than Item 6, the non-disclosure keywords to be detected can be compiled on the basis of the 'Disclosure Reclassification Standards' and entered on the keyword management screen, so that the multi-keyword non-disclosure search provided by the solution can detect whether a record contains them.

In addition, ⑨ a processing department can be designated for inspection and opinion registration, through which the non-disclosure keyword detection results can be inspected; and ⑩ the opinions of specialist staff can be registered in the same way as for records in which personal information was detected, after which the result can be finalized following review by the deliberation committee.

In addition, ⑪ the item number of the non-disclosure ground can be selected, and ⑫ once the deliberation committee confirms its decision, the DB can be updated by clicking the 'Reflect Result' button.

FIG. 4 is a diagram showing the test bed construction and training process for the system for disclosure reclassification of electronic records according to an embodiment of the present invention.

Referring to FIG. 4, the system 100 first analyzes the target records and the standards (S410); that is, it analyzes the disclosure reclassification standards and the records subject to disclosure reclassification. At this stage, the classification of the target records can be defined from the standards.

Next, the target records are converted to text (S420); the records subject to disclosure reclassification are converted to text and stored. That is, the electronic records received from the National Archives are converted into text files so that they take the form of text data that can be fed to the learner/analyzer. Records in various formats may be converted to text and stored at this stage.

Next, the target records are learned (S430); training is performed through the learner, along with correction work.

Then, an analysis model (pattern) is generated (S440); the analysis model (pattern) is derived through training. That is, the converted records are loaded into the learner/analyzer for machine learning, and the analysis model (pattern) is derived from that training. The derivation and correction of classification criteria through the learner may be repeated, and the derived analysis model (pattern) may be verified.

Next, the target records are classified with the analysis model (S450). This is the step in which disclosure reclassification of the target records is carried out.

Next, the classification result is stored in the DB (S460), together with the ground for non-disclosure.

That is, the finally produced analysis model is used to classify the given records, and the result is stored in the DB.

Next, the classification result is inspected (S470). That is, the reclassified records are inspected to check whether the non-disclosure / partial disclosure / disclosure classification and the stated ground are correct.

When the inspection is complete (S480-Yes), the inspection result is stored in the DB (S482).

If the inspection is not complete (S480-No), the result is corrected (S490) and the inspection of the classification result (S470) is repeated.
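The inspect-and-correct loop of steps S470 to S490 can be sketched as follows. The function `review_until_accepted`, its callback interface, and the `max_rounds` safety limit are assumptions for illustration; in the described system the inspection is carried out by human reviewers through the user interface.

```python
def review_until_accepted(result, inspect, correct, max_rounds=10):
    """Repeat inspection (S470) and correction (S490) until no issues remain.

    inspect(result) -> list of issues (empty list means inspection passed)
    correct(result, issues) -> corrected result
    """
    for _ in range(max_rounds):
        issues = inspect(result)              # S470: inspect the classification
        if not issues:                        # S480-Yes: inspection complete
            return result                     # caller stores it in the DB (S482)
        result = correct(result, issues)      # S480-No -> S490: fix the result
    raise RuntimeError("inspection did not converge within max_rounds")
```

The round limit is not part of the patent's flow; it is added here only so that the sketch cannot loop forever.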

The system 100 for disclosure reclassification of electronic records according to the present invention may then take the results from the DB and, when disclosing a record, mask the non-disclosure keywords before disclosure.

FIG. 5 is a processing flow diagram of the masking server for the case where the mask processing unit according to an embodiment of the present invention is implemented as a separate device, namely a masking server, and FIG. 6 is a diagram showing an example of the masking edit screen of the masking server according to an embodiment of the present invention.

Referring to FIGS. 5 and 6, in the masking server 150 according to an embodiment of the present invention, ① the names of the records classified by the learner/analyzer and the resulting non-disclosure keywords are registered in the DB; ② the DB daemon of the PDF conversion server reads the file names and non-disclosure keywords registered in the DB and passes them to the PDF conversion module; and ③ the masking module receives the DB values passed to the PDF conversion module, searches the PDF file for the non-disclosure keywords, and highlights them.

Next, ④ the highlighted masking-target keywords are inspected to confirm that they were detected correctly, and ⑤ incorrectly detected keywords are corrected with the PDF Editor, as shown in FIG. 6, and passed to the masking module.

Next, ⑥ the non-disclosure keyword information of the finally corrected record (the pages containing non-disclosure keywords and the list of occurrences) is stored in the DB; ⑦ the masking module masks the non-disclosure keywords and saves the result as a file in the data store; and ⑧ a viewing service for the masked record is provided to the person who requested access.

FIG. 7 is a diagram showing examples of records used as training targets for disclosure reclassification according to an embodiment of the present invention.

Referring to FIG. 7, the system 100 may analyze data by type of non-disclosable information, with reference to the National Archives' disclosure reclassification standards and similar materials.

In addition, the system 100 may compare the data for each personal information type against the contents of the records. The personal information types may include general information (name, resident registration number, address, etc.), family information, education and training information, military service information, real estate information, income information, other revenue information, credit information, employment information, legal information, medical information, organizational information, communication information, location information, physical information, and habit and hobby information.

In addition, the system 100 may carry out machine learning using the National Archives' disclosure reclassification standards, digitized records that have undergone disclosure reclassification (Item 6), and electronic records subject to disclosure reclassification, and may derive a pattern model capable of classifying records.

Here, the types of non-disclosable information include the Item 6 record types by business classification system (public records management standard) shown in Tables 1 to 4 below. The business classification system and the non-disclosable information may be modified, deleted, or updated according to the national records management standards or the operational needs of the National Archives.

In addition, Item 6 records share the common personal information types shown in Tables 5 to 7 below, which may be modified, deleted, or updated according to the national records management standards or the operational needs of the National Archives.

In addition, the non-disclosable information includes the Item 2 types (national security, national defense, unification, and diplomatic relations) shown in Table 8 below, which may be modified, deleted, or updated according to the national records management standards or the operational needs of the National Archives.

FIG. 8 is an operational flowchart illustrating a method for disclosure reclassification of electronic records according to an embodiment of the present invention.

Referring to FIG. 8, in the system 100 according to an embodiment of the present invention, the document conversion unit 120 retrieves a target electronic record from among the one or more electronic records and converts it into text data, as shown in FIGS. 10 to 12 (S810).

Here, FIG. 10 shows an example of converting a PDF file created with the Hangul (HWP) word processor into a text file, FIG. 11 shows an example of converting a Microsoft Excel file into a text file, and FIG. 12 shows an example of converting a Microsoft PowerPoint file into a text file.

Next, the analysis model unit 140 loads the converted text data as training data, performs machine learning through the KOBERT model, and at the same time performs pattern analysis in a hybrid manner (S820).

That is, the analysis model unit 140 may load the converted text data as training data and perform machine learning while simultaneously performing pattern analysis in a hybrid manner that includes at least two of the regular-expression pattern, the machine-learning pre-trained pattern, the word dictionary + character-count pattern, and the word dictionary pattern.
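When several detectors run over the same text, their results need to be combined before masking. One plausible way to do this, sketched below, is to merge overlapping character spans; the function `merge_spans` and the span-based representation are assumptions, since the patent does not describe how the results of the hybrid methods are merged.

```python
def merge_spans(spans):
    """Merge overlapping or touching (start, end) spans into a sorted list."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

For instance, spans reported by a regular-expression detector and by a dictionary detector can be concatenated into one list and passed through this function so that each character is masked only once.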

At this stage, the KOBERT model may be trained to classify the non-disclosure words in the converted text data by two criteria, namely by personal information type for Item 6 (personal information) and by business classification system for Item 2 (national security, national defense, unification, and diplomatic relations); to classify Item 2 or Item 6 non-disclosure words in the converted text data; to add inter-character spacing to the base non-disclosure words; and to detect even words in which spacing errors were introduced during text conversion.

An analysis of the Item 6 and Item 2 non-disclosure words according to an embodiment of the present invention shows that, given the many kinds, characteristics, and patterns of non-disclosable information, it is preferable to combine pattern analysis with machine learning.

Next, based on the results of the machine learning and pattern analysis, the control unit 130 identifies the non-disclosable portions of the target electronic record and reclassifies it (S830).

Then, when the reclassified electronic record is disclosed, the control unit 130 has the mask processing unit 150 mask the non-disclosable portions and discloses the record after masking (S840).

FIG. 9 is a flowchart showing the process of machine learning and pattern analysis of text data in the analysis model unit according to an embodiment of the present invention.

Referring to FIG. 9, the analysis model unit 140 according to an embodiment of the present invention may build (configure) a library of the non-disclosure words to be identified in the one or more electronic records (S910).

Next, the analysis model unit 140 may build (generate) a per-content-file analysis framework for the non-disclosure words, covering Item 6 (personal information) non-disclosure words, Item 2 (national security, national defense, unification, and diplomatic relations) non-disclosure words, and other non-disclosure words (S920).

Next, the KOBERT model in the analysis model unit 140 receives the non-disclosure words and performs B (begin) - I (inside) - O (outside) tagging using the CRF (Conditional Random Field)-based Korean Named Entity Recognition (NER) tagger (S930).

이때, 코버트 모델은, CRF 기반의 한국어 NER 태거의 PER(인명), LOC(지명), ORG(기관명), POH(기타), DAT(날짜), TIM(시간), DUR(기간), MNY(통화), PNT(비율), NOH(기타 수량표현)를 포함하는 10개 기본 구분 태그를 사용하여, 학습 데이터에 대한 B-I-O 태깅을 수행할 수 있다.At this time, the Covert model is PER (person name), LOC (place name), ORG (organization name), POH (other), DAT (date), TIM (time), DUR (period), MNY of the CRF-based Korean NER tagger B-I-O tagging of learning data can be performed using 10 basic classification tags including (currency), PNT (ratio), and NOH (other quantity expressions).

이어, 분석 모델부(140)에서 수정된 코버트 모델은, B-I-O 태깅에 기반하여 인명에 대한 개체명을 인식하여 추출한다(S940).Subsequently, the Covert model modified in the analysis model unit 140 recognizes and extracts the entity name of a person based on the B-I-O tagging (S940).
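
The extraction step S940 can be illustrated with a minimal sketch that collects person-name entities from a B-I-O tagged token sequence. The tag strings in the form `B-PER` / `I-PER`, the function name `extract_entities`, and the sample tokens are assumptions made for illustration; the patent does not specify the tagger's output format.

```python
from typing import List


def extract_entities(tokens: List[str], tags: List[str], wanted: str = "PER") -> List[str]:
    """Collect spans that start with B-<wanted> and continue with I-<wanted>."""
    entities: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == f"B-{wanted}":
            # A new entity begins; close any entity still open.
            if current:
                entities.append("".join(current))
            current = [token]
        elif tag == f"I-{wanted}" and current:
            # Continuation of the entity that is currently open.
            current.append(token)
        else:
            # Any other tag ends the open entity.
            if current:
                entities.append("".join(current))
            current = []
    if current:
        entities.append("".join(current))
    return entities


# Hypothetical tagger output for a short sentence.
tokens = ["홍", "길동", "은", "서울", "에", "산다"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "O"]
person_names = extract_entities(tokens, tags)         # person names only
place_names = extract_entities(tokens, tags, "LOC")   # same routine, other tag
```

Tokens are joined without spaces on the assumption that they are sub-word pieces of one name; a different tokenizer would need a different join rule.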

Through machine learning, the analysis model unit 140 according to the present invention can recognize a variety of entities directly from text, such as person names, gender, places, and times. Based on an analysis of the characteristics of No. 6 and No. 2 non-public information and on sample tests of text files converted from original files or PDF files, person-name-centered data was extracted; where necessary, further entity types can be added and used to identify non-public information.

In addition, for non-public information that has a regular form, the analysis model unit 140 can achieve a higher recognition rate than machine learning by extracting or searching for it through regular-expression pattern recognition.

The analysis model unit 140 may also define regular-expression patterns. Specifically, it may define a total of ten regular-expression patterns (e-mail address, passport number, resident registration number, alien registration number, business registration number, military service number, driver's license number, card number, address, and court case number) and may add patterns of other types.
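
As a rough illustration of regular-expression pattern recognition, the sketch below defines three of the ten pattern types and scans a text for them. The expressions themselves are simplified assumptions (the patent does not disclose its actual patterns), and the names `PATTERNS` and `find_private` are hypothetical.

```python
import re

# Simplified, illustrative expressions; not the patterns used in the patent.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # Resident registration number written as 6 digits, hyphen, 7 digits.
    "resident_registration_number": re.compile(r"(?<!\d)\d{6}-\d{7}(?!\d)"),
    # Business registration number written as 3-2-5 digits.
    "business_registration_number": re.compile(r"(?<!\d)\d{3}-\d{2}-\d{5}(?!\d)"),
}


def find_private(text):
    """Return (pattern name, matched text, start, end) tuples in text order."""
    hits = []
    for name, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((name, m.group(), m.start(), m.end()))
    return sorted(hits, key=lambda h: h[2])


sample = "Contact hong@example.go.kr, RRN 900101-1234567."
hits = find_private(sample)
```

Further pattern types, such as passport or driver's license numbers, could be added to the same dictionary once their formats are defined.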

In addition, for places, which can also be recognized as named entities, the analysis model unit 140 can analyze or recognize addresses in greater detail than machine learning alone by using an address regular-expression pattern together with an address DB.

The analysis model unit 140 may also separately manage words to be excluded from the address regular-expression pattern, so that strings wrongly recognized as addresses are filtered out.
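
The exclusion-word idea can be sketched as follows. Both the simplified road-name pattern and the exclusion list are assumptions chosen for illustration: ordinary Korean words such as 참고로 ("for reference") end in 로 and would otherwise be matched as a road name followed by a number.

```python
import re

# Hypothetical, simplified road-name pattern: <name>로 or <name>길 plus a number.
ROAD_ADDRESS = re.compile(r"[가-힣0-9]+(?:로|길)\s?\d+")

# Hypothetical exclusion list of ordinary words that end in 로.
EXCLUDED_WORDS = {"참고로", "별도로"}


def find_addresses(text, excluded=EXCLUDED_WORDS):
    """Return address-like matches, dropping those that contain an excluded word."""
    results = []
    for m in ROAD_ADDRESS.finditer(text):
        if any(word in m.group() for word in excluded):
            continue  # false positive such as "참고로 3"
        results.append(m.group())
    return results


text = "참고로 3건을 세종대로 209 청사로 보냈다."
addresses = find_addresses(text)
unfiltered = find_addresses(text, excluded=set())
```

A lookup against an address DB, as the description mentions, would be a further check on the surviving matches; it is not shown here.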

In addition, the analysis model unit 140 may apply a word dictionary and character-count patterns to No. 6 and No. 2 non-public information.

That is, the system 100 for open reclassification of electronic records according to an embodiment of the present invention may build a dictionary of non-public words under No. 6 (personal information). No. 6 non-public words are personal identification information and personal supporting records. The words are classified by two criteria, personal information type and business classification scheme, and variants of each base non-public word with spaces inserted between characters are added to the dictionary, so that words whose spacing was corrupted during text conversion can still be detected.
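
One way to detect dictionary words whose spacing was corrupted is sketched below. Note that this swaps in a different technique from the one described: instead of storing spaced variants in the dictionary, it compiles each base word into a regular expression that tolerates whitespace between any two characters. The dictionary entries and function names are hypothetical.

```python
import re


def spacing_tolerant_pattern(word):
    """Regex that matches the word with optional whitespace between characters."""
    return re.compile(r"\s*".join(re.escape(ch) for ch in word))


# Hypothetical No. 6 dictionary entries.
dictionary = ["주민등록번호", "생년월일"]
patterns = {w: spacing_tolerant_pattern(w) for w in dictionary}


def find_dictionary_words(text):
    """Return (base word, text as it actually appears) pairs."""
    found = []
    for word, pattern in patterns.items():
        for m in pattern.finditer(text):
            found.append((word, m.group()))
    return found


found = find_dictionary_words("주민 등록번호와 생 년월일을 기재")
```

The trade-off is that this matches every possible spacing, whereas an explicit list of spaced variants limits matches to the spacings actually observed.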

In addition, for No. 6 non-public words that have an even number of characters and contain spaces, the system 100 for open reclassification of electronic records according to the present invention may apply the following character-count patterns, and further patterns may be added as needed.

o 2 characters + space + 2 characters

o 2 characters + space + 2 characters + space + 2 characters

o 2 characters + space + 4 characters

o 3 characters + space + 3 characters

o 4 characters + space + 2 characters

o 2 characters + space + 2 characters + space + 2 characters + space + 2 characters

o 3 characters + space + 2 characters + space + 3 characters

o 4 characters + space + 4 characters

o 3 characters + space + 5 characters

o 3 characters + space + 1 character + space + 4 characters

o 5 characters + space + 3 characters

o 5 characters + space + 5 characters

o 7 characters + space + 3 characters
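
A minimal sketch of how the even-length character-count patterns listed above might be used to generate spaced dictionary variants. It assumes that each pattern describes where spaces fall inside a base dictionary word; the patent does not spell out this interpretation, and the function names and sample words are hypothetical.

```python
# The thirteen even-length patterns, written as group sizes.
EVEN_PATTERNS = [
    (2, 2), (2, 2, 2), (2, 4), (3, 3), (4, 2), (2, 2, 2, 2), (3, 2, 3),
    (4, 4), (3, 5), (3, 1, 4), (5, 3), (5, 5), (7, 3),
]


def apply_count_pattern(word, counts):
    """Split the word into groups of the given sizes, joined by single spaces."""
    if sum(counts) != len(word):
        return None  # the pattern does not fit a word of this length
    parts, pos = [], 0
    for n in counts:
        parts.append(word[pos:pos + n])
        pos += n
    return " ".join(parts)


def spaced_variants(word, patterns=EVEN_PATTERNS):
    """Return every spaced variant of the word that some pattern produces."""
    return [v for p in patterns if (v := apply_count_pattern(word, p))]


six_char = spaced_variants("주민등록번호")
four_char = spaced_variants("생년월일")
```

The odd-length patterns could be handled by the same routine with a second pattern list.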

In addition, for No. 2 non-public words that have an odd number of characters and contain spaces, the system 100 for open reclassification of electronic records according to the present invention may apply the following character-count patterns, and further patterns may be added as needed.

o 1 character + space + 2 characters

o 2 characters + space + 3 characters

o 3 characters + space + 2 characters

o 2 characters + space + 2 characters + space + 3 characters

o 3 characters + space + 2 characters + space + 2 characters

o 3 characters + space + 4 characters

o 4 characters + space + 3 characters

o 5 characters + space + 2 characters

o 2 characters + space + 2 characters + space + 2 characters + space + 3 characters

o 4 characters + space + 2 characters + space + 3 characters

o 5 characters + space + 4 characters

o 6 characters + space + 3 characters

Meanwhile, the system 100 for open reclassification of electronic records according to an embodiment of the present invention may build a dictionary of non-public words under No. 2 (national security, national defense, unification, diplomatic relations). There are 86 No. 2 non-public words in total. Variants of each base non-public word with spaces inserted between characters are added to the dictionary, so that words whose spacing was corrupted during text conversion can still be detected.

In addition, for No. 2 non-public words that have an even number of characters and contain spaces, the system 100 for open reclassification of electronic records according to the present invention may apply the following character-count patterns.

o 2 characters + space + 2 characters

o 2 characters + space + 4 characters

o 3 characters + space + 3 characters

o 4 characters + space + 2 characters

o 4 characters + space + 4 characters

o 3 characters + space + 5 characters

o 5 characters + space + 3 characters

o 5 characters + space + 5 characters

o 7 characters + space + 3 characters

o 1 character + space + 2 characters

o 2 characters + space + 3 characters

o 3 characters + space + 2 characters

o 3 characters + space + 4 characters

o 4 characters + space + 3 characters

o 5 characters + space + 2 characters

o 5 characters + space + 4 characters

o 6 characters + space + 3 characters

FIG. 13 is a diagram illustrating the machine learning procedure of the analysis model unit according to an embodiment of the present invention.

Referring to FIG. 13, the analysis model unit 140 according to an embodiment of the present invention retrieves pre-training data and electronic-record training data from the training target DB and trains the model with a PyTorch learner, performing morphological analysis, fine-tuning, and named entity recognition.

Next, the analysis model unit 140 stores the final model and the documents to be reclassified in the reclassification target DB, loads them into the PyTorch learner, and performs reclassification analysis (data preprocessing, morphological analysis, model application, and result output) to extract the required information.

In addition, the analysis model unit 140 may apply exception handling in the following cases pursuant to Article 9(1)6 of the Information Disclosure Act: information that may be inspected as prescribed by statute; information that does not unjustly infringe on the secrecy or freedom of private life; information whose disclosure is deemed necessary for the public interest or for the protection of an individual's rights; the names and positions of public officials who performed the duties; and the names and occupations of individuals to whom the State or a local government has entrusted or commissioned part of its duties.

Meanwhile, the analysis model unit 140 according to the present invention may repeatedly upgrade the analysis model, as shown in FIG. 14. FIG. 14 is a diagram illustrating the iterative upgrade process of the analysis model according to an embodiment of the present invention.

In the data analysis stage, the analysis model unit 140 according to the present invention may analyze text documents and set the non-disclosure criteria.

In addition, through the analysis model, the analysis model unit 140 may analyze patterns for each type of non-public word, update the address DB, and upgrade the word dictionary, the analysis model, and the pattern analysis.

The analysis model unit 140 may also analyze the machine learning model, assess its performance, and perform repeated training.

The analysis model unit 140 may further analyze the needs of the National Archives of Korea through a user interface and implement application services that reflect user requirements.

Meanwhile, the analysis model unit 140 according to the present invention may select the KoBERT model, which applies the BERT model shown in FIG. 15, as its machine-learning pre-trained model. FIG. 15 schematically shows the structure of the BERT model applied to the KoBERT model of the analysis model unit according to an embodiment of the present invention.

Having selected the KoBERT model as the machine-learning pre-trained model, the analysis model unit 140 can extract person names from national records based on the pre-trained model.

The analysis model unit 140 may also store the KoBERT pre-training data and the selected best machine learning model separately in the pytorch-bert-crf-ner folder.

KoBERT is a PyTorch model that uses a Korean NER tagger based on BERT + CRF. The NER tagger uses ten basic tags: PER (person name), LOC (place name), ORG (organization name), POH (other), DAT (date), TIM (time), DUR (duration), MNY (currency), PNT (ratio), and NOH (other quantity expression).
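
To make the tag scheme concrete, the sketch below expands the ten basic tags into a B-I-O label set and assigns each label an integer index, as a token-classification model typically requires. The label string format (`B-PER`, `I-PER`, `O`), the index order, and the function name are assumptions; the tagger's actual label vocabulary is not given in the text.

```python
# The ten basic tags named in the description.
BASE_TAGS = ["PER", "LOC", "ORG", "POH", "DAT", "TIM", "DUR", "MNY", "PNT", "NOH"]


def build_label_map(base_tags):
    """Map 'O' plus a B- and an I- label per tag to consecutive integer ids."""
    labels = ["O"]
    for tag in base_tags:
        labels += [f"B-{tag}", f"I-{tag}"]
    return {label: i for i, label in enumerate(labels)}


label2id = build_label_map(BASE_TAGS)
id2label = {i: label for label, i in label2id.items()}
num_labels = len(label2id)  # 1 outside label + 2 labels for each of 10 tags = 21
```

With ten tags, the output layer of such a model would therefore have 21 classes under this scheme.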

The KoBERT model uses a sentence-level embedding technique (converting text into a numerical form that a computer can process) and can solve natural language processing problems by applying task-specific training data on top of a pre-trained model.

Starting from the pre-trained model, the KoBERT model can be built into a final model (learner) through post-processing (fine-tuning) with the training data.

The KoBERT model can recognize entities such as person names, place names, organization names, and dates. Here, person-name-centered modeling, as required for No. 6 non-public words, is implemented to extract person names; where necessary, other entities such as place names and organization names can additionally be recognized and used to extract non-public information.

FIG. 16 is a diagram illustrating the hybrid machine learning process of the analysis model unit according to an embodiment of the present invention.

Referring to FIG. 16, the analysis model unit 140 according to the present invention may build a Python library or an analysis framework (content by folder and by file) for No. 6 (personal information) non-public words, No. 6 (person names, etc.) non-public words, No. 2 (national defense) non-public words, and other (occupation, rank, etc.) non-public words.

In addition, the analysis model unit 140 may perform pattern recognition and machine learning as follows.

No. 6 non-public words (regular-expression patterns) - e-mail address, passport number, resident registration number, etc.

No. 6 non-public words (other) - gender, certificates, date of birth, etc.

No. 2 / other non-public words - North Korea, defense industry, military plans, civilian military employees, ministers, etc.

For No. 6 non-public words (person names), the analysis model unit 140 may use the KoBERT model to carry out the sequence CRF (B-I-O tagging), NER (named entity recognition), and entity extraction.

The analysis model unit 140 may also use the KoBERT pre-trained model to fine-tune the named entity recognition model for the documents and text of the National Archives of Korea.
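
The hybrid approach can be pictured as several independent detectors whose results are pooled. The sketch below is an assumption-laden illustration: it wires together a regular-expression detector and a dictionary detector, while the KoBERT-based person-name detector is left out because its interface is not described. All function names, the e-mail expression, and the dictionary entries are hypothetical.

```python
import re


def regex_detector(text):
    """No. 6, regular-expression type: a simplified e-mail pattern."""
    return [(m.start(), m.end(), "no6-regex")
            for m in re.finditer(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)]


def dictionary_detector(text, words=("방위산업", "군사계획")):
    """No. 2 type: exact lookup of dictionary words (entries are examples)."""
    spans = []
    for w in words:
        for m in re.finditer(re.escape(w), text):
            spans.append((m.start(), m.end(), "no2-dictionary"))
    return spans


def detect(text, detectors):
    """Run every detector and return all (start, end, reason) spans in order."""
    return sorted(span for d in detectors for span in d(text))


text = "군사계획 문의: kim@example.kr"
spans = detect(text, [regex_detector, dictionary_detector])
matched = [(text[s:e], reason) for s, e, reason in spans]
```

Because every detector returns spans in the same shape, a model-based detector could be appended to the list without changing `detect`.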

FIG. 17 schematically shows a configuration in which the system for open reclassification of electronic records according to an embodiment of the present invention is linked with CAMS (RMS) and a masking server.

Referring to FIG. 17, the system 100 for open reclassification of electronic records according to an embodiment of the present invention can be linked with CAMS/RMS and the masking server through a communication network. Hereinafter, the system 100 for open reclassification of electronic records is referred to simply as the 'open reclassification server 100'.

CAMS/RMS includes storage, a WEB/WAS server, and a DB server, and transmits records subject to open reclassification to the learner/analyzer of the analysis model unit 140 (①).

The learner/analyzer of the analysis model unit 140 then transmits the open reclassification results to CAMS/RMS (②).

When CAMS/RMS transmits a partially disclosed record to the masking server 150 (②), the masking server 150 transmits the masked record back to CAMS/RMS (③).

When a business PC or similar device requests to view a record (①), CAMS/RMS provides the record viewing service to the business PC (④).

That is, CAMS/RMS and the masking server 150 provide the following functions.

1) Through machine learning, records containing non-public words are detected and automatically classified.

2) Based on the open reclassification standards document, a learning model (pattern) for non-public classification is generated through machine learning.

3) Using the generated learning model (pattern), records subject to open reclassification are automatically classified, the non-public words are stored in the DB, and the results are delivered to CAMS.

4) When a request is made to view a record subject to partial disclosure, the masking server masks the non-public words in that record and provides the viewing service.

5) The masking server handles PDF files only, and masked records can likewise be provided as PDF files.

Meanwhile, the linkage agent of the open reclassification server 100 is linked via an API to the open reclassification functions of CAMS and RMS. It transmits records subject to open reclassification to the open reclassification server 100, and transmits the open reclassification result DB to the open reclassification DB tables of CAMS and RMS via the API, where it is stored.

The linkage agent of the masking server 150 transmits partially disclosed records of CAMS and RMS to the masking server 150 via the API, delivers the open reclassification result DB of the learner/analyzer to the masking server 150 via the API, and receives the masked records from the masking server 150 via the API and stores them in the DB.

Through the CAMS/RMS open reclassification agent, which is linked to the open reclassification functions of CAMS and RMS, the open reclassification server 100 receives and stores the records subject to open reclassification that are sent to it via the API, and transmits the open reclassification server DB to the open reclassification DB tables of CAMS and RMS via the API.

In addition, through its text conversion module, the open reclassification server 100 converts the records received from the CAMS/RMS API into text, stores them, and transmits the converted text files to the learner/analyzer.

Through the learner/analyzer, the open reclassification server 100 also generates a learning model by repeatedly training on the text records received from the text conversion module, classifies the records subject to open reclassification with the generated model, stores the classification results and the reasons for non-disclosure in the learner/analyzer DB, and transmits the open reclassification results to the DB through the API linkage module of CAMS and RMS.

Through the CAMS/RMS-linked masking agent, the masking server 150 receives a CAMS/RMS request to view a partially disclosed record, has the record transmitted to the masking server 150 via the API, and transmits the masked record to CAMS/RMS via the linkage API for registration in the DB.

In addition, through its masking module, the masking server 150 receives the PDF file and the open reclassification DB result values, searches for and detects the keywords corresponding to those values, stores the keyword detection results in the DB, performs a first review of the initial search results, edits anything that needs correction with a PDF editor, masks the non-public keywords in the edited PDF file, and transmits the masked file to CAMS/RMS via the masking API.
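
The search, record, and mask sequence performed by the masking module can be sketched on plain text. This is a simplification under stated assumptions: the masking server described here works on PDF files, whereas the sketch operates on a string, and the function name, mask character, and sample data are hypothetical.

```python
def mask_keywords(text, keywords, mask_char="*"):
    """Find every keyword occurrence, record it, and overwrite it with mask_char."""
    detections = []
    for kw in keywords:
        if not kw:
            continue  # an empty keyword would match everywhere
        start = text.find(kw)
        while start != -1:
            detections.append((start, start + len(kw), kw))
            start = text.find(kw, start + len(kw))
    chars = list(text)
    for s, e, _ in detections:
        chars[s:e] = mask_char * (e - s)  # same length, so positions stay valid
    return "".join(chars), sorted(detections)


text = "담당자 홍길동 (hong@example.kr)"
masked, detections = mask_keywords(text, ["홍길동", "hong@example.kr"])
```

Returning the detection list alongside the masked text mirrors the step in which detection results are stored for review before the final masking is accepted.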

Meanwhile, the system 100 for open reclassification of electronic records according to an embodiment of the present invention can prevent characters and files from being lost when electronic records are converted to text.

That is, missing characters and files are minimized when the open reclassification server 100 converts the original record to text for use, rather than converting a PDF-converted file to text. Verification confirmed that even special characters in PowerPoint files were converted to text at a rate close to 100%.

In addition, the system 100 for open reclassification of electronic records according to an embodiment of the present invention may perform exception handling, as shown in FIG. 18, for documents that need not be withheld. FIG. 18 is a diagram illustrating the exception handling process of the system for open reclassification of electronic records according to an embodiment of the present invention. In FIG. 18, the open reclassification server 100 retrieves the non-public keyword exception rules and the documents to be reclassified from the document database 160 and determines whether each document is to be non-public; if so (Yes), it performs non-public classification as described above, and if not (No), it may perform exception handling through the analyzer.

When a relevant keyword appears in a record subject to open reclassification, the open reclassification server 100 first classifies the record as non-public as a preemptive measure and then performs post-processing for disclosure exception handling. It manages the exception handling as patterns and converts the corresponding attributes from non-public information to public information.
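
The two-stage flow, a preemptive non-public classification followed by exception post-processing, can be sketched as below. How exceptions are actually represented is not described, so the sketch assumes a simple set of excepted strings (for example, the name of a public official who performed the duty); the function name and sample data are hypothetical.

```python
def reclassify(text, private_keywords, exceptions):
    """Mark every keyword found as non-public, then release the excepted ones."""
    # Stage 1: preemptive measure, every keyword hit is non-public.
    status = {kw: "non-public" for kw in private_keywords if kw in text}
    # Stage 2: post-processing, excepted items are converted to public.
    for kw in status:
        if kw in exceptions:
            status[kw] = "public"
    return status


text = "담당 공무원 김철수, 민원인 이영희"
result = reclassify(
    text,
    private_keywords=["김철수", "이영희", "박민수"],
    exceptions={"김철수"},  # assumed to be an official acting in that capacity
)
```

Classifying first and releasing afterwards means that a missing exception rule leaves information withheld rather than disclosed by mistake.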

In addition, in an embodiment of the present invention, the masking server 150 may provide the following functions.

- It reads the results of the learner/analyzer (titles of records classified as non-public and the list of non-public words), finds the non-public words in the corresponding PDF file, and masks them.

- To implement this, a PDF file editor and a masking function are provided, along with a DB daemon for receiving the results of the learner/analyzer.

- It is configured so that the non-public words to be masked are first detected and can then be corrected and given a final review, which enables more stable operation.

- Linkage with CAMS is built using an API (Application Programming Interface), and the DB result values are received from the learner/analyzer before masking is performed.

- Because running the masking server does not impose a heavy load, it can be deployed as a logically separate server rather than a physically separate one.

- As a way of disclosing more of the partially disclosable records among those requested for disclosure, it can raise an institution's disclosure rate and thereby improve administrative transparency.

As described above, the present invention makes it possible to realize a system and method for open reclassification of electronic records that classify and mask the non-public portions, such as personal information, of electronic records disclosed by public institutions so that the records can be disclosed.

Those skilled in the art to which the present invention pertains will appreciate that the invention can be embodied in other specific forms without changing its technical idea or essential features, and the embodiments described above should therefore be understood as illustrative in all respects and not restrictive. The scope of the present invention is defined by the claims below rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

100: system for open reclassification of electronic records
110: communication unit
120: document converting unit
130: control unit
140: analysis model unit
150: mask processing unit
160: database (DB)

Claims

A system for open reclassification of electronic records, comprising:
a database (DB) for storing one or more electronic records;
a document converting unit for fetching a target electronic record from among the one or more electronic records and converting it into text data;
an analysis model unit that performs, on the converted text data, pattern analysis corresponding to each type of personal information by a hybrid method including at least two of a machine-learning pre-trained pattern, a regular-expression pattern, a word dictionary + character-count pattern, and a word dictionary pattern, and thereby identifies and reclassifies the non-public portion; and
a control unit for controlling reclassification and disclosure of the one or more electronic records.

The system for open reclassification of electronic records according to claim 1,
further comprising a mask processing unit for masking the non-public portion when the reclassified electronic record is disclosed.

The system for open reclassification of electronic records according to claim 1,
wherein the regular-expression pattern includes an address,
and the analysis model unit handles as an exception, among addresses matching the regular-expression pattern, any string that includes a word to be disclosed stored in the database.

The system for open reclassification of electronic records according to claim 1,
wherein the machine-learning pre-trained pattern includes person names,
and the person names are machine-learned and the necessary entities are extracted using a KoBERT (Korean Bidirectional Encoder Representations from Transformers) model.

The system for open reclassification of electronic records according to claim 4,
wherein the KoBERT model is a PyTorch model that uses a Korean NER (Named Entity Recognition) tagger based on BERT + CRF (Conditional Random Field), and uses the NER tagger's PER (person name), LOC (place name), ORG (organization name), POH (other), DAT (date), TIM (time), DUR (duration), MNY (currency), PNT (ratio), and NOH (other quantity expression) tags.

The system for open reclassification of electronic records according to claim 5,
wherein the KoBERT model processes the converted text data as natural language by applying task-specific training data to a pre-trained model, using a sentence-level embedding technique that converts text into a numerical form a computer can process, and recognizes and extracts the necessary entities, such as person names, place names, and organization names, through post-processing (fine-tuning) with the training data.

According to claim 1,
The word dictionary + number of characters pattern or the word dictionary pattern classifies the private word No. 6 (personal information) in the converted text data into two criteria by personal information type and business classification system, and A system for open reclassification of electronic records, characterized by adding spaces between letters in and detecting even words in which spacing errors occur during text conversion.

The system according to claim 1,
wherein the word dictionary + character count pattern or the word dictionary pattern classifies the No. 2 (national security, defense, unification, diplomatic relations) non-public words in the converted text data, and adds spaces between the characters of each base non-public word so that words with spacing errors introduced during text conversion are also detected.

The system according to claim 5 or 6,
wherein, in the KOBERT model, a non-public part is handled as an exception when it falls under any one of the following: information that may be inspected as prescribed by law; information whose disclosure does not unreasonably infringe on an individual's privacy or freedom; information whose disclosure is deemed necessary for the public interest or for the protection of an individual's rights; the names and positions of public officials who performed their duties; and the names and occupations of individuals to whom the State or a local government has entrusted or commissioned part of its duties.

The system according to claim 1,
wherein the regular expression pattern includes e-mail address, passport number, resident registration number, alien registration number, business registration number, serial number, driver's license number, card number, address, and lawsuit number,
the machine-learning pre-learning pattern includes No. 6 (personal information) non-public words and person names,
the word dictionary + character count pattern includes No. 6 non-public words and No. 2 (national security, defense, unification, diplomatic relations) non-public words, and
the word dictionary pattern includes other non-public words.
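A few of the regular expression patterns listed above might look like the following sketch. The expressions are simplified assumptions that check only the written layout (for example, 6 digits, a hyphen, and 7 digits for a resident registration number) and do not validate check digits; the labels and the `scan` function are hypothetical names.

```python
import re

# Simplified, layout-only patterns; real rules would be stricter.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    # Resident registration number: 6 digits, hyphen, 7 digits.
    "resident_registration_number": re.compile(r"(?<!\d)\d{6}-\d{7}(?!\d)"),
    # Business registration number: 3-2-5 digits.
    "business_registration_number": re.compile(r"(?<!\d)\d{3}-\d{2}-\d{5}(?!\d)"),
}


def scan(text):
    """Return (label, matched_text) for every pattern hit in the text."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group()))
    return found


# Dummy values for illustration only.
sample = "문의: hong@example.com, 주민번호 900101-1234567, 사업자번호 123-45-67890"
for label, value in scan(sample):
    print(label, value)
```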

A method for disclosure reclassification of electronic records in a system that stores at least one electronic record in a DB and reclassifies electronic records to be disclosed among the one or more electronic records, the method comprising:
(a) converting, by a document converting unit, the electronic records to be disclosed among the one or more electronic records into text data;
(b) performing, by an analysis model unit, machine learning by loading the converted text data as learning data and, simultaneously, pattern analysis in a hybrid manner comprising at least two of a regular expression pattern, a machine-learning pre-learning pattern, a word dictionary + character count pattern, and a word dictionary pattern; and
(c) classifying and reclassifying, by a control unit, non-public parts of the electronic records to be disclosed based on the results of the machine learning and the pattern analysis.
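The hybrid analysis in step (b) can be pictured as several independent detectors whose results are combined. The sketch below is an assumption about one possible arrangement: it runs two stand-in detectors one after another (not concurrently), omits the machine-learning detector entirely, and merges overlapping character spans. All function names and the sample dictionary are hypothetical.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def regex_detector(text: str) -> List[Span]:
    """Stand-in for the regular expression pattern (6 digits, hyphen, 7 digits)."""
    return [m.span() for m in re.finditer(r"\d{6}-\d{7}", text)]


def dictionary_detector(text: str) -> List[Span]:
    """Stand-in for the word dictionary pattern, using a hypothetical word list."""
    spans = []
    for word in ["계좌번호"]:
        start = text.find(word)
        while start != -1:
            spans.append((start, start + len(word)))
            start = text.find(word, start + 1)
    return spans


def merge_spans(spans: List[Span]) -> List[Span]:
    """Merge overlapping or touching spans into a sorted, disjoint list."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def analyze(text: str, detectors: List[Callable[[str], List[Span]]]) -> List[Span]:
    """Run every detector and combine the non-public spans they report."""
    spans: List[Span] = []
    for detect in detectors:
        spans.extend(detect(text))
    return merge_spans(spans)


print(analyze("계좌번호 900101-1234567", [regex_detector, dictionary_detector]))
```

Keeping each pattern type behind the same span-returning interface makes it straightforward to add a named-entity detector later without changing the merging step.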

The method according to claim 11,
wherein, in step (b), the machine learning by the analysis model unit comprises:
(b-1) setting up a library of the non-public words to be identified in the one or more electronic records;
(b-2) creating, for the non-public words, an analysis framework for each content file that includes No. 6 (personal information) non-public words, No. 2 (national security, defense, unification, diplomatic relations) non-public words, and other non-public words;
(b-3) performing B (begin)-I (inside)-O (outside) tagging on the learning data; and
(b-4) recognizing and extracting entity names, including person names, place names, and institution names, through post-processing (fine-tuning) on the learning data based on the BIO tagging.

The method according to claim 12,
wherein, in step (b-3), the analysis model unit performs the BIO tagging on the learning data using the ten basic classification tags of the CRF-based Korean NER tagger: PER (person name), LOC (place name), ORG (organization name), POH (other), DAT (date), TIM (time), DUR (period), MNY (currency), PNT (ratio), and NOH (other quantity expression).
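BIO tagging with the ten classification tags can be illustrated as below. This is a minimal sketch assuming the text is already tokenized and that entity spans are given as token index ranges; the `B-PER` / `I-PER` label spelling, the sample tokens, and the function name are assumptions, since the claims do not fix a label format.

```python
# The ten basic classification tags named in the claims.
TAGS = ["PER", "LOC", "ORG", "POH", "DAT", "TIM", "DUR", "MNY", "PNT", "NOH"]


def bio_tags(tokens, entities):
    """Label tokens with B-/I-/O tags.

    entities: list of (start_token, end_token_exclusive, tag) triples.
    """
    labels = ["O"] * len(tokens)
    for start, end, tag in entities:
        if tag not in TAGS:
            raise ValueError(f"unknown tag: {tag}")
        labels[start] = f"B-{tag}"          # first token of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{tag}"          # remaining tokens of the entity
    return labels


# Hypothetical tokenization and annotations.
tokens = ["홍", "길동", "은", "서울", "시청", "에", "방문"]
entities = [(0, 2, "PER"), (3, 5, "ORG")]
for token, label in zip(tokens, bio_tags(tokens, entities)):
    print(token, label)
```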

The method according to claim 11,
wherein, in step (c), when disclosing the reclassified electronic records, the control unit masks the non-public parts through a mask processing unit and then discloses the reclassified electronic records.
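The masking step can be sketched as replacing every character inside a detected non-public span before the record is released. The function name, the choice of `*` as the mask character, and the decision to leave whitespace untouched are assumptions made for illustration.

```python
def mask_spans(text, spans, mask_char="*"):
    """Replace characters inside each (start, end) span with mask_char.

    Whitespace inside a span is kept so the layout of the text is preserved.
    """
    chars = list(text)
    for start, end in spans:
        for i in range(start, end):
            if not chars[i].isspace():
                chars[i] = mask_char
    return "".join(chars)


# Hypothetical span covering the person name.
print(mask_spans("성명 홍길동 연락처", [(3, 6)]))
```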

The method according to claim 11,
wherein, in step (b), the word dictionary + character count pattern or the word dictionary pattern classifies the No. 6 (personal information) non-public words in the converted text data by two criteria, the type of personal information and the business classification system, classifies the No. 2 (national security, defense, unification, diplomatic relations) non-public words in the converted text data, and adds spaces between the characters of each base non-public word so that words with spacing errors introduced during text conversion are also detected.