KR102319492B1

KR102319492B1 - AI Deep learning based sensitive information management method and system from images

Info

Publication number: KR102319492B1
Application number: KR1020200049477A
Authority: KR
Inventors: 박노원; 강지훈; 장희준
Original assignee: 주식회사 컴트루테크놀로지
Priority date: 2020-04-23
Filing date: 2020-04-23
Publication date: 2021-10-29

Abstract

The present invention provides a personal information processing system using AI deep learning and a personal information processing method using the same. The method comprises: a pre-processing step in which a pre-processing unit extracts an image file from a target file, detects a region of interest that is determined to contain personal information in the extracted image file, and performs processing for improving the image recognition rate for the region of interest; a personal information extraction step in which a personal information region extraction unit uses a CNN to extract processing target information including personal information from the region of interest detected by the pre-processing unit; a pattern analysis step in which a pattern analysis unit compares the pattern of the personal information extracted by the personal information region extraction unit with a preset personal information pattern to extract the personal information of the processing target information; and a de-identification processing step in which a de-identification processing unit de-identifies the personal information extracted by the pattern analysis unit. By analyzing the text and image information included in the image file with an artificial intelligence neural network, and then detecting and extracting the personal information included therein (personal identification number information, fingerprint information, etc.), the personal information retained by an individual or an institution can be effectively detected, de-identified, and thereby protected.

Description

Personal information processing system using AI deep learning and personal information processing method using the same {AI Deep learning based sensitive information management method and system from images}

The present invention relates to a personal information processing system using deep learning. More specifically, it relates to a system that uses an artificial intelligence (AI) neural network to analyze the text and image information contained in an image file and then detects and extracts the personal information included therein, so that personal information held by individuals or institutions can be effectively detected and de-identified.

In general, with the development of computers, work carried out by individuals and within companies is being computerized, and the resulting data is shared or disclosed over networks. When such data is leaked through hacking or through a management mistake by a company (institution), files containing large amounts of personal information are exposed at home and abroad, which seriously damages the external credibility of the company or institution. Individuals, companies, and institutions therefore need countermeasures that keep personal information from being exposed.

As an example of a technology for preventing such exposure, Korean Patent Registration No. 10-1734442 discloses a method of protecting personal information when goods are traded through posts on an Internet bulletin board, the method comprising: a step in which a post written by a seller is requested to be uploaded to a web server of the Internet bulletin board, the post including a post header and a post body; a step in which the web server extracts personal information included in the post body; a step in which the extracted personal information is stored in a personal information database; and a step in which the post body, with the extracted personal information replaced by a link, is stored in a body database.

Meanwhile, with the spread of mobile phones and the generalization of photography, more files are now created as images or videos than as text files, and in file transfer as well, low-capacity image files are mainly used in order to reduce file size.

However, despite the growing number of image files, conventional techniques for preventing personal information exposure work by extracting text. They therefore have difficulty detecting personal information contained in image files on a server and are not effective at de-identifying it, so the personal information contained in image files is not adequately protected.

Korean Patent Registration No. 10-1734442

An object of the present invention is to provide a personal information processing system using deep learning, and a personal information processing method using the same, that analyze the text and image information contained in an image file with an artificial intelligence neural network and then detect and extract the personal information included therein, so that personal information held by individuals or institutions is effectively detected, de-identified, and thereby protected.

According to one aspect of the present invention, there may be provided a personal information processing system using deep learning, comprising: a pre-processing unit that extracts an image file from a target file, detects a region of interest determined to contain personal information in the extracted image file, and performs processing for improving the image recognition rate for the region of interest; a personal information region extraction unit that extracts processing target information including personal information from the region of interest detected by the pre-processing unit; a pattern analysis unit that compares the pattern of the processing target information extracted by the personal information region extraction unit with a preset personal information pattern to extract the personal information of the processing target information; and a de-identification processing unit that de-identifies the personal information extracted by the pattern analysis unit.

Here, the pre-processing unit may include: a template matching comparison unit that compares the image file with preset templates and classifies the image according to the possibility that it contains personal information; a region detection unit that detects the region of interest in the image file selected by the template matching comparison unit, through a region detection algorithm in which the formats that contain personal information have been learned by an artificial neural network; and an image correction unit that geometrically corrects the region-of-interest image.

The template matching comparison unit may be configured to classify the image file as either a formal form or an atypical form according to a template matching index between the image file and the templates, and the matching comparison algorithm based on the template matching index may be trained through a convolutional neural network (CNN) for learning the template matching index.
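The classification by template matching index can be illustrated with a minimal Python sketch. It assumes the index has already been computed per template (in the invention this would come from the trained CNN) and is normalized to the range 0 to 1. The threshold value, the names `FORMAL_THRESHOLD`, `MatchResult`, and `classify_by_index`, and the template names are all hypothetical; the description does not state a numeric cut-off.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical cut-off: the description does not give a numeric value.
FORMAL_THRESHOLD = 0.8


@dataclass
class MatchResult:
    template: str  # name of the best-matching template
    index: float   # template matching index, assumed to lie in [0, 1]
    form: str      # "formal" or "atypical"


def classify_by_index(scores: dict[str, float],
                      threshold: float = FORMAL_THRESHOLD) -> MatchResult:
    """Pick the best-scoring template and label the image formal or atypical."""
    if not scores:
        raise ValueError("at least one template score is required")
    # The template with the highest index decides the classification.
    template, index = max(scores.items(), key=lambda item: item[1])
    form = "formal" if index >= threshold else "atypical"
    return MatchResult(template, index, form)


result = classify_by_index({"driver_license": 0.93, "passport": 0.41})
print(result.form, result.template)
```

Keeping the threshold as a parameter makes it easy to tune how strictly an image must resemble a known form before it is treated as a formal form.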

The preset templates may consist of templates of regular forms that contain personal information, including a driver's license, an identification card, a visa, or an alien certificate.

The region detection algorithm may be configured to detect the region of interest on the basis of feature points, namely vertices (points), corners, or line segments (lines) identified in the image file, using a Rectangular Corner Localization (RCL) neural network to find the optimal ratio of the decision weights for the feature points.

The region detection algorithm may be configured to convolve the pixel value matrix of the image file with a gradient derivative matrix so that the region is detected in the form of a box.
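As a rough illustration of convolving a pixel value matrix with a gradient derivative matrix, the following pure-Python sketch uses the Sobel kernels, which are one common choice of derivative matrix. The description does not name a specific kernel, so Sobel, the threshold, and the function names are assumptions made for this example only.

```python
# Sobel kernels: one common gradient (derivative) matrix. Assumed here;
# the description does not specify which derivative matrix is used.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def filter3x3(pixels, kernel):
    """Slide a 3x3 kernel over the image (valid region only, no padding).

    This is the cross-correlation form used by CNN layers; for Sobel the
    absolute response equals that of a flipped-kernel convolution.
    """
    height, width = len(pixels), len(pixels[0])
    return [
        [
            sum(kernel[i][j] * pixels[r + i][c + j]
                for i in range(3) for j in range(3))
            for c in range(width - 2)
        ]
        for r in range(height - 2)
    ]


def gradient_magnitude(pixels):
    """Approximate gradient magnitude as |gx| + |gy|."""
    gx = filter3x3(pixels, SOBEL_X)
    gy = filter3x3(pixels, SOBEL_Y)
    return [[abs(a) + abs(b) for a, b in zip(row_x, row_y)]
            for row_x, row_y in zip(gx, gy)]


def edge_bounding_box(pixels, threshold):
    """Return (top, left, bottom, right) enclosing all strong edge responses.

    Coordinates refer to the original image; None means no edge was found.
    """
    magnitude = gradient_magnitude(pixels)
    # Output cell (r, c) is centred on input pixel (r + 1, c + 1).
    hits = [(r + 1, c + 1)
            for r, row in enumerate(magnitude)
            for c, value in enumerate(row) if value > threshold]
    if not hits:
        return None
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return (min(rows), min(cols), max(rows), max(cols))


# A 7x7 image with a bright 3x3 block in the middle.
image = [[255 if 2 <= r <= 4 and 2 <= c <= 4 else 0 for c in range(7)]
         for r in range(7)]
print(edge_bounding_box(image, threshold=100))
```

The returned box is one pixel wider than the bright block on each side, because a 3x3 kernel responds wherever its window straddles the edge.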

The region detection algorithm may be configured to predict the vertices of a rectangle for each entity, and to detect the region of interest by determining whether the region-of-interest image is perspective-aligned, rotated, or missing a vertex.

The region detection unit may be configured to connect the features, consisting of two corners, three line segments, or four vertices, and detect them in the form of a box.

The region detection unit may be configured to extract the contours of the image file by applying an image morphological filter to the image file so that the image characteristics of the region of interest are emphasized.

The image correction unit may include a perspective correction module that applies a perspective transform so that the viewpoint of the region-of-interest image is positioned perpendicular to the image and the region-of-interest image forms a flat plane, and a rotation correction module that applies a rotation transform to correct the tilt angle of the region-of-interest image.
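Both corrections can be expressed as 3x3 matrices acting on homogeneous pixel coordinates. The sketch below is an illustration under assumptions, not the patented implementation: it builds a rotation matrix that levels a tilted edge and applies an arbitrary 3x3 matrix with the projective division that distinguishes a perspective transform from an affine one. The function names and the example matrix values are hypothetical.

```python
import math


def rotation_matrix(angle_deg, cx=0.0, cy=0.0):
    """3x3 homogeneous matrix rotating by angle_deg about the pivot (cx, cy).

    Positive angles turn the x-axis towards the y-axis.
    """
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    # p' = R (p - c) + c, written as a single matrix with a translation column.
    return [
        [cos_a, -sin_a, cx - cos_a * cx + sin_a * cy],
        [sin_a, cos_a, cy - sin_a * cx - cos_a * cy],
        [0.0, 0.0, 1.0],
    ]


def apply_transform(matrix, point):
    """Apply a 3x3 matrix to (x, y), including the projective division by w."""
    x, y = point
    u = matrix[0][0] * x + matrix[0][1] * y + matrix[0][2]
    v = matrix[1][0] * x + matrix[1][1] * y + matrix[1][2]
    w = matrix[2][0] * x + matrix[2][1] * y + matrix[2][2]
    return (u / w, v / w)


def skew_angle(p_left, p_right):
    """Tilt in degrees of the edge running from p_left to p_right."""
    return math.degrees(math.atan2(p_right[1] - p_left[1],
                                   p_right[0] - p_left[0]))


def deskew_matrix(p_left, p_right):
    """Rotation that makes the edge p_left -> p_right horizontal, pivoting on p_left."""
    return rotation_matrix(-skew_angle(p_left, p_right), *p_left)


# Rotation correction: an edge tilted by 45 degrees becomes horizontal.
m = deskew_matrix((0.0, 0.0), (1.0, 1.0))
print(apply_transform(m, (1.0, 1.0)))

# Perspective correction: a non-trivial bottom row rescales points by depth.
perspective = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.5, 0.0, 1.0]]
print(apply_transform(perspective, (2.0, 2.0)))
```

In practice the perspective matrix would be estimated from the four detected vertices of the region of interest; that estimation step is omitted here.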

The pre-processing unit may further include: an image array unit that assigns a connection identification number to each of the image files extracted from the target file; and an image improvement unit that removes quality-degrading factors from the image file.

The connection identification number may include a file name, a page, and an image name for each image file so that the location information and conversion information of the image file can be identified.
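One possible representation of such a connection identification number is a small immutable record with the three fields named above. The class name `LinkId`, the `|` separator, and the string encoding are assumptions chosen for illustration; the description only states which fields are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkId:
    """Hypothetical connection identification number for one extracted image."""
    file_name: str   # source file the image was extracted from
    page: int        # page within the source file
    image_name: str  # name of the image (box region) on that page

    def encode(self) -> str:
        """Serialize to a single string; '|' is an assumed separator."""
        return f"{self.file_name}|{self.page}|{self.image_name}"

    @classmethod
    def decode(cls, text: str) -> "LinkId":
        """Parse an encoded id; the image name must not contain '|'."""
        # Split from the right so a '|' inside the file name is tolerated.
        file_name, page, image_name = text.rsplit("|", 2)
        return cls(file_name, int(page), image_name)


link = LinkId("report.pdf", 3, "img_002")
print(link.encode())
```

Because the record is frozen and hashable, it can serve directly as a dictionary key when later stages store coordinates or de-identification results per image.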

The image improvement unit may be configured to remove quality-degrading factors of the image file, including noise, blur, or occlusion, through an image improvement algorithm trained with a CNN.

The personal information region extraction unit may be configured to extract the processing target information from the region of interest through an extraction algorithm trained with a CNN.

The processing target information may be personal information in the form of text.

The personal information region extraction unit may be configured to detect, through an extraction algorithm trained with a CNN, a text unit containing text in the region of interest and to extract a text string from the text unit, wherein a region of the region of interest in which image segments of a certain length and height alternate continuously with blank spaces is detected as the text unit.
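The idea of a text unit as a run of ink segments separated by short blanks can be sketched with a classical column-projection heuristic. This is a deliberately simplified stand-in for the CNN-trained extraction algorithm: the binary input format, the `max_gap` and `min_width` parameters, and the function name are assumptions.

```python
def find_text_units(binary, max_gap=2, min_width=3):
    """Group inked columns into text units.

    binary: list of rows of 0/1 values (1 = ink) for one region of interest.
    Returns a list of (start_col, end_col) pairs, both inclusive. Inked
    columns separated by at most max_gap blank columns stay in one unit.
    """
    if not binary:
        return []
    width = len(binary[0])
    # A column counts as inked if any row has ink in it.
    ink = [any(row[c] for row in binary) for c in range(width)]

    units = []
    start = None
    last_ink = None
    for col, has_ink in enumerate(ink):
        if not has_ink:
            continue
        if start is None:
            start = col
        elif col - last_ink - 1 > max_gap:
            # The blank run is too long: close the current unit.
            units.append((start, last_ink))
            start = col
        last_ink = col
    if start is not None:
        units.append((start, last_ink))

    # Discard units too narrow to hold text.
    return [(s, e) for s, e in units if e - s + 1 >= min_width]


row = [0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0]
print(find_text_units([row]))
```

Short gaps are treated as spacing between characters, while longer gaps separate one text unit from the next.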

The personal information region extraction unit may be configured to store the position and coordinates of the text unit and to store and manage the connection identification number data of the image file that contains the region of interest.

The processing target information may be personal information in the form of a fingerprint image or a facial image.

The personal information region extraction unit may be configured to detect, through an extraction algorithm trained with a CNN, an image pattern in the region of interest and to extract a fingerprint image or a facial image from the image pattern, wherein a pattern region of the region of interest that has a fingerprint pattern or facial pattern connected by feature points is detected as the image pattern.

The pattern analysis unit may include a personal information pattern analysis unit that compares the text with preset personal information patterns, including a resident registration number, driver's license number, passport number, foreigner registration number, and fingerprint, and a document form analysis unit that compares the text with preset document forms, including a resident registration card, driver's license, passport, residence certificate, and alien certificate, wherein the preset personal information patterns may be generated by personal information generation logic.
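Comparison against a preset personal information pattern can be approximated with regular expressions. The sketch below covers only the 13-digit `YYMMDD-NNNNNNN` layout shared by resident and foreigner registration numbers, told apart by the first digit after the hyphen. These patterns are simplified assumptions for illustration: they check the date fields loosely, perform no checksum validation, and are not the patterns generated by the invention's personal information generation logic.

```python
import re

# Simplified, assumed patterns: date-like prefix, hyphen, then seven digits.
_DATE = r"\d{2}(?:0[1-9]|1[0-2])(?:0[1-9]|[12]\d|3[01])"
PATTERNS = {
    # First digit after the hyphen is 1-4 for resident numbers.
    "resident_registration_number": re.compile(
        rf"(?<!\d){_DATE}-[1-4]\d{{6}}(?!\d)"),
    # First digit after the hyphen is 5-8 for foreigner numbers.
    "foreigner_registration_number": re.compile(
        rf"(?<!\d){_DATE}-[5-8]\d{{6}}(?!\d)"),
}


def find_personal_info(text):
    """Return (kind, matched_text, start_offset) tuples in document order."""
    hits = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((kind, match.group(), match.start()))
    return sorted(hits, key=lambda hit: hit[2])


sample = "Name: Hong Gil-dong, ID 900101-1234567, tel 02-123-4567"
for kind, value, offset in find_personal_info(sample):
    print(kind, value, offset)
```

The look-behind and look-ahead guards keep the pattern from matching inside a longer digit string, which matters when OCR output runs fields together.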

The personal information pattern analysis unit and the document form analysis unit may be configured to compare and analyze the processing target information through a pattern analysis algorithm trained with a CNN.

When the personal information text extracted from the region of interest does not match a personal information pattern exactly but shows a high degree of similarity, the document form analysis unit causes the form to be re-inspected for a match through pattern matching.
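A near-match that triggers re-inspection can be illustrated by comparing the character-class "shape" of the extracted text with the expected shape of a pattern, which tolerates typical OCR confusions such as the letter O for the digit 0. The shape encoding, the position-wise similarity measure, and the 0.8 threshold are all assumptions for this sketch; the description does not define how similarity is computed.

```python
# Expected shape of a 13-digit registration number: 9 = digit, '-' literal.
RRN_SHAPE = "999999-9999999"


def shape_of(text):
    """Map digits to '9' and letters to 'A'; keep other characters as-is."""
    return "".join(
        "9" if ch.isdigit() else "A" if ch.isalpha() else ch
        for ch in text
    )


def shape_similarity(text, expected_shape=RRN_SHAPE):
    """Fraction of positions whose character class matches the expected shape.

    Strings of a different length score 0.0 (a crude simplification).
    """
    shape = shape_of(text)
    if len(shape) != len(expected_shape):
        return 0.0
    same = sum(a == b for a, b in zip(shape, expected_shape))
    return same / len(expected_shape)


def match_decision(text, expected_shape=RRN_SHAPE, recheck_threshold=0.8):
    """Return 'match', 'recheck' (similar but not exact), or 'no_match'."""
    similarity = shape_similarity(text, expected_shape)
    if similarity == 1.0:
        return "match"
    if similarity >= recheck_threshold:
        return "recheck"
    return "no_match"


print(match_decision("900101-1234567"))  # exact shape
print(match_decision("9O0101-l234567"))  # two OCR confusions
```

A "recheck" result would hand the text to the form-level pattern matching described above instead of discarding it outright.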

The de-identification processing unit may be configured to process the text by one or more of mosaic processing, hiding, encryption, pseudonymization, aggregation, data deletion, data categorization, or masking.
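Two of the listed methods, masking and mosaic processing, are simple enough to sketch directly. The example below masks the trailing digits of a text string and applies a block-average mosaic to a rectangular region of a grayscale pixel matrix. The function names, the number of characters kept, and the block size are assumptions for illustration.

```python
def mask_text(value, keep=7, mask_char="*"):
    """Keep the first `keep` characters and mask every later digit."""
    head, tail = value[:keep], value[keep:]
    return head + "".join(mask_char if ch.isdigit() else ch for ch in tail)


def mosaic(pixels, box, block=2):
    """Return a copy of a grayscale image with `box` pixelated.

    box is (top, left, bottom, right), all inclusive. Each block x block
    tile inside the box is replaced by its integer mean value.
    """
    top, left, bottom, right = box
    out = [row[:] for row in pixels]  # leave the input untouched
    for r0 in range(top, bottom + 1, block):
        for c0 in range(left, right + 1, block):
            # Tiles at the box border may be smaller than block x block.
            cells = [(r, c)
                     for r in range(r0, min(r0 + block, bottom + 1))
                     for c in range(c0, min(c0 + block, right + 1))]
            mean = sum(pixels[r][c] for r, c in cells) // len(cells)
            for r, c in cells:
                out[r][c] = mean
    return out


print(mask_text("900101-1234567"))

image = [[0, 10, 20], [30, 40, 50], [60, 70, 80]]
print(mosaic(image, box=(0, 0, 1, 1)))
```

Masking preserves the field layout so a reader can still tell what kind of value was removed, while the mosaic is applied at the stored coordinates of the detected region so the rest of the image stays legible.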

Meanwhile, when the region detection unit determines that the target file contains a detection region for a fingerprint image or a facial image, it may apply a correction value to the detection algorithm so that the region of interest is detected more frequently.

According to another aspect of the present invention, there may be provided a personal information processing method using deep learning, comprising: a pre-processing step in which a pre-processing unit extracts an image file from a target file, detects a region of interest determined to contain personal information in the extracted image file, and performs processing for improving the image recognition rate for the region of interest; a personal information extraction step in which a personal information region extraction unit uses a CNN to extract processing target information including personal information from the region of interest detected by the pre-processing unit; a pattern analysis step in which a pattern analysis unit compares the pattern of the personal information extracted by the personal information region extraction unit with a preset personal information pattern to extract the personal information of the processing target information; and a de-identification processing step in which a de-identification processing unit de-identifies the personal information extracted by the pattern analysis unit.

The pre-processing step may include: an image array step in which an image array unit assigns a connection identification number to each of the image files extracted from the target file; an image improvement step in which an image improvement unit removes quality-degrading factors from the image file; a template comparison step in which a template matching comparison unit compares the image file with preset templates, which consist of template groups of regular forms containing personal information, including a driver's license, an identification card, or an alien certificate, and classifies the image according to the possibility that it contains personal information; a step in which a region detection unit detects the region of interest in the image file selected by the template matching comparison unit, through a region detection algorithm in which the formats that contain personal information have been learned by an artificial neural network; and a correction step in which an image correction unit applies perspective correction and rotation correction to the region-of-interest image.

When the template matching comparison unit classifies the image file as a formal form, the region detection unit may be configured to extract the contours of the image file by applying an image morphological filter so that the image characteristics of the region of interest are emphasized.

The region detection algorithm may be configured to detect the region of interest on the basis of feature points including vertices (points), corners, or line segments (lines) contained in the image file, using a Rectangular Corner Localization (RCL) neural network to find the optimal ratio of the decision weights for the feature points.

The image improvement unit may remove quality-degrading factors of the image file, including noise, blur, or occlusion, through an image improvement algorithm trained with a CNN, and may be trained to find the removal value for the quality-degrading pixels.

The template matching comparison unit may classify the image file as either a formal form or an atypical form according to a template matching index between the image file and the templates, and the matching comparison algorithm based on the template matching index may be trained through a convolutional neural network (CNN) for learning the template matching index.

The processing target information may be personal information in the form of text, and the personal information region extraction unit may be configured to detect, through an extraction algorithm trained with a CNN, a text unit containing text in the region of interest and to extract a text string from the text unit, wherein a region of the region of interest in which image segments of a certain length and height alternate continuously with blank spaces is detected as the text unit.

The processing target information may be personal information in the form of a fingerprint image or a facial image, and the personal information region extraction unit may be configured to detect, through an extraction algorithm trained with a CNN, an image pattern in the region of interest and to extract a fingerprint image or a facial image from the image pattern, wherein a pattern region of the region of interest that has a fingerprint pattern or facial pattern connected by feature points is detected as the image pattern.

The pattern analysis unit may include a personal information pattern analysis unit that, through a pattern analysis algorithm trained with a CNN, compares the text with preset personal information patterns, including a resident registration number, driver's license number, passport number, foreigner registration number, and fingerprint, and a document form analysis unit that compares the text with preset document forms, including a resident registration card, driver's license, passport, residence certificate, and alien certificate.

The personal information processing system using deep learning and the personal information processing method using the same according to the present invention can provide the following effects.

First, the text and image information contained in an image file are analyzed with an artificial intelligence neural network, and the personal information included therein (personal identification number information, fingerprint information, etc.) is detected and extracted, so that personal information held by individuals or institutions is effectively detected, de-identified, and thereby protected.

Second, an artificial intelligence neural network is used to compare the form of an image file against the form index criteria of the preset templates and classify it as an atypical form or a formal form, and the file is processed accordingly, which shortens the time needed to detect personal information contained in the image file.

Third, detection performance is improved because an artificial intelligence neural network is used to determine the target position (target of interest: ROI) in the image file where a region of interest containing text is judged to exist.

Fourth, geometric singular points of the image file are extracted to locate the region of interest (Text Box), and the identified region of interest can be accurately de-identified according to preset criteria using a transformation matrix, so that personal information held by individuals or institutions can be effectively detected and de-identified.

FIG. 1 is a diagram showing the environment configuration of a personal information processing system using deep learning according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of the personal information processing system using deep learning according to the present invention.
FIG. 3 is a view showing a terminal screen in the case where the template matching comparison unit of the personal information processing system according to the present invention has judged form similarity as an index score.
FIG. 4 is a diagram showing image correction by the image correction unit of the personal information processing system according to the present invention, together with the corresponding matrix.
FIG. 5 is a diagram showing an embodiment of the processing method of the pre-processing unit of the personal information processing system according to the present invention.
FIG. 6 is a diagram showing the region detection method of the region detection unit of the personal information processing system according to the present invention.
FIG. 7 is an exemplary view showing an example of the perspective correction and rotation correction process in the pre-processing unit of the personal information processing system according to the present invention.
FIG. 8 is a diagram showing the text extraction process performed by the personal information region extraction unit of the personal information processing system according to the present invention.
FIG. 9 is a diagram showing the personal information pattern extraction process performed by the pattern analysis unit of the personal information processing system according to the present invention.
FIG. 10 is a diagram showing the de-identification process after personal information has been extracted using the personal information processing system according to the present invention.
FIG. 11 is a diagram showing an example of de-identification of a fingerprint by the de-identification processing unit in the personal information processing method according to the present invention.

이하, 첨부된 도면을 참조하여 본 발명의 바람직한 실시예를 상세히 설명하기로 한다. Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings.

먼저, 본 발명의 실시예에 따른 딥러닝을 이용한 개인정보 처리시스템(이하 '개인정보 처리시스템'이라 한다)은, 기관이나 개인이 보유하고 있는 개인 PC나 파일서버 등에서 이미지파일을 검사하여 개인정보가 유출(upload/download) 되는 것을 차단하는 것을 목적으로 한다. 한편, 이때 개인들이 사용하는 파일 종류는 PDF, 이미지파일, 문서에디터(word/Hwp)파일 등 다양한데, 본 발명에서는 OLE 이미지파일을 포함하는 다양한 형태의 이미지파일 들을 그 대상으로 한다.First, the personal information processing system (hereinafter referred to as 'personal information processing system') using deep learning according to an embodiment of the present invention examines image files in personal PCs or file servers owned by institutions or individuals to obtain personal information The purpose is to block the leakage (upload/download). Meanwhile, at this time, the types of files used by individuals are various, such as PDF, image files, and document editor (word/Hwp) files. In the present invention, various types of image files including OLE image files are targeted.

우선, 도 1은 본 발명에 따른 이미지 파일 검출을 위한 네트워크/서버(Network/Server) 환경구성을 나타낸다. 도면을 참조하면, 내부의 PC들을 스캔서버가 개인정보 검출 로직 스캔서버가 스캔서버를 돌면서 찾아내는 구조로서, 관공서 같은 기관 내에서 다수의 PC들이 사용되고 있고 내부의 네트워크가 외부의 인터넷(Internet)에 연결되어 있어서 게이트웨이(Gateway) 장비를 거쳐서 외부로 패킷(packet)이 출입되는 구조에 있을 경우에는 웹서버에 질의형태에 따라서 업로드 방향 검출 , 다운로드 방향 검출로 구분할 수 있다. 일반적으로 겟/포스트(get/post) 방식으로 구분되는 파일 업로드/다운로드 방식에 각각 대응한다.First, FIG. 1 shows the configuration of a network/server environment for image file detection according to the present invention. Referring to the drawing, the scan server finds the internal PCs while the scan server runs the scan server with personal information detection logic. A number of PCs are used in institutions such as government offices, and the internal network is connected to the external Internet. In the case of a structure in which packets are sent out through the gateway device, it can be divided into upload direction detection and download direction detection according to the type of query to the web server. In general, each corresponds to a file upload/download method divided into a get/post method.

도 2는 본 발명의 개인정보 처리시스템의 구성을 나타내는 블록도이다. 도면을 참조하면, 상기 개인정보 처리시스템은, 전처리부(100)와, 개인정보영역추출부(200)와, 패턴분석부(300)와, 비식별화처리부(400)를 포함하여 구성될 수 있다.2 is a block diagram showing the configuration of the personal information processing system of the present invention. Referring to the drawings, the personal information processing system may include a pre-processing unit 100 , a personal information area extraction unit 200 , a pattern analysis unit 300 , and a de-identification processing unit 400 . have.

First, the preprocessing unit 100 removes noise, blur, occlusion, and the like on a page-by-page or image-by-image basis. An image unit here is a region of interest (ROI) expected to contain personal information, hereinafter referred to as a region of interest, Text Box, or Text Box area. The preprocessing process covers everything up to, but not including, the extraction of personal information text from the region of interest (Text Box).

That is, the preprocessing unit 100 extracts image files from a target file, detects a region of interest judged to contain personal information in each extracted image file, and performs processing to improve the image recognition rate for that region of interest.

Specifically, the preprocessing unit 100 may include an image array unit 110, an image improvement unit 120, a template matching (template matching index) comparison unit 130, a region detection unit 140, and a geometric transformation unit 150.

The image array unit 110 assigns a connection identification number (linking address) to each of the image files extracted from the target file. Specifically, for an image file consisting of multiple pages, the image array unit 110 separates the images page by page, separates OLE objects, and assigns connection identification numbers at the file, page, and image (Box region) level.

The connection identification number may include the file name, page, and image name for each image file so that the location information and transformation information of the image file can be identified. It is completed by adding the additional information (Text Box location information and transformation angle information) obtained during text extraction in the personal information area extraction unit 200, and thus consists of location information and transformation information for the image.
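The structure of a connection identification number can be sketched as a small record. This is only an illustration under assumptions: the class name `LinkingAddress`, its field names, the `(x, y, width, height)` box layout, and the `#??` placeholder for a not-yet-identified image are hypothetical choices, not the patent's actual data format.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class LinkingAddress:
    """Hypothetical record for one connection identification number."""
    file_name: str
    page: int
    image_name: Optional[str] = None                  # unknown until the image unit is identified
    box: Optional[Tuple[int, int, int, int]] = None   # assumed Text Box layout: (x, y, width, height)
    angle_deg: Optional[float] = None                 # transformation (rotation) angle, if known

    def key(self) -> str:
        """Render the file/page/image path; 'image #??' marks an unidentified image."""
        image = self.image_name if self.image_name is not None else "image #??"
        return f"{self.file_name}/page_{self.page}/{image}"

    def complete(self, image_name: str, box: Tuple[int, int, int, int],
                 angle_deg: float) -> "LinkingAddress":
        """Return a copy filled in with the data obtained during text extraction."""
        return replace(self, image_name=image_name, box=box, angle_deg=angle_deg)


addr = LinkingAddress("File_name", 1)
done = addr.complete("image_1", (40, 60, 320, 48), 3.5)
print(addr.key())
print(done.key())
```

Keeping the record immutable and returning a completed copy mirrors the two-stage description: page-level information first, box location and angle added later.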

The image improvement unit (pre-processing unit, 120) may be configured to remove quality-degrading factors using an image improvement algorithm trained through a CNN. That is, the image improvement unit 120 can remove noise from the image file with a noise removal algorithm, remove blur with a blur processing algorithm, and handle occlusion with an occlusion processing algorithm.

To remove page-level noise according to a threshold value, the image improvement unit 120 removes quality-degrading factors using the image improvement algorithm trained through the CNN, and the CNN may be trained to find the removal values for the pixels of those quality-degrading factors.

The template matching comparison unit 130 compares the image file with preset templates and classifies the image according to how likely it is to contain personal information.

The template matching comparison unit 130 may apply a matching comparison algorithm trained through a CNN in order to compare the image with the registered forms (templates) it already holds through deep learning and express the similarity as an index.

The template matching comparison unit 130 treats image files containing an existing registered form as standard forms so that the detection time for such files can be reduced. For a standard-form image, the template matching comparison unit 130 applies the matching comparison algorithm trained through the CNN so that the binarization processing unit can convert the image to black and white, and then passes it to the viewpoint correction unit and the rotation correction unit. A non-standard (atypical) image file, by contrast, is passed to the next stage, the region detection unit 140.

The template matching comparison unit 130 is configured to classify each image file as a standard form or a non-standard form according to the template matching index between the image file and the templates, and the matching comparison algorithm based on the template matching index may be trained through a convolutional neural network (CNN) for template matching index learning.

The template matching comparison unit 130 is described in more detail below. If it is known in advance that the target image file to be inspected is, for example, an A4-size image of a photocopied driver's license, the template matching comparison unit can drastically reduce the preparation time required for preprocessing. In other words, when a neural network that has been trained (deep learning) on a specific form (driver's license) receives an image file of that form as input, it can more quickly determine the regions of interest (Text Box areas) to extract and proceed with text detection.

FIG. 3 shows an embodiment of the template matching comparison unit: a terminal screen in which the form similarity computed by the template matching comparison unit 130 is judged by template matching index score. Referring to the drawing, a neural network trained on four forms can judge which form the input image file corresponds to using index scores (3543/6980/972/1917); in this case the form of the input file can be inferred to be template_2, which has the highest similarity score (index score 6980). Because the neural network holds training data (classes) for four forms, it judges an image file to be a "standard form" when one of those forms is detected in it, and judges any input image file outside those four to be a "non-standard form".
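The decision described for FIG. 3 can be sketched as picking the best-scoring template and comparing its score with a cutoff. This is a minimal sketch under assumptions: the four scores are read as 3543, 6980, 972, and 1917; only template_2 is named in the text, so the other template names are hypothetical; and the threshold of 5000 is an illustrative value, not one given in the patent.

```python
from typing import Dict, Optional, Tuple


def classify_form(scores: Dict[str, float],
                  threshold: float) -> Tuple[str, Optional[str]]:
    """Return ("standard", best_template) or ("non-standard", None)."""
    best = max(scores, key=scores.get)   # template with the highest index score
    if scores[best] >= threshold:        # confident enough to treat as a known form
        return "standard", best
    return "non-standard", None


# Scores as read from FIG. 3; names other than template_2 are assumed.
fig3_scores = {
    "template_1": 3543,
    "template_2": 6980,
    "template_3": 972,
    "template_4": 1917,
}
THRESHOLD = 5000  # hypothetical cutoff

print(classify_form(fig3_scores, THRESHOLD))
print(classify_form({"template_1": 310, "template_2": 450}, THRESHOLD))
```

In this sketch a page whose best score falls under the cutoff is routed as a non-standard form, which corresponds to handing it to the region detection unit.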

As described above, the template matching comparison unit 130 classifies the templates it already holds through neural network training as standard forms, which has the advantage that detection is faster than for non-standard forms, for which no template information exists in advance as a class in the neural network's training data.

1) The region detection unit 140 detects the region of interest in the image file selected by the template matching comparison unit 130, using a region-of-interest extraction algorithm in which a neural network has learned the formats that contain personal information.

The region-of-interest extraction algorithm detects regions of interest based on feature points that identify vertices (points), corners, or line segments (lines) in the image file, and may be configured to use an RCL (Rectangular Corner Localization) neural network to find the optimal ratio of judgment weights for those feature points.

2) The region detection unit 140 may be configured to connect features consisting of two corners, three line segments, or four vertices and detect them as a box.

3) The region detection unit 140 may be configured to extract the contours of the image file by applying an image morphological filter to it so that the image characteristics of the region of interest stand out.

4) When the region detection unit 140 judges that the target file contains a detection region for a fingerprint image or a facial image, it may apply a correction value to the region-of-interest detection algorithm so that regions of interest are detected more frequently. If a target file contains a fingerprint or facial image and the associated personal information leaks along with it, the damage from the leak becomes severe, so in this case the personal information should be processed more thoroughly. Accordingly, when the region detection unit 140 judges that the target file contains a detection region for a fingerprint or facial image, it can raise the detection frequency so that more regions of interest are detected.

As described above, the region detection unit 140 must determine the regions of interest (Text Boxes), that is, the ROIs, in order to extract personal information. When the features do not connect into a complete box-shaped contour, it assigns weights to the feature points of the image (points, corners, line segments of constant thickness) and judges the area to be a region of interest if the score is at or above a certain value.

5) A basic and effective way for the region detection unit 140 to extract a region of interest is to find corners. A corner is a feature point that plays an important role in distinguishing geometric shapes and is fundamental to shape recognition. Because a corner is defined as a location where the image value (pixel value) changes sharply in two or more directions, the localization of a corner can provide an important clue for shape recognition.

6) The region detection unit 140 receives image files judged to be non-standard forms by the template matching comparison unit 130 and attempts to detect multiple regions of interest (Text Boxes) per page. To make the regions of interest easier to detect, it uses a region detection algorithm trained through a CNN and detects the feature points with the CNN.

7) The region detection unit 140 passes the location information of each detected region of interest to the image array unit 110 so that the connection identification number (linking address) information can be completed.

Next, the correction (transformation) process of the geometric transformation unit is as follows.

1) The geometric transformation unit 150 geometrically corrects the image to secure a parallel/perpendicular viewpoint for personal information detection, and passes the transformation information for each extracted region-of-interest image to the image array unit 110 so that the image can later be transformed back to its original position (linking address update).

2) The geometric transformation unit 150 may include a viewpoint correction module that applies a perspective transform so that the view point of the region-of-interest image lies in the perpendicular direction and the image forms a flat plane, and a rotation correction module that applies a rotation transform to correct the tilt angle of the region-of-interest image.

3) Here, in the geometric transformation unit 150, the rotation correction amounts of the viewpoint correction module about the vertical and horizontal axes in the horizontal plane, and the rotation correction amount about the vertical axis in the vertical plane, may be calculated through a correction algorithm.

FIG. 4 shows the image correction performed by the geometric transformation unit of the personal information processing system according to the present invention, together with the corresponding matrices. It illustrates viewpoint correction and rotation correction, which change the geometric composition of a region-of-interest image (image unit, Box region) extracted by the region detection unit 140 in order to improve accuracy.

First, (a) shows viewpoint correction. When the target image is tilted, that is, when the viewing direction taken from the view point is not aligned with the perpendicular to the image, viewpoint correction removes the tilt to obtain a plan view at 90 degrees to the subject; to do so, the current model is transformed by a matrix.

(b) shows rotation correction. When the image lies at some angle to the horizontal, for example because of how it was photocopied, rotation correction shifts it by x degrees to return it to a parallel state with no tilt.

(c) describes this correction as a transformation matrix in which the point (1, 0) is mapped to (1, tan θ). Rotation correction can likewise be expressed as a transformation matrix between the position of the image before the change and the position of the rotated image, and the pre-transformation position or shape can be restored through the inverse matrix. The results of these matrix operations are stored for each region of interest (Text Box) so that the region can be moved back to its original position when needed.
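The matrix view in (c) can be checked with a few lines of plain Python. This is a sketch under assumptions: the 2×2 shear matrix [[1, 0], [tan θ, 1]] is one matrix that sends (1, 0) to (1, tan θ), and the helper names are hypothetical; the actual matrices in FIG. 4 are not reproduced in the text.

```python
import math


def shear_matrix(theta_deg):
    """One 2x2 matrix that maps (1, 0) to (1, tan(theta))."""
    t = math.tan(math.radians(theta_deg))
    return [[1.0, 0.0], [t, 1.0]]


def rotation_matrix(theta_deg):
    """Standard counter-clockwise rotation by theta."""
    c, s = math.cos(math.radians(theta_deg)), math.sin(math.radians(theta_deg))
    return [[c, -s], [s, c]]


def inverse_2x2(m):
    """Inverse of a 2x2 matrix, used to restore the original position."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is not invertible")
    return [[d / det, -b / det], [-c / det, a / det]]


def apply(m, point):
    """Multiply a 2x2 matrix by a 2D point."""
    x, y = point
    return (m[0][0] * x + m[0][1] * y, m[1][0] * x + m[1][1] * y)


shear = shear_matrix(10)
moved = apply(shear, (1.0, 0.0))            # (1, tan 10 degrees)
restored = apply(inverse_2x2(shear), moved) # back to (1, 0)

rot = rotation_matrix(15)
tilted = apply(rot, (2.0, 3.0))
upright = apply(inverse_2x2(rot), tilted)   # back to (2, 3)
print(moved, restored, upright)
```

Storing the forward matrix per Text Box is enough for the round trip, since the inverse can be recomputed whenever the region has to be mapped back onto the original page.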

The processing performed by the personal information area extraction unit is as follows.

1) The personal information area extraction unit 200 extracts processing target information, which includes personal information, from the region of interest detected by the preprocessing unit 100.

2) The personal information area extraction unit extracts text from the image of the region of interest (Text Box) recommended by the preprocessing unit 100.

3) The personal information area extraction unit 200 may be configured to extract the processing target information from the region of interest using an extraction algorithm trained through a CNN.

4) The personal information area extraction unit 200 is configured to detect text units containing text in the region of interest using the extraction algorithm trained through the CNN and to extract text strings from those text units. It may detect as a text unit an area of the region of interest in which images of a certain length and height and blanks follow one another continuously.

5) The personal information area extraction unit 200 may be configured to store the position and coordinates of each text unit and to store and manage the connection identification number data of the image file containing the region of interest.

6) The personal information area extraction unit 200 is configured to detect image patterns in the region of interest using the extraction algorithm trained through the CNN and to extract a fingerprint image or facial image from those image patterns. It may be configured to detect as an image pattern a pattern area in the region of interest that has a fingerprint pattern or facial pattern connected by feature points.

The pattern analysis unit 300 checks whether the processing target information extracted by the personal information area extraction unit 200 matches a personal information pattern: it compares the pattern of that information against preset personal information patterns and extracts the personal information contained in the processing target information.

Here, the preset personal information patterns may be generated by personal information generation logic.

The pattern analysis unit 300 may include a personal information pattern analysis unit that compares the text against preset personal information patterns, including resident registration numbers, driver's license numbers, passport numbers, foreigner registration numbers, and fingerprints, and a document form analysis unit that compares the text against preset document forms, including resident registration cards, driver's licenses, passports, residence certificates, and foreigner certificates.
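Comparing extracted text against preset patterns can be illustrated with regular expressions. This is a simplified sketch, not the patent's pattern analysis algorithm (which is described as CNN-trained): the two patterns below only encode the commonly seen digit layouts of a resident registration number (6 digits, hyphen, 7 digits) and a driver's license number (2-2-6-2 digits), and the names `PATTERNS` and `find_personal_info` are hypothetical.

```python
import re

# Assumed, simplified layouts; real validation would need more than digit counts.
PATTERNS = {
    "resident_number": re.compile(r"(?<!\d)\d{6}-\d{7}(?!\d)"),
    "driver_license_number": re.compile(r"(?<!\d)\d{2}-\d{2}-\d{6}-\d{2}(?!\d)"),
}


def find_personal_info(text):
    """Return (kind, matched_text, (start, end)) for every pattern hit."""
    hits = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((kind, match.group(), match.span()))
    return hits


# Fabricated sample string for illustration only.
sample = "Name: Hong Gildong, ID 900101-1234567, licence 11-12-123456-78"
hits = find_personal_info(sample)
for kind, value, span in hits:
    print(kind, value, span)
```

The spans are kept alongside the matches so that a later de-identification step can locate exactly which characters to process.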

The personal information pattern analysis unit and the document form analysis unit may be configured to compare and analyze the processing target information using a pattern analysis algorithm (form index algorithm) trained through a CNN.

The de-identification processing unit 400 de-identifies the personal information extracted by the pattern analysis unit 300.

The de-identification processing unit 400 may be configured to process the text by one or more of mosaic processing, hiding, encryption, pseudonymization, aggregation, data deletion, data categorization, and masking.
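Two of the listed options, masking and data deletion, can be sketched for plain text as follows. This is an illustration under assumptions: the functions `mask_spans` and `delete_spans` and the `(start, end)` span convention are hypothetical, and the sample number is fabricated; the patent does not specify how each method is implemented.

```python
def mask_spans(text, spans, fill="*"):
    """Masking: overwrite letters and digits inside each (start, end) span."""
    chars = list(text)
    for start, end in spans:
        for i in range(start, end):
            if chars[i].isalnum():      # keep separators such as '-' readable
                chars[i] = fill
    return "".join(chars)


def delete_spans(text, spans):
    """Data deletion: remove each (start, end) span from the text."""
    for start, end in sorted(spans, reverse=True):  # right to left keeps offsets valid
        text = text[:start] + text[end:]
    return text


record = "ID 900101-1234567"
span = [(3, 17)]                        # location of the detected number
print(mask_spans(record, span))
print(delete_spans(record, span))
```

Masking preserves the layout of the field, which can matter when the result is written back into a form image, whereas deletion removes the value entirely.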

The personal information processing method using deep learning according to an embodiment of the present invention (hereinafter, the 'personal information processing method') is described below.

Since the configuration of the personal information processing system used in the personal information processing method of the present invention is as described above, its detailed description is omitted, and only the personal information processing method is described in detail below.

First, the personal information processing method performs a preprocessing step in which the preprocessing unit 100 extracts image files from a target file, detects a region of interest judged to contain personal information in each extracted image file, and performs processing to improve the image recognition rate for that region of interest.

The preprocessing step may include: an image array step in which the image array unit 110 assigns a connection identification number to each of the image files extracted from the target file; an image improvement step in which the image improvement unit 120 removes quality-degrading factors from the image file; a template comparison step in which the template matching comparison unit 130 compares the image file with preset templates, which consist of template groups of standard forms containing personal information such as driver's licenses, identification cards, and foreigner certificates, and classifies the image according to how likely it is to contain personal information; a region detection step in which the region detection unit 140 detects the region of interest in the image file selected by the template matching comparison unit 130, using a region-of-interest extraction algorithm in which a neural network has learned the formats that contain personal information; and a correction step in which the geometric transformation unit 150 applies viewpoint correction and rotation correction to the region-of-interest image.

The preprocessing step is preliminary work for text extraction (the personal information extraction process). The process of finding the image of each region of interest (Box Region) and handing it over to the personal information extraction process (step 2) is described below with reference to FIG. 5.

As shown in FIG. 5, the preprocessing process (step 1) involves the image array unit (image-array unit), the image improvement unit (pre-processing unit), the template matching comparison unit (template matching index comparison unit), the region detection unit (Box detection unit), and the geometric transformation unit (Perspective Transform/Rotation Transform unit).

First, image files used as input for personal information detection generally fall into two types: 1) a pure image file, or 2) a document file in which the image is embedded in a particular document as an OLE (Object Linking and Embedding) object. Among computer file types, the image file may also be a pure image file or a scanned file.

The sample image file shown in the drawing consists of three pages with one image per page; image_1 and image_2 are assumed to be images in non-standard forms, and image_3 an image in a standard form.

When the image file (File_name) is input to the image array unit 110, a connection identification number (linking address) subdivided into file, page, and image (Box Region) units is assigned so that every image unit processed later can be identified. The image array unit 110 builds into the connection identification number both location information, from which the position of the image can be determined, and transformation information, from which the transformations applied to the image can be determined.

This operation also separates the images in OLE documents. The drawing shows the file File_name, which contains images, being split into pages and its OLE objects separated, with connection identification numbers assigned. At this stage, however, the image units contained in each page have not yet been identified, so the connection identification number information contains only page-level information, such as File_name/page_1/image #??, File_name/page_2/image #??, and File_name/page_3/image #??.

Next, the image improvement unit (pre-processing unit) performs preprocessing page by page: it applies the image improvement algorithm trained through the CNN and uses a threshold value for each page to perform blur processing. This removes noise to make the subsequent extraction (detection) of personal information easier.

The template matching comparison unit (template matching index comparison unit) then performs form comparison using the matching comparison algorithm trained through a CNN (Template Recognition CNN) in order to shorten inspection time. To judge whether the page under inspection is a form the neural network has learned, this step recommends the template with the highest similarity (the highest index score). For a non-standard form, the similarity scores against all held templates are markedly low, so the standard/non-standard decision is made against a threshold. A page judged to be a standard form is passed to the binarization processing unit to shorten inspection time, while a page judged to be a non-standard form is passed, page by page, to the region detection unit (Box detection unit). In the drawing, page_3 is a standard form.

Next, the region detection unit (Box detection unit) uses the region detection algorithm trained through the CNN to extract three kinds of features: rectangle corners, lines of constant thickness, and points. To determine a region of interest (Text Box), the region detection unit 140 can assign a weight to each item based on these feature points even when the region is not a complete rectangle formed by line segments and points, and can recommend the area as a region of interest (Text Box) when it scores at or above a certain value. This process also applies to extracting four-sided shapes other than rectangles, such as parallelograms and trapezoids. Rather than a general contour-finding procedure, it is a method suited to ID cards, forms, certificates, and the like, whose regions of interest (text areas) can be regarded as known to some extent. That is, the judgment uses scores in which weights are assigned to the points where two lines meet (vertices), the constant thickness of an outline, the discontinuous boundary between the inside and outside of a region, and so on. Ideally four points are extracted per region of interest (Box), but a region consisting of three points and an edge line can still be assumed to be a single region of interest (Box).

The region detection unit 140 extracts and weights three kinds of features, namely corners of a rectangle, edge lines of constant thickness, and points, and gives the corner the largest weight of the three so that rectangles can be identified. In one embodiment, a region detection algorithm trained through a CNN to find regions of interest (Text Box) recommends only rectangles formed by two corners, three line segments, or four points, which determine the geometric region of interest (Text Box). To detect quadrilaterals including parallelograms and trapezoids, the pixel value matrix of the original image file is convolved with dx/dy (gradient derivative matrices), so that parallelograms and rhombuses can also be recommended and detected as regions of interest (Text Box).

The region detection unit 140 finally binarizes the recommended region-of-interest image (Box region), which makes the text detection performed in stage 2 easier.

Table 1 below shows an embodiment in which a region is judged to be a region of interest (Box) when the sum of the weights is equal to or greater than a certain value.

Table 1
Item              Detection count    Weight
Vertex (Point)    4                  5
Corner            2                  10
Edge Line         2                  2
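The weighted decision in Table 1 can be illustrated with a short sketch. Two points are assumptions rather than statements from the source: that each weight applies per detected feature, and the cut-off `min_score=15`, which was chosen only so that "three points and an edge line" still qualifies, as the description allows.

```python
# Weights from Table 1; treating them as per-feature weights is an assumption.
WEIGHTS = {"vertex": 5, "corner": 10, "edge_line": 2}


def box_score(counts, weights=WEIGHTS):
    """Sum of (number of detected features x weight) over feature types."""
    return sum(weights[name] * n for name, n in counts.items())


def is_text_box(counts, min_score=15):
    """Recommend a region of interest when the score reaches min_score.

    min_score is hypothetical; the source only says "a certain value".
    """
    return box_score(counts) >= min_score


print(box_score({"vertex": 4, "corner": 2, "edge_line": 2}))  # counts from Table 1
print(is_text_box({"vertex": 3, "edge_line": 1}))
print(is_text_box({"vertex": 1, "edge_line": 2}))
```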

Referring to FIG. 6, if the rectangle of the region of interest to be detected remains stable in the image without relatively large deformation (distortion or rotation), each of the three cases shown in FIG. 6, namely two corners, three line segments, or four vertices, is regarded as satisfying the minimum geometric condition for a rectangle and is recommended as a region of interest (Text Box). In addition, the shape and position of the corners and line segments (edges) in the image can be found by convolving the original image with gradient derivative matrices. To find these corners and edges, kernel filters in the dx and dy directions are each applied to the pixel values of the original image extracted from the image file.

Here, dx and dy are the gradient derivative matrices that calculate the rate of change of pixel values in the x and y directions, respectively. The kernel filter is the basic filter used in the convolution operation, and the convolution results for the dx and dy components can be used to find geometric configurations that satisfy the three conditions shown in FIG. 6. That is, if only the dx result or only the dy result responds, a line segment in the x-axis or y-axis direction exists over some interval; if there is an interval in which the dx and dy values change at the same time, that part can be identified as a corner.
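The dx/dy rule can be tried out on a toy image. The sketch below is only an illustration: the Sobel-style kernels `KX` and `KY` are an assumption (the source's own matrices are not reproduced in this text), and the helper names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Sobel-style derivative kernels (assumed; not taken from the source).
KX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
KY = KX.T


def convolve3x3(image, kernel):
    """'Valid' 2D convolution; output[i, j] is centred on image[i + 1, j + 1]."""
    windows = sliding_window_view(np.asarray(image, dtype=float), (3, 3))
    return np.einsum("ijkl,kl->ij", windows, kernel[::-1, ::-1])


def classify(gx, gy, eps=1e-9):
    """Apply the rule from the text: one response -> line, both -> corner.

    Note: a diagonal edge also responds in both directions; this sketch
    does not try to separate that case from a true corner.
    """
    has_x, has_y = abs(gx) > eps, abs(gy) > eps
    if has_x and has_y:
        return "corner"
    if has_x or has_y:
        return "line"
    return "flat"


image = np.zeros((8, 8))
image[2:6, 2:6] = 1.0            # a filled square standing in for a text box
gx, gy = convolve3x3(image, KX), convolve3x3(image, KY)

print(classify(gx[2, 0], gy[2, 0]))  # beside the left side of the square
print(classify(gx[0, 0], gy[0, 0]))  # beside the top-left vertex
print(classify(gx[3, 3], gy[3, 3]))  # inside the square
```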

When the image of the original file is written as a matrix "M" and the image transformed by applying the kernel filters (gradient derivative matrices) above is written as M`, the magnitude and the angle (theta) are calculated from M`.
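One standard way to write the magnitude and angle of an image gradient is given below. It is offered as an assumption about the intended expressions, not as the source's own formula; the symbols M'_x and M'_y (the results of convolving M with the dx and dy kernels) are introduced here for illustration.

```latex
\begin{aligned}
M'_x &= d_x * M, \qquad M'_y = d_y * M \\
\lvert M' \rvert &= \sqrt{(M'_x)^2 + (M'_y)^2} \\
\theta &= \operatorname{atan2}\left(M'_y,\; M'_x\right)
\end{aligned}
```

Under this reading, theta gives the local direction of each edge, so sides that are not axis-aligned can still be grouped by their direction.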

Applied in this way, the theta value has the advantage that parallelograms and rhombuses can be detected in addition to rectangles.

In the drawing, if page_1 and page_2 each contain one image, the atypical forms processed by the region detection unit 140 finally receive linking addresses such as File_name/ page_1 / image_#11 and File_name/ page_2 / image_#21. This information is passed directly to the image array unit 110, which updates the image array (Image-array) information.

Once the region detection unit 140 has extracted the features, regions of interest (Text Box) can be recommended (defined) from them, and binarization is performed page by page on each page containing the recommended regions of interest. When binarization is complete, the image area inside each region of interest is passed to the next task for personal-information text extraction. When a region of interest (Text Box) is extracted, the image file can be simplified by converting the pixels outside the region of interest to grayscale. In this way a contour is completed from the detected features, the region of interest (Text Box) is fixed, and all other areas are excluded entirely from the detection target.

The geometric transformation unit 150 performs geometric correction, in its viewpoint correction module and rotation correction module, on each region-of-interest image unit (Box region) extracted by the region detection algorithm.

FIG. 7 is an exemplary view of viewpoint correction and rotation correction in the preprocessing unit of the personal information processing system according to the present invention; it shows how the preprocessing unit 100 proceeds depending on the shape of the image. Referring to the drawing, an image is first classified as type A (tilted), type B (trapezoidal, which occurs when an exact 90-degree angle of incidence was not maintained during photographing or scanning), or type C (a normal quadrilateral with 90-degree corners). A type A image, once detected, is compensated by its tilt angle in the rotation correction module, rotated back to its original level position, and then passed to stage 2. A type B image, once detected, is transformed by the viewpoint correction module so that the tilted viewpoint becomes a perpendicular view, and is then passed to stage 2. A type C image is a normally detected region-of-interest image and goes directly to stage 2 without any additional transformation. As in the previous process, data in this process is subdivided by file, then page, then image unit, and each unit has a linking address; for types B and C, the corresponding mathematical transformation matrix is stored in the linking address information. The linking address is managed so that it can be used later for masking (Box masking) by the de-identification processing unit 400.

The following shows one embodiment of managing linking addresses. The position of each region-of-interest image unit (Box region) is recorded as the coordinates of its four corner points, and when the region of interest (Text Box) is rotated, the rotation is recorded in radians relative to the page angle. The linking addresses can then be written as File_name/page_1/image_#11/(x11,y11),(x12,y12),(x13,y13),(x14,y14)/(z1 rad) and File_name/page_2/image_#21/(x21,y21),(x22,y22),(x23,y23),(x24,y24)/(z2 rad).
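A linking address of this shape could be held in a small record type. The sketch below is a hypothetical illustration: the class name `LinkingAddress`, its field names, and the sample values are assumptions, and only the slash-separated layout follows the example above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LinkingAddress:
    """File -> page -> image unit, plus box corners and rotation (hypothetical type)."""
    file_name: str
    page: int
    image_id: int
    corners: List[Tuple[int, int]]  # four corner points of the Box region
    angle_rad: float                # rotation relative to the page, in radians

    def format(self) -> str:
        points = ",".join(f"({x},{y})" for x, y in self.corners)
        return (f"{self.file_name}/page_{self.page}/image_#{self.image_id}"
                f"/{points}/({self.angle_rad} rad)")


addr = LinkingAddress("scan.pdf", 1, 11, [(10, 20), (110, 20), (110, 60), (10, 60)], 0.1)
print(addr.format())
```

Keeping the corners and angle next to the file/page/image path means the later masking step can look up everything it needs from a single record.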

FIG. 7 shows an embodiment in which, when the template matching comparison unit (template matching index comparison unit) of FIG. 5 is absent, every page is treated as an atypical form and image preprocessing and personal information extraction proceed accordingly. For images of geometric types A, B, and C, the viewpoint correction module and rotation correction module record and manage, on a per-page basis, the transformation matrix applied to each region of interest (Text Box), and at masking time the transformation matrix is used to convert back to the original position before the operation is carried out.

After the preprocessing described above, the personal information extraction step follows, in which the personal information area extraction unit 200 uses an extraction algorithm trained through a CNN to extract processing target information, including personal information, from the region of interest detected by the preprocessing unit 100.

The personal information extraction step may comprise a text unit detection unit, a Text Box preprocessing unit, and an OCR processing unit, and an extraction algorithm trained through a CNN may be applied at each stage. For text region detection, the trained extraction algorithm can use an OCR engine; for fingerprint data, it can be trained on a large volume of data produced by an in-house data generation technique.

Referring to FIG. 8, the text unit detection unit 400 receives the region of interest (Text Box) information (linking address and image) produced by stage 1 and applies a CNN to the interior of every region of interest. Where image segments of a certain length and height alternate continuously with blank spaces, it recognizes them as a bulk of text images and judges that the image unit (Box Region) contains a text unit.

The Box preprocessing unit applies a CNN to carry out preprocessing after a text unit has been detected in a region of interest (Text Box) and before the OCR processing stage, to make text detection easier. For each region of interest in which a text unit was detected, it applies a threshold to handle blur and removes pixels that degrade quality.

The OCR processing unit detects character strings in the preprocessed region of interest (Text Box) using an extraction algorithm for text detection.

The text unit detection neural network 400 is a CNN responsible for first checking, among the recommended regions of interest (Text Box), whether a character string is present in each recommended box before the text inside it is extracted. The CNN looks for word units of a constant height, blanks between text units, and the like. Because at this point only geometric lines and a constant image height have been confirmed, the result is called a text unit rather than a detected character. That is, as shown in FIG. 8, region-of-interest (Text Box) images that have passed the geometric correction of stage 1 and are expected to contain text are recommended and passed to the text unit detection neural network 400. A given region of interest may or may not contain text. The network receives the Text Box information (linking address and image), and where image segments of a certain length and height alternate continuously with blanks in the region of interest, it recognizes the region as an image containing text and judges that it contains a text unit. In the drawing, of the three regions of interest (Text Box) recommended in stage 1, the text unit detection neural network 400 detected a text unit in only one, so only that Text Box was judged to contain a character string (here, the string " AB E ").
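The cue described above, ink segments of similar height alternating with blanks, can be illustrated without a neural network. The following sketch is a simple projection heuristic standing in for the CNN, not the patent's method; the function names and the `min_runs` and `min_height` values are assumptions.

```python
import numpy as np


def count_ink_runs(binary):
    """Count horizontal runs of columns that contain ink, separated by blank columns."""
    cols = np.asarray(binary, dtype=bool).any(axis=0)
    previous = np.concatenate(([False], cols))[:-1]
    return int(np.count_nonzero(~previous & cols))  # a run starts at blank -> ink


def has_text_unit(binary, min_runs=2, min_height=3):
    """Heuristic stand-in for the text unit check (thresholds are hypothetical)."""
    rows = np.asarray(binary, dtype=bool).any(axis=1)
    return count_ink_runs(binary) >= min_runs and int(rows.sum()) >= min_height


box = np.zeros((10, 30), dtype=np.uint8)
for start in (2, 9, 18):               # three glyph-like blobs with gaps between them
    box[3:8, start:start + 4] = 1

empty = np.zeros((10, 30), dtype=np.uint8)
rule = np.zeros((10, 30), dtype=np.uint8)
rule[5, :] = 1                          # a single ruled line, not text

print(has_text_unit(box), has_text_unit(empty), has_text_unit(rule))
```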

Then, for accurate character string detection, preprocessing is performed only on that region of interest (Text Box). The Box preprocessing unit uses an image enhancement algorithm trained through a CNN that can use a threshold to remove blurred parts of the image and unneeded pixels. The preprocessed Box image (the image containing the string "AB E") is then input to the OCR neural network for accurate text extraction. For the text unit, the OCR neural network (1) extracts the exact character string ("AB E") and (2) stores the position (coordinates) of the detected string in the linking address, and it manages the data for masking or for checking against the original image.

Next comes the pattern analysis step, in which the pattern analysis unit 300 compares the pattern of the personal information extracted by the personal information area extraction unit 200 against preset personal information patterns to extract the personal information in the processing target information.

Referring to FIG. 9, the character string detected by the OCR neural network (the string "AB E" in the drawing) goes through (1) a personal information pattern detection process and (2) a document form detection process. The document form detection process is run separately from personal information pattern detection for the following reason. Suppose, for example, that personal information pattern detection finds a 12-digit number (a resident registration number has 6 + 7 = 13 digits). A single digit missed because of a resolution problem can cause pattern detection to fail. Even so, if the document form can be identified as a resident registration card, it can be concluded that the 12 digits found in that region of interest (Box area) are a missed detection in which one digit that should have been present was not detected. The personal information pattern detection unit 310 and the document form detection unit 330 thus play complementary roles.
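The complementary decision can be sketched as a small rule. Everything named here is hypothetical: the function `interpret_digits`, the label `"resident_registration_card"`, and the returned strings are illustrative, and the form label is assumed to come from a separate form detector.

```python
import re


def interpret_digits(ocr_text, form_label=None):
    """Combine the digit pattern with the detected document form.

    ocr_text:   string returned by OCR for one box.
    form_label: output of a (hypothetical) document form detector, or None.
    """
    digits = re.sub(r"\D", "", ocr_text)   # keep digits only
    if len(digits) == 13:
        return "resident_number"
    if len(digits) == 12 and form_label == "resident_registration_card":
        # Pattern match failed, but the form says a 13-digit number belongs here.
        return "probable_resident_number_missing_digit"
    return "no_match"


print(interpret_digits("123456-1234563"))
print(interpret_digits("123456-123456", form_label="resident_registration_card"))
print(interpret_digits("123456-123456"))
```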

In the present invention, to make this document form extraction possible, the pattern analysis algorithm trained through a CNN (Template Recognition Convolution Neural Network) can use the matching comparison algorithm trained through the CNN of the template matching comparison unit 130 used in stage 1.

The personal information pattern analysis process (stage 3) performed by the pattern analysis unit 300 may consist of the personal information pattern detection unit 310 and a registered form detection unit (template matching index). The personal information pattern detection unit 310 applies personal information pattern formats to the character string extracted by the OCR processing unit in the text extraction process to determine whether it contains a resident registration number, driver's license number, foreigner registration number, or passport number. The registered form detection unit uses a pattern analysis algorithm trained through the CNN (Template Recognition CNN) that identifies specific document or certificate forms in the preprocessing stage (stage 1). When the personal information pattern detection unit 310 fails to extract personal information of a specific pattern, the form result is compared with the result of the personal information pattern detection unit 310 so that the presence of personal information can be inferred in a complementary way.

After the above, the de-identification processing step follows, in which the de-identification processing unit 400 de-identifies the personal information extracted by the pattern analysis unit 300.

In more detail, if personal information found after extraction is being used without permission or is still being stored after its retention period has expired, the parts that constitute personal information must be de-identified so that third parties cannot read them. De-identification can be divided into pseudonymization, aggregation, data deletion, data categorization, masking, and so on; the present invention describes the masking performed when ordinary personal information is detected.

An embodiment of the process that, at the end of the stage 3 personal information pattern extraction, checks the de-identification option for a personal information string found in a region of interest (Text Box) and de-identifies (masks) it is described with reference to FIG. 10.

Referring to the drawing: 1) when personal information is detected in stage 3, the related information (image data and position information) is passed to the masking processing unit 800. 2) The masking processing unit masks the image information of the detected personal information string in the Text Box according to the processing option. 3) The image array unit 110 is queried with the position information (linking address) of the image to be de-identified to check whether transformation information exists. 4) If it does, the transformation matrix information is used to convert the image back to its original position and the image file is saved, and 5) the linking address information is updated.

Fingerprints are also personal information that can identify an individual, so the processing target information may be personal information consisting of a fingerprint image or a face image. FIG. 11 shows mosaic processing or hiding applied when the processing target information is a fingerprint image.

As described above, in the personal information processing method using deep learning according to the present invention, when an image arrives (in any of various formats: png, jpg, tif, multipage tif, pdf, OLE such as images embedded in a docx, and so on), it is converted into an image format that can be processed.

For each image extracted above, the presence of a 'form' is checked first. Here a 'form' means a quadrilateral shape. Most ID cards, bankbook scans, passport scans, and specific document forms are quadrilateral, so to raise processing accuracy the quadrilateral is located and extracted separately. A quadrilateral here means any shape that has four corners.

The 'form' is extracted by a CNN (a Rectangular Corner Localization network). This rectangular corner localizer is built so that the quadrilateral can be extracted even when it is seen from various viewpoints, is rotated, or has one vertex out of view. Once a quadrilateral region has been found, a perspective transform is applied to extract that region as a separate flat image. The homography matrix used for the perspective transform can be reused later in post-processing, through its inverse, when producing masking or previews.
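Reusing the homography through its inverse can be shown with plain NumPy. This is a minimal sketch under stated assumptions: the matrix `H` holds made-up values standing in for a homography that maps original-image coordinates to the flattened image, and `apply_homography` is a hypothetical helper.

```python
import numpy as np


def apply_homography(H, points):
    """Map 2D points through a 3x3 homography using homogeneous coordinates."""
    pts = np.asarray(points, dtype=float)
    homogeneous = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homogeneous @ H.T
    return mapped[:, :2] / mapped[:, 2:3]   # divide by w to return to 2D


# Illustrative values only: original image -> flattened (perspective-corrected) image.
H = np.array([[1.2, 0.1, 5.0],
              [0.05, 0.9, -3.0],
              [0.001, 0.0005, 1.0]])

# A detected text box in the flattened image...
box_flat = [(10, 20), (110, 20), (110, 60), (10, 60)]
# ...is mapped back to the original image with the inverse homography,
# which is where the mask or preview outline would be drawn.
box_original = apply_homography(np.linalg.inv(H), box_flat)
print(np.round(box_original, 2))
```

Storing `H` once per region, rather than the warped coordinates, keeps the original image as the single place where masking is applied.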

Next, the rotation of both the image extracted from the quadrilateral region and the full image is reduced by a rotation correction neural network. For text region detection, the rotation correction network could be trained so that the result is fully rotation-invariant, but it can instead be trained to standardize the image so that some rotation is tolerated and excessive rotation is prevented.

Because an image extracted from a quadrilateral region may be wrongly rotated by 90, 180, or 270 degrees, a neural network that rotates it to the upright orientation is useful.

This network simply takes an image as input and predicts the angle by which it is rotated counterclockwise from the horizontal. The 'text regions' are then detected in the rotation-reduced image extracted from the quadrilateral region and in the original full image. Text region detection can be deployed in two ways. For products that can run on a server (PV, server scan, and the like), a heavy neural network is provided as an API and the network is run by calling that API. For solutions installed locally, such as PC information security products, a network with a lightweight backbone is embedded directly in the solution and run there.

Text region detection is performed by a neural network that understands the concept of individual characters: it localizes characters and also estimates the probability that two characters belong to the same word, thereby detecting both characters and words. Detection can additionally be set to work line by line, but word-unit detection is more convenient for tasks such as preview and post-processing. The coordinates of the extracted text regions are reused in post-processing for preview, masking, and so on.

Each extracted text region is then treated as a separate image, and the 'text' is extracted from it by an OCR (Text Recognition) neural network. Specifically, neural-network-based Text Recognition is performed on each Text Region. Beforehand, preprocessing reduces noise and separates the text from the background to make it stand out; this can involve morphological image filters, image binarization, and similar operations.

After that, personal information patterns are detected by pattern matching with the document filter in the common filter, and natural language processing over all the 'text' found in a quadrilateral or image is used to determine the nature of that quadrilateral or image.

For example, suppose the word '주민등록' (resident registration) is detected and one of the boxes below it yields 12 digits, which do not match the resident registration number pattern of "6 digits-7 digits" plus checksum. It is then likely that OCR missed or misread a digit and failed to extract all 13, and the words 'resident registration' together with the position of the number indicate that these 12 digits are probably a resident registration number. Entity recognition of this kind can catch additional patterns that the document filter misses because of OCR errors. For ID cards it can also identify items such as 'name' and 'address', and many other applications are possible.
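The "6 digits-7 digits plus checksum" test can be sketched as below. The source does not state the checksum rule; the code assumes the commonly published weighting scheme for resident registration numbers (weights 2 to 9, then 2 to 5, with check digit (11 - sum mod 11) mod 10), and the sample number is synthetic.

```python
import re

# Assumed weights for the first 12 digits (commonly published scheme).
RRN_WEIGHTS = (2, 3, 4, 5, 6, 7, 8, 9, 2, 3, 4, 5)


def rrn_check_digit(first12):
    """Check digit computed from the first 12 digits under the assumed scheme."""
    total = sum(int(d) * w for d, w in zip(first12, RRN_WEIGHTS))
    return (11 - total % 11) % 10


def matches_rrn(text):
    """True when text is 6 digits, optional hyphen, 7 digits, and the checksum holds."""
    m = re.fullmatch(r"(\d{6})-?(\d{7})", text.strip())
    if not m:
        return False
    digits = m.group(1) + m.group(2)
    return rrn_check_digit(digits[:12]) == int(digits[12])


# Synthetic example: build a number whose last digit satisfies the checksum.
sample = "123456-123456" + str(rrn_check_digit("123456123456"))
print(sample, matches_rrn(sample))
print(matches_rrn("123456-123456"))   # only 12 digits: the pattern filter rejects it
```

A 12-digit string fails this filter outright, which is exactly the case the entity recognition step described above is meant to recover.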

The extracted personal information patterns can then be post-processed in various ways. For personal information masking, the coordinates from text region detection and the homography from viewpoint correction can be used to localize the pattern in the original image, and painting over that area so that it is no longer visible masks it. For personal information preview, the detection results are prepared for preview and the quadrilateral area enclosing each pattern is highlighted so that the user can see it easily. Specific keyword detection determines whether particular keywords are present in the image. Specific form detection uses entity recognition to learn concepts such as which combinations of text indicate a resident registration card, a bank document, or a passport, so that it can check whether an image has a specific form.

The location of any fingerprint is then also determined. Because a fingerprint can also serve as personal information, a fingerprint detection neural network is used to locate it.

Meanwhile, new images can be generated by dividing the image material into three compartments (foreground, background, and traps) and placing foreground and trap images on a background with varied rotation angle, position, size, color intensity, brightness, and so on. In particular, because most fingerprint data encountered in real life is the fingerprint on the back of a resident registration card, various rolled fingerprints can be generated automatically in the fingerprint area of the card to produce card backs with various kinds of noise, which are then added to the background.
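The compositing idea above can be sketched as follows, presumably for producing labeled data for the fingerprint detector. This is a simplified illustration under stated assumptions: rotation is limited to 90-degree steps, only brightness and position are varied, patches must fit inside the background, and the function names and the (x, y, width, height) box format are hypothetical.

```python
import numpy as np


def compose_sample(background: np.ndarray, patch: np.ndarray, rng: np.random.Generator):
    """Paste a randomly transformed patch onto a copy of the background.

    Returns the new image and the (x, y, width, height) box of the pasted patch.
    """
    patch = np.rot90(patch, k=int(rng.integers(0, 4)))  # rotation in 90-degree steps only
    gain = rng.uniform(0.6, 1.2)                        # random brightness change
    patch = np.clip(patch.astype(float) * gain, 0, 255).astype(np.uint8)
    ph, pw = patch.shape[:2]
    bh, bw = background.shape[:2]
    y = int(rng.integers(0, bh - ph + 1))               # random position inside the background
    x = int(rng.integers(0, bw - pw + 1))
    out = background.copy()
    out[y:y + ph, x:x + pw] = patch
    return out, (x, y, pw, ph)


def make_image(background, foregrounds, traps, rng):
    """Build one synthetic image; only foreground patches receive a label box."""
    image, boxes = background, []
    for trap in traps:
        image, _ = compose_sample(image, trap, rng)     # traps are distractors, no label
    for foreground in foregrounds:
        image, box = compose_sample(image, foreground, rng)
        boxes.append(box)
    return image, boxes
```

Traps are pasted first so that they cannot overwrite a labeled foreground; overlapping foregrounds are not prevented in this sketch.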

Although the present invention has been described with reference to the embodiments shown in the drawings, these are merely exemplary, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible. Accordingly, the true technical scope of protection of the present invention should be determined by the technical spirit of the appended claims.

100: preprocessing unit
110: image array unit
120: image improvement unit
130: template matching comparison unit
140: area detection unit
150: geometric transformation unit
200: personal information area extraction unit
300: pattern analysis unit
400: de-identification processing unit

Claims

A personal information processing method using AI deep learning, comprising:
a preprocessing step in which a preprocessing unit extracts an image file from a target file, detects a region of interest determined to contain personal information in the extracted image file, and performs processing to improve the image recognition rate for the region of interest;
a personal information extraction step in which a personal information area extraction unit uses a CNN neural network to extract processing target information containing personal information from the region of interest detected by the preprocessing unit;
a pattern analysis step in which a pattern analysis unit compares the pattern of the personal information extracted by the personal information area extraction unit with a preset personal information pattern to extract the personal information from the processing target information; and
a de-identification processing step in which a de-identification processing unit de-identifies the personal information extracted by the pattern analysis unit,
wherein the preprocessing step comprises:
an image array step in which an image array unit assigns a connection identification number to each image file extracted from the target file;
an image improvement step in which an image improvement unit removes quality-degrading factors from the image file;
a step in which a template matching comparison unit compares the image file with preset templates, which consist of template groups of standardized forms containing personal information, including a driver's license, an identification card, or an alien registration certificate, and selects the image according to the likelihood that it contains personal information;
a step in which an area detection unit detects the region of interest in the image file selected by the template matching comparison unit, using a region detection algorithm in which formats containing personal information have been learned by an artificial neural network; and
a step in which a geometric transformation unit corrects the perspective and rotation of the region-of-interest image.

(deleted)

The method of claim 1, wherein
the area detection unit, when the image file is classified as a standardized format by the template matching comparison unit, extracts the contour of the image file using an image morphological filter so as to emphasize the image features of the region of interest, and
the geometric transformation unit corrects the image file from which the contour has been extracted.

The method of claim 1, wherein
the region detection algorithm detects the region of interest on the basis of feature points including vertices, corners, or lines contained in the image file, and is configured to use a Rectangular Corner Localization (RCL) neural network to find the optimal ratio of the decision weights for the feature points.

The method of claim 1, wherein
the image improvement unit removes quality-degrading factors, including noise, blur, or occlusion, from the image file using an image improvement algorithm learned through a CNN neural network.

The method of claim 1, wherein
the template matching comparison unit classifies the image file as a standardized format or a non-standardized format according to a template matching index between the image file and the templates, and the matching comparison algorithm based on the format matching index is learned through an artificial neural network (CNN, convolutional neural network).

The method of claim 1, wherein
the processing target information is personal information consisting of text, and
the personal information area extraction unit is configured to detect text units containing text in the region of interest using an extraction algorithm learned through a CNN neural network and to extract a character string from each text unit, wherein a region in which images and spaces of a certain length and height continue within the region of interest is detected as a text unit.

The method of claim 1, wherein
the processing target information is personal information consisting of a fingerprint image or a facial image, and
the personal information area extraction unit is configured to detect an image pattern in the region of interest using an extraction algorithm learned through a CNN neural network and to extract a fingerprint image or facial image from the image pattern, wherein a pattern region having a fingerprint pattern or facial pattern connected by feature points within the region of interest is detected as the image pattern.

The method of claim 7, wherein
the pattern analysis unit comprises:
a personal information pattern analysis unit that compares the text with preset personal information patterns, including a resident registration number, driver's license number, passport number, alien registration number, and fingerprint, using a pattern analysis algorithm learned through a CNN neural network; and
a document form analysis unit that compares the text with preset document forms, including a resident registration card, driver's license, passport, residence certificate, and alien registration certificate.

The method of claim 7, wherein
the de-identification processing unit is configured to process the text by one or more of mosaic processing, hiding, encryption, pseudonymization, aggregation, data deletion, data categorization, and masking.