KR20180080408A

KR20180080408A - Structured data and unstructured data extraction system and method

Info

Publication number: KR20180080408A
Application number: KR1020170000679A
Authority: KR
Inventors: 유경찬; 박승훈
Original assignee: 주식회사 페이스시스템
Priority date: 2017-01-03
Filing date: 2017-01-03
Publication date: 2018-07-12
Also published as: KR101942468B1

Abstract

Disclosed are a system and a method for extracting structured and unstructured data. According to an embodiment of the present invention, the system includes: a data imaging module that generates image data by imaging financial data that requires manual (handwritten) input; and a data extraction module that extracts full-text data from the image data, classifies the template, extracts template data based on the classified template, and converts the extracted template data into verification data. Accordingly, the present invention can improve accuracy while maintaining work productivity.

Description

Structured data and unstructured data extraction system and method

The present invention relates to a system and a method for extracting structured and unstructured data.

In general, asset management solution companies rely on staff to locate the information corresponding to the base price in periodically updated financial data (for example, operation instruction documents) and to register it as manually entered operation instruction data.

However, financial data come in many varied, unstructured templates, so input errors occur frequently, and because a vast volume of market instructions is provided as PDF files, entries are likely to be omitted. Such human errors raise liability insurance premiums and erode customer trust.

In addition, preventing such incidents structurally requires a second or even third round of eye checking by each department, which lowers work efficiency. Measures to compensate for these human errors are therefore needed.

Korean Patent Laid-Open Publication No. 10-2008-0066729 (published July 16, 2008) - Online data verification of list data

An object of the present invention is to provide a system and a method for extracting structured and unstructured data that detect data with high accuracy from manual-input source data, which consists of various structured and unstructured content, in order to verify the consistency of manually entered data.

Other objects of the present invention will become clearer through the preferred embodiments described below.

According to one aspect of the present invention, there is provided a structured and unstructured data extraction system including: a data imaging module that generates image data by imaging financial data that requires manual input; and a data extraction module that extracts full-text data from the image data, classifies the template, extracts template data based on the classified template, and converts the template data into verification data.

The data extraction module may include: a full-text data extraction unit that extracts full-text data corresponding to text from the image data using optical character recognition; a template classification unit that, when all of the one or more template keywords registered in a predefined template classification keyword table are contained in the full-text data, classifies the full-text data as belonging to the template corresponding to that table; a template data extraction unit that extracts template data from the full-text data based on a data extraction definition table set for the classified template; and a data parsing unit that parses and refines the template data to generate the verification data.

The system may further include a template registration module that registers a template for each type of financial data, the template registration module including: a template classification keyword collection unit that collects keywords for classifying templates and tabulates them to generate the template classification keyword table; and a data extraction definition unit that generates a data extraction definition table setting a reference keyword and an n-th data extraction range (where n is a natural number of 1 or more) for extracting the template data from the classified table.

The n-th data extraction range may be a data extraction rule under which, for the full-text data divided into rows and columns, the search position starts at the reference keyword, n1 lines are skipped, the position is moved by n2 words, and the word at the resulting position is extracted as the template data.

The template data extraction unit may extract the template data by a text processing method, which extracts data by moving the position through line skips and word moves relative to a reference keyword. In the text processing method, T() denotes the scheme of extracting data by moving by lines and word positions relative to the keyword, and is configured with a start-position keyword value, an end-position keyword value, a skip keyword, and an extraction reference keyword; L(n) denotes a line-unit skip; W(n) denotes a word-unit move; C(n), N(n), and N(n).F(n) denote attribute information of the extracted data; X(n) denotes whether a single item or multiple items are extracted; V(n) denotes whether valid-data extraction processing is applied and the number of items to extract; and S(n) denotes the processing option used when the same modifier appears one or more times in the same file.

Alternatively, the template data extraction unit may extract the template data by an extended processing method, in which, once the first keyword has been satisfied and the second keyword (or the second through n-th keywords) is then satisfied, data is extracted by moving the position through line skips and word moves from that reference. In the extended processing method, E() denotes the scheme of moving to the configured position and extracting data when the second keyword is satisfied after the first, and is configured with a start-position keyword value, an end-position keyword value, and a skip keyword; L(n) denotes a line-unit skip; W(n) denotes a word-unit move; and C(n), N(n), and N(n).F(n) denote attribute information of the extracted data.
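The extended processing method can be illustrated with a minimal Python sketch. It assumes the full-text data is plain text split into lines and whitespace-separated words, that keywords must match whole words, and that each keyword is searched for only after the position of the previous match. The function name `extract_extended`, its parameters, and the sample text are hypothetical and do not come from the specification.

```python
from typing import List, Optional


def extract_extended(full_text: str, keywords: List[str],
                     skip_lines: int, move_words: int) -> Optional[str]:
    """Match keywords in order, then skip lines and move words from the last match."""
    rows = [line.split() for line in full_text.splitlines()]
    row, col = 0, -1  # search starts before the first word of the first line
    for keyword in keywords:
        found = None
        for r in range(row, len(rows)):
            # On the line of the previous match, resume after that match.
            start = col + 1 if r == row else 0
            for c in range(start, len(rows[r])):
                if rows[r][c] == keyword:
                    found = (r, c)
                    break
            if found is not None:
                break
        if found is None:
            return None  # a keyword in the chain was not satisfied
        row, col = found
    target_row, target_col = row + skip_lines, col + move_words
    if target_row < len(rows) and 0 <= target_col < len(rows[target_row]):
        return rows[target_row][target_col]
    return None


# Hypothetical sample: the word "종가" appears under two different markets.
sample = "KOSPI 종가 2500\nKOSDAQ 시장\n종가 기준가\n800 1012.5"
value = extract_extended(sample, ["KOSDAQ", "종가"], skip_lines=1, move_words=1)
```

Chaining keywords in this way lets a rule target a repeated label (here the second "종가") that a single reference keyword could not distinguish.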

The financial data may include one or more of manual operation instruction data received by e-mail or messenger, a Bloomberg terminal screen, and market instruction data in the form of large PDF files. The data imaging module may image the manual operation instruction data and the market instruction data using a print-and-scan P2I engine, and may image the Bloomberg terminal screen using a screen-capture S2I engine.

According to another aspect of the present invention, there is provided a structured and unstructured data extraction method including: generating image data by imaging financial data that requires manual input; extracting full-text data from the image data; classifying a template corresponding to the full-text data; extracting template data based on the classified template; and converting the template data into verification data.

In classifying the template, when all of the one or more template keywords registered in a predefined template classification keyword table are contained in the full-text data, the full-text data may be classified as belonging to the template corresponding to that template classification keyword table.

In extracting the template data, the template data may be extracted from the full-text data based on a data extraction definition table set for the classified template.

The method may be preceded by: collecting keywords for classifying templates and tabulating them to generate the template classification keyword table; and generating a data extraction definition table that sets a reference keyword and an n-th data extraction range (where n is a natural number of 1 or more) for extracting the template data from the classified table.

In extracting the template data, the template data may be extracted either by the text processing method, which extracts data by moving the position through line skips and word moves relative to a reference keyword, or by the extended processing method, which extracts data by moving the position through line skips and word moves once the first keyword has been satisfied and the second keyword (or the second through n-th keywords) is then satisfied.

Other aspects, features, and advantages beyond those described above will become apparent from the following drawings, claims, and detailed description of the invention.

According to an embodiment of the present invention, data is detected with high accuracy from manual-input source data consisting of various structured and unstructured content, and the consistency between the detected data and the manually entered data is verified to compensate for human error, which maintains work productivity while improving accuracy.

In addition, the strengthened data verification prevents input errors in manual operation instructions and substantially reduces base price errors; fewer base price errors lower professional liability insurance premiums; verifiers need only cross-check the error data in the verification report, which reduces routine verification work; and the resulting productivity gain allows greater focus on other tasks.

FIG. 1 is a schematic block diagram of a manual input data consistency verification system according to an embodiment of the present invention.
FIG. 2 is a block diagram of the data extraction module.
FIG. 3 is a block diagram of the consistency verification module.
FIG. 4 is a block diagram of the template registration module.
FIG. 5 shows an example of generating a template classification keyword table.
FIG. 6 shows an example of generating a template data extraction definition table.
FIG. 7 is a flowchart of a manual input data consistency verification method.
FIG. 8 is a flowchart showing the process of extracting template data in the data extraction module.
FIG. 9 is a table showing the types of data extraction methods.
FIG. 10 is a definition table for the text processing method.
FIG. 11 shows an example of the text processing method.
FIG. 12 is a definition table for the extended processing method.
FIG. 13 shows an example of the extended processing method.
FIGS. 14 to 16 show application examples of template data extraction.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the present invention to particular embodiments, and the invention should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope.

When an element is referred to as being "connected" or "coupled" to another element, it may be directly connected or coupled to that other element, but it should be understood that other elements may be present in between. In contrast, when an element is referred to as being "directly connected" or "directly coupled" to another element, it should be understood that no other elements are present in between.

Terms such as first and second may be used to describe various elements, but the elements should not be limited by these terms. The terms are used only to distinguish one element from another.

The terminology used in this specification is used only to describe particular embodiments and is not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "include" or "have" are intended to specify the presence of the features, numbers, steps, operations, elements, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, elements, parts, or combinations thereof.

In addition, the elements of an embodiment described with reference to a given drawing are not limited to that embodiment; they may be included in other embodiments within the scope in which the technical idea of the present invention is maintained, and, even where a separate description is omitted, multiple embodiments may of course be implemented again as a single integrated embodiment.

In the description with reference to the accompanying drawings, identical elements are given identical or related reference numerals regardless of the drawing number, and duplicate descriptions of them are omitted. In describing the present invention, a detailed description of related known technology is omitted where it is judged that it would unnecessarily obscure the gist of the invention.

FIG. 1 is a schematic block diagram of a manual input data consistency verification system according to an embodiment of the present invention, FIG. 2 is a block diagram of the data extraction module, FIG. 3 is a block diagram of the consistency verification module, FIG. 4 is a block diagram of the template registration module, FIG. 5 is an example of generating a template classification keyword table, FIG. 6 is an example of generating a template data extraction definition table, and FIG. 7 is a flowchart of a manual input data consistency verification method.

The manual input data consistency verification system 1 according to an embodiment of the present invention automatically detects and refines data from various types of manual-input source data and compares the result with the manually entered data to verify consistency, thereby preventing human error arising from manual input.

Referring to FIG. 1, the manual input data consistency verification system 1 according to this embodiment includes a structured and unstructured data extraction system that includes a data imaging module 10 and a data extraction module 20. The structured and unstructured data extraction system may further include a template registration module 40. Depending on the embodiment, the manual input data consistency verification system 1 may also include a consistency verification module 30.

The data imaging module 10 converts financial data that requires manual input into images.

The financial data may be one of manual operation instruction data received by e-mail or messenger, data on a Bloomberg terminal screen, and market instruction data, for which multiple templates and data are extracted from a large PDF file. The manual operation instruction data may be in an office document type such as a PDF or EXCEL file, may consist of reporting tool output, XML, or e-mail body text, or may be file-format data with various templates such as bonds, derivatives, ELS/ELW, cash equivalents, beneficiary certificates, and stocks. The data on the Bloomberg terminal screen may be various data that the Bloomberg terminal displays on screen, such as stock beneficiary certificates, bonds, and futures. The market instruction data is in the large PDF file format published by the stock exchange; multiple templates may be extracted from a single file, and various forms such as KOSPI, KOSDAQ, and KONEX may be registered.

The data imaging module 10 may include a P2I engine that prints and scans the manual operation instruction data (P2I, Print to Image). Alternatively, the data imaging module 10 may include an S2I engine that captures the Bloomberg terminal screen (S2I, Screen to Image). Scanning printed matter with the P2I engine and capturing a screen with the S2I engine are well known to those skilled in the art, so a detailed description is omitted.

The data extraction module 20 extracts only the characters (text) from the data imaged by the data imaging module 10 through full-text recognition and stores them in a database. It then parses and refines the data according to a predetermined data processing algorithm and converts it into verification data for consistency verification.

Referring to FIG. 2, the data extraction module 20 includes a full-text data extraction unit 22, a template classification unit 24, a template data extraction unit 26, and a data parsing unit 28.

The full-text data extraction unit 22 extracts full-text data from the data imaged by the data imaging module 10 (the image data). The full-text data is generated by recognizing every portion of the image data that corresponds to characters (text) using optical character recognition (OCR).

The extracted full-text data may be stored in a separate database. The full-text data may be stored in a spreadsheet form consisting of multiple rows and columns.

The template classification unit 24 analyzes the extracted full-text data and classifies the template to which the document belongs. Template classification uses the template classification keyword table 43 predefined by the template registration module 40. When all of the one or more template keywords registered in the template classification keyword table 43 for a particular template are contained in the full-text data extracted by the full-text data extraction unit 22, that full-text data may be classified as belonging to that template.
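The all-keywords-present rule can be sketched in a few lines of Python. This is a minimal illustration, assuming the full-text data is a single string and that keyword matching is a plain substring test; the table contents, the template names, and the function `classify_template` are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical keyword table: template name -> keywords that must all appear.
TEMPLATE_KEYWORDS: Dict[str, List[str]] = {
    "els_elw_price_notice": ["평가가격 통보서", "증권의 명칭", "기초자산가격"],
    "bond_instruction": ["채권", "만기일"],
}


def classify_template(full_text: str,
                      table: Dict[str, List[str]]) -> Optional[str]:
    """Return the first template whose registered keywords all occur in full_text."""
    for name, keywords in table.items():
        # An empty keyword list never matches, so it cannot classify everything.
        if keywords and all(keyword in full_text for keyword in keywords):
            return name
    return None


# Hypothetical OCR output for one document.
ocr_text = "주가연계증권/주식워런트증권 평가가격 통보서\n증권의 명칭 ABC\n기초자산가격의 변동"
template = classify_template(ocr_text, TEMPLATE_KEYWORDS)
```

Returning `None` when no template matches leaves the decision about unclassified documents to the caller; the specification does not state how such documents are handled.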

The template data extraction unit 26 extracts, from the full-text data, the template data that needs to be parsed and refined, based on the data extraction definition table 45 set for the classified template. The template data to be extracted may be defined by the template registration module 40 and may be specified so that it can be processed by the text processing method or the extended processing method described later.

The data parsing unit 28 generates verification data by parsing the template data extracted by the template data extraction unit 26. The verification data may be refined so that it can later be used to verify consistency with the manually entered data. The verification data may be tabulated in a spreadsheet form with rows and columns.

The consistency verification module 30 verifies data consistency by comparing the verification data with the data that was manually entered for the financial data requiring manual input. Data consistency here means agreement between the automatically extracted verification data and the manually entered data.

The consistency verification module 30 described in this embodiment is only one example; a person of ordinary skill in the art to which the present invention pertains could also verify data consistency by comparing the manually entered data with the verification data using a database matching technique.

The result of the data consistency verification may be compiled into a report and provided, and on the basis of that result, manual registration errors can be prevented and the reliability of the data improved.

Referring to FIG. 3, the consistency verification module 30 according to this embodiment includes a data receiving unit 32, a consistency verification unit 34, and a reporting unit 36.

The data receiving unit 32 receives the verification data generated by the data extraction module 20. The data receiving unit 32 also receives the manually entered data to be verified.

The consistency verification unit 34 compares the verification data with the manually entered data and determines whether they agree.

In this case, the data imaging process and the data extraction process described above may of course also be applied to the manually entered data. The template registration module 40 may separately register a template for the manually entered data and indicate where the data to be verified is located, so that only the required character data is parsed and refined. The manually entered data is thereby also tabulated in a spreadsheet form with rows and columns, and can be compared with the verification table row by row and column by column.

The verification result from the consistency verification unit 34 may be output by the reporting unit 36 so that the user can review it. The reporting unit 36 may display the verification result on a screen or print it using a printer.

When the verification result is reported, matching and mismatching data are displayed distinctly so that the user can easily identify the data that does not match. To this end, mismatching data may be shown with one or more of a different font, color, or size, or with a highlight, underline, or similar marking. Alternatively, only the mismatching data may be extracted separately and displayed together.
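The row-and-column comparison and the marking of mismatches can be sketched as follows. This is a minimal illustration, assuming both data sets are already tabulated as lists of rows of strings and that a cell matches when the two values are equal after trimming whitespace. The names `find_mismatches` and `render_report`, the asterisk marking, and the sample values are hypothetical.

```python
from typing import List, Tuple

Table = List[List[str]]


def find_mismatches(verification: Table,
                    manual: Table) -> List[Tuple[int, int, str, str]]:
    """Compare two tables cell by cell; return (row, col, extracted, manual) per mismatch."""
    mismatches = []
    for r in range(max(len(verification), len(manual))):
        v_row = verification[r] if r < len(verification) else []
        m_row = manual[r] if r < len(manual) else []
        for c in range(max(len(v_row), len(m_row))):
            v = v_row[c] if c < len(v_row) else ""  # a missing cell counts as empty
            m = m_row[c] if c < len(m_row) else ""
            if v.strip() != m.strip():
                mismatches.append((r, c, v, m))
    return mismatches


def render_report(manual: Table,
                  mismatches: List[Tuple[int, int, str, str]]) -> str:
    """Render the manual table as text, wrapping mismatching cells in asterisks."""
    flagged = {(r, c) for r, c, _, _ in mismatches}
    lines = []
    for r, row in enumerate(manual):
        cells = [f"**{cell}**" if (r, c) in flagged else cell
                 for c, cell in enumerate(row)]
        lines.append(" ".join(cells))
    return "\n".join(lines)


# Hypothetical data: the second price was mistyped during manual entry.
verification_data = [["D001", "9850.5"], ["D002", "10120.0"]]
manual_data = [["D001", "9850.5"], ["D002", "10210.0"]]
diffs = find_mismatches(verification_data, manual_data)
report = render_report(manual_data, diffs)
```

The list returned by `find_mismatches` corresponds to displaying only the mismatching data together, while `render_report` corresponds to marking mismatches within the full table.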

The verifier need only cross-check the error data in the verification report, which reduces the amount of routine verification work.

When the data extraction module 20 extracts characters from the imaged data and then selects only the necessary characters for conversion into verification data, the data required differs according to the template layout of each document. The template registration module 40 therefore registers a template for each type of financial data so that templates can be distinguished and only the correct data selected and converted.

Each registered template has template classification keywords that distinguish it from other templates, and the data to be extracted for that template may be defined according to specific rules.

The template registration module 40 can register various unstructured templates and can manage versions for each template. It can also prevent duplicate registration when a template is registered.
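Per-template versioning with duplicate prevention could look like the following sketch. The specification does not say how duplicates are detected or how versions are numbered; this example assumes that a registration is a duplicate when its keyword set equals the latest keyword set of any registered template, and that versions are numbered from 1. The class `TemplateRegistry` and its methods are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List


class TemplateRegistry:
    """Hypothetical registry that keeps a version history of keyword sets per template."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[FrozenSet[str]]] = {}

    def register(self, name: str, keywords: Iterable[str]) -> int:
        """Register a new version of a template and return its version number."""
        key = frozenset(keywords)
        # Assumed duplicate rule: same keyword set as any template's latest version.
        for other, versions in self._versions.items():
            if versions[-1] == key:
                raise ValueError(f"duplicate of template '{other}'")
        self._versions.setdefault(name, []).append(key)
        return len(self._versions[name])

    def latest(self, name: str) -> FrozenSet[str]:
        """Return the keyword set of the most recent version of a template."""
        return self._versions[name][-1]


registry = TemplateRegistry()
v1 = registry.register("els_notice", ["평가가격 통보서", "증권의 명칭"])
v2 = registry.register("els_notice", ["평가가격 통보서", "기초자산가격"])
```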

Detailed definitions can be made for extracting data in various formats, and code and character substitution is possible for each extraction target item. Detailed settings for applying formulas to each extraction target item are also possible, as are per-item detailed definitions for checking data consistency against the core business system.

Referring to FIG. 4, the template registration module 40 includes a template classification keyword collection unit 42 and a data extraction definition unit 44.

The template classification keyword collection unit 42 analyzes the full-text data extracted by the full-text data extraction unit 22, collects keywords for classifying the template corresponding to the document, and generates the template classification keyword table 43.

For a handwritten operation instruction sheet 50 such as the one illustrated in FIG. 5, the keywords that distinguish it from other financial documents can be tabulated as template classification keywords and stored in a database. For example, for the handwritten operation instruction sheet 50 of FIG. 5, words or phrases used only in that financial document, such as "① 주가연계증권/주식워런트증권 평가가격 통보서 (equity-linked securities / equity warrant securities valuation price notice), ② 삼성증권 주식운용팀 (Samsung Securities equity management team), ③ 증권의 명칭 (name of the security), ④ 기초자산가격의 (of the underlying asset price), ⑤ 중도매매가격과도 (also with the interim trading price)", can be defined as template classification keywords and registered in the template classification keyword table 43 for that template.
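Classification against the keyword table can be sketched as below. This is a minimal illustration, assuming plain substring matching and the "at least one registered keyword is present" rule stated in the claims; the table layout, the template id `els_valuation_notice`, and the function name are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical template classification keyword table: template id -> keywords.
# The keywords are two of the phrases listed for the sheet of FIG. 5.
KEYWORD_TABLE: Dict[str, List[str]] = {
    "els_valuation_notice": [
        "주가연계증권/주식워런트증권 평가가격 통보서",
        "삼성증권 주식운용팀",
    ],
}


def classify_template(full_text: str,
                      table: Dict[str, List[str]]) -> Optional[str]:
    """Return the first template with at least one keyword found in the text."""
    for template_id, keywords in table.items():
        if any(keyword in full_text for keyword in keywords):
            return template_id
    return None  # no registered template matches this document
```

A document that contains none of the registered phrases yields `None`, which a caller could treat as an unregistered template.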

The data extraction definition unit 44 sets, for a classified template, a reference keyword from which data is to be extracted and an n-th data extraction range (n being a natural number of 1 or more), tabulates these as data extraction definition rules, and may store them in the database. For example, for the handwritten operation instruction sheet 50 illustrated in FIG. 6, '예탁딜코드' ("deposit deal code") is set as the reference keyword for data extraction. The first extraction data range is the character data from one line below the reference keyword through the last line. The second extraction data range is the character data obtained by skipping one line down from the reference keyword, moving three words to the right, and continuing through the last line.

FIG. 8 is a flowchart showing the process by which the data extraction module extracts template data, FIG. 9 is a table showing the types of data extraction methods, FIG. 10 is a definition table for the text processing method, FIG. 11 is an example of the text processing method, FIG. 12 is a definition table for the extension processing method, and FIG. 13 is an example of the extension processing method.

After template classification by the template classification unit 24 is complete, the template data extraction unit 26 extracts template data from the full-text data. To do so, it performs the following process.

First, based on the data extraction definition table 45 defined for the template registered by the template registration module 40, the unit determines where the reference keyword is located in the full-text data (step S105).

When the reference keyword is found, its position is set as the base position of the search position (step S110).

Based on the first extraction data range defined in the data extraction definition table 45, n1 lines are skipped from the search position (the base position) (step S115). A skip means moving downward by n1 lines, where n1 is an integer of 0 or more.

If necessary, the position may then be moved by n2 words (step S120). A position move means moving to the right by n2 words, where n2 is an integer of 0 or more.

When the line skip and the word move are complete, the word at the resulting position is extracted (step S125). The extracted word becomes the first extracted data of the template data for that template.

If the data extraction definition table 45 defines a second or further extraction data range, steps S115 to S125 are repeated for each range.
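Steps S105 to S125 can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: words are taken to be whitespace-separated tokens, the word offset n2 is counted from the keyword's own word index, and the function names and the sample text used with them are hypothetical.

```python
from typing import List, Optional, Tuple


def find_keyword(lines: List[List[str]],
                 keyword: str) -> Optional[Tuple[int, int]]:
    """S105: locate the reference keyword; return (line index, word index)."""
    for row, words in enumerate(lines):
        if keyword in words:
            return row, words.index(keyword)
    return None


def extract_template_data(full_text: str, keyword: str,
                          ranges: List[Tuple[int, int]]) -> List[Optional[str]]:
    """Extract one word per (n1, n2) range relative to the reference keyword."""
    lines = [line.split() for line in full_text.splitlines()]
    base = find_keyword(lines, keyword)
    if base is None:
        return []                       # keyword absent: nothing to extract
    base_row, base_col = base           # S110: base search position
    results: List[Optional[str]] = []
    for n1, n2 in ranges:               # repeat S115 to S125 per defined range
        row = base_row + n1             # S115: skip n1 lines downward
        col = base_col + n2             # S120: move n2 words to the right
        in_bounds = row < len(lines) and col < len(lines[row])
        results.append(lines[row][col] if in_bounds else None)  # S125
    return results
```

For a hypothetical two-line text whose header row starts with the keyword, the ranges `(1, 0)` and `(1, 3)` pick the first and fourth words of the following line; a range that falls outside the text yields `None` rather than raising an error.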

The extracted word or words constitute the template data, which is parsed and refined according to predetermined parsing and refinement rules and then serves as the verification data.

In this embodiment, the data extraction method of steps S115 and S120, that is, line skipping and position movement, corresponds to the text processing method among the data processing methods shown in FIG. 9.

The text processing method extracts data by moving the position through line skips and word (column) moves relative to the reference keyword, and may follow a notation such as T(s_kw^e_kw, skip_kw1^skip_kw2^…, pick_kw1^pick_kw2^…), L(n), W(n), C(n) or N(n) or N(n).F(n), X(n), V(n) or V(999), {S(n)}.

Referring to FIG. 10, a description and an example of each notation used by the text processing method are summarized in a table.

T() denotes the method of extracting data by moving by lines and word positions relative to a keyword (the text processing method); a start position keyword value, an end position keyword value, skip keywords, extraction reference keywords, and the like are set in it. L(n) denotes a skip in units of lines, and W(n) denotes a move in units of words. C(n), N(n), and N(n).F(n) denote attribute information of the extracted data. X(n) denotes whether a single item or multiple items are extracted, V(n) denotes whether valid data extraction is processed and the number of extractions to process, and S(n) denotes a processing option for the case where the same modifier appears one or more times in the same file.
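The single-letter options in this notation lend themselves to simple tokenization. The sketch below only splits a definition string into (letter, n) pairs and does not interpret them; the regular expression, the function name `parse_options`, and the decision to leave the keyword arguments of T() unparsed are assumptions made for illustration.

```python
import re
from typing import List, Tuple

# Matches one option such as L(1), W(3), N(10), F(2), V(999) or S(2).
TOKEN = re.compile(r"([LWCNFXVS])\((\d+)\)")


def parse_options(definition: str) -> List[Tuple[str, int]]:
    """Return (letter, n) pairs in the order they appear in the definition."""
    return [(letter, int(n)) for letter, n in TOKEN.findall(definition)]
```

Because N(n).F(n) is matched as two consecutive tokens, a caller that needs to treat it as one attribute would have to pair an `F` entry with the `N` entry that precedes it.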

For example, referring to FIG. 11, when the data extraction definition for the handwritten operation instruction sheet is ① T("예탁딜코드",""), ② L(1), ③ W(3), the data extraction proceeds as follows.

① If the word "예탁딜코드" is present in the full-text data, the search position is set there. ② One line is skipped from the search position. ③ The position is moved to the third word from the search position, and the word there is extracted. In this way the data of the "Price" item can be extracted.

In addition to the text processing method, there may be an extension (Extend) processing method. The extension processing method extracts data by moving the position through reference-line skips and word moves when, after the first keyword is satisfied, the second through n-th keyword values are also satisfied. It may follow a notation such as E(s_kw^e_kw, kw1^kw2…., skip_kw1^skip_kw2^...), L(n), W(n), C(n) or N(n) or N(n).F(n).

Referring to FIG. 12, the variant of the extension processing method used for parsing market instructions is illustrated.

E() denotes the method of moving to the configured position and extracting data when the second keyword is satisfied after the first keyword has been satisfied; a start position keyword value, an end position keyword value, skip keywords, and the like are set in it. L(n) denotes a skip in units of lines, W(n) denotes a move in units of words, and C(n), N(n), and N(n).F(n) denote attribute information of the extracted data.

Referring to FIG. 13, an example of the extension processing method is shown.

When the data extraction definition is ① E("A011930","공급계약체결","계약기간","시작일"), where the keywords mean "supply contract concluded", "contract period", and "start date", ② L(0), ③ W(1), the data extraction proceeds as follows.

① If the full-text data contains consecutive words that satisfy the conditions "A011930", "공급계약체결", "계약기간", and "시작일", the search position is set there. ② Zero lines are skipped from the search position. ③ The position is moved to the first word from the search position, and the word there is extracted. In this example, "2013-09-23" is extracted.
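The extension processing example above can be sketched as follows. The sketch makes several assumptions: keywords are matched in order as whole whitespace-separated tokens, other words may appear between them (the source speaks of consecutive words, which may be stricter), and the final position is taken relative to the last keyword matched. The function name and the sample text used with it are hypothetical; only the keywords and the expected date come from the source.

```python
from typing import List, Optional


def extract_after_keyword_chain(full_text: str, keywords: List[str],
                                skip_lines: int,
                                move_words: int) -> Optional[str]:
    """Match keywords in order, then extract relative to the last one matched."""
    lines = [line.split() for line in full_text.splitlines()]
    pending = list(keywords)            # keywords still waiting to be satisfied
    for row, words in enumerate(lines):
        for col, word in enumerate(words):
            if pending and word == pending[0]:
                pending.pop(0)
                if not pending:         # every keyword has been satisfied
                    target_row = row + skip_lines   # L(n): skip n lines
                    target_col = col + move_words   # W(n): move n words right
                    if (target_row < len(lines)
                            and target_col < len(lines[target_row])):
                        return lines[target_row][target_col]
                    return None
    return None                         # keyword chain never completed
```

With L(0) and W(1), the word immediately to the right of the last keyword on the same line is returned, which matches the "시작일" followed by "2013-09-23" reading of the example.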

FIGS. 14 to 16 show application examples of template data extraction. The red boxes indicate the items from which data is extracted.

FIG. 14(a) shows examples, for a file format, of extracting all data of a specific item and of extracting only the data of the first item; (b) shows examples of extracting data by line skipping after the first item is extracted and of extracting only the very last data; (c) shows various examples of extracting data per item and of extracting data while skipping subtotal rows; and (d) shows an example of extracting data per item located at various positions in a file format.

FIG. 15(a) shows the data extraction range for each item of the Bloomberg equity and beneficiary certificate screen, (b) the data extraction range for each item of the Bloomberg bond screen, (c) the data extraction range for each item of the Bloomberg futures screen, and (d) the data extraction range for each item of the Bloomberg closing price screen.

FIG. 16(a) shows the extraction range for table-format disclosures within a market instruction, (b) the extraction range for numbered-format disclosures within a market instruction, (c) the extraction range for table-format data in an e-mail body, and (d) the extraction range for repeated table-format data in an e-mail body.

The handwritten input data consistency verification method and/or the data extraction method according to the present invention described above can be implemented as computer-readable code on a computer-readable recording medium. Computer-readable recording media include all kinds of recording media that store data readable by a computer system, for example ROM (read-only memory), RAM (random-access memory), magnetic tape, magnetic disks, flash memory, and optical data storage devices. The computer-readable recording medium may also be distributed over computer systems connected through a computer network, so that the code is stored and executed in a distributed manner.

According to the handwritten input data consistency verification method and system of this embodiment, base-price errors are reduced, professional liability insurance premiums are lowered, and work productivity can be improved by reducing the volume of simple verification work.

In addition, a registration function is provided for structured and unstructured forms in various formats, such as Excel, PDF, e-mail, Bloomberg screens, text documents, and reporting-tool printouts. A form needs to be registered only once per template for classification to work, and a version management function is provided.

Detailed function definitions are provided for extracting the data the user wants, and data can be extracted continuously from a file consisting of multiple pages without separate processing. During continuous extraction, unnecessary data or lines can also be skipped.

The extracted data can be converted or mapped to various formats, and detailed function definitions are provided for data format conversion. A detailed setting function is also provided so that a separate formula can be applied when the data is numeric.
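One way such a conversion step could look is sketched below: character substitution first, then numeric conversion, then an optional formula for numeric values. The function name `refine_value`, its parameters, and the use of `Decimal` are assumptions for illustration; the source does not describe the actual conversion rules.

```python
from decimal import Decimal, InvalidOperation
from typing import Callable, Dict, Optional, Union


def refine_value(raw: str,
                 replacements: Optional[Dict[str, str]] = None,
                 formula: Optional[Callable[[Decimal], Decimal]] = None
                 ) -> Union[str, Decimal]:
    """Substitute characters, then convert to Decimal when the text is numeric."""
    text = raw.strip()
    for old, new in (replacements or {}).items():
        text = text.replace(old, new)   # per-item character substitution
    try:
        number = Decimal(text)
    except InvalidOperation:
        return text                     # non-numeric data is returned as text
    # The formula is applied only to numeric data.
    return formula(number) if formula else number
```

For example, removing thousands separators turns "9,850.50" into a `Decimal`, while a non-numeric code such as "ELS-A" is passed through unchanged.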

Data can also be extracted from the market instruction format (large PDF files) published by the stock exchange, and simultaneous extraction with multiple templates from the same file can be provided.

Although the present invention has been described above with reference to preferred embodiments, those of ordinary skill in the art will understand that the invention can be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the claims below.

1: Handwritten input data consistency verification system
10: Data imaging module 20: Data extraction module
22: Full-text data extraction unit 24: Template classification unit
26: Template data extraction unit 28: Data parsing unit
30: Consistency verification module 32: Data reception unit
34: Consistency verification unit 36: Reporting unit
40: Template registration module 42: Template classification keyword collection unit
44: Data extraction definition unit

Claims

A structured and unstructured data extraction system comprising: a data imaging module that generates image data by imaging financial data requiring manual input; and
a data extraction module that extracts full-text data from the image data, classifies a template, extracts template data based on the classified template, and converts the template data into verification data.

The system according to claim 1,
wherein the data extraction module comprises:
a full-text data extraction unit that extracts full-text data corresponding to text by applying optical character recognition to the image data;
a template classification unit that, when at least one template keyword registered in a predefined template classification keyword table is included in the full-text data, classifies the full-text data as corresponding to the template associated with that template classification keyword table;
a template data extraction unit that extracts template data from the full-text data based on a data extraction definition table set for the classified template; and
a data parsing unit that parses and refines the template data to generate the verification data.

The system according to claim 2,
further comprising a template registration module that registers the template for each financial document,
wherein the template registration module comprises: a template classification keyword collection unit that collects and tabulates keywords for classifying a template to generate the template classification keyword table; and a data extraction definition unit that generates the data extraction definition table, which sets a reference keyword and an n-th data extraction range (n being a natural number of 1 or more) for extracting the template data from the classified template.

The system according to claim 3,
wherein the n-th data extraction range is a data extraction definition rule under which, for the full-text data divided into rows and columns, the reference keyword is taken as the search position, n1 lines are skipped, and the word at the position moved by n2 words is extracted as the template data.

The system according to claim 2,
wherein the template data extraction unit extracts the template data by a text processing method that extracts data by moving the position through line skips and word moves relative to a reference keyword, and
wherein, in the text processing method, T() denotes the text processing method and is set with a start position keyword value, an end position keyword value, a skip keyword, and an extraction reference keyword; L(n) denotes a skip in units of lines; W(n) denotes a move in units of words; C(n), N(n), and N(n).F(n) denote attribute information of the extracted data; X(n) denotes whether a single item or multiple items are extracted; V(n) denotes whether valid data extraction is processed and the number of extractions to process; and S(n) denotes a processing option for the case where the same modifier appears one or more times in the same file.

The system according to claim 2,
wherein the template data extraction unit extracts the template data by an extension processing method that extracts data by moving the position through reference-line skips and word moves when, after a first keyword is satisfied, the second through n-th keyword values are also satisfied, and
wherein, in the extension processing method, E() denotes the method of moving to the configured position and extracting data when the second keyword is satisfied after the first keyword has been satisfied, and is set with a start position keyword value, an end position keyword value, and a skip keyword; L(n) denotes a skip in units of lines; W(n) denotes a move in units of words; and C(n), N(n), and N(n).F(n) denote attribute information of the extracted data.

The system according to claim 1,
wherein the financial data includes one or more of handwritten instruction data received by e-mail or messenger, a Bloomberg terminal screen, and market instruction data in the form of a large PDF file, and
wherein the data imaging module images the handwritten instruction data and the market instruction data using a print-and-scan type P2I engine and images the Bloomberg terminal screen using a screen-capture type S2I engine.

A structured and unstructured data extraction method comprising:
generating image data by imaging financial data requiring manual input;
extracting full-text data from the image data;
classifying a template corresponding to the full-text data;
extracting template data based on the classified template; and
converting the template data into verification data.

The method according to claim 8,
wherein the step of classifying the template comprises, when at least one template keyword registered in a predefined template classification keyword table is included in the full-text data, classifying the full-text data as corresponding to the template associated with that template classification keyword table.

The method according to claim 9,
wherein the step of extracting the template data extracts the template data from the full-text data based on a data extraction definition table set for the classified template.

The method according to claim 10, further comprising:
collecting and tabulating keywords for classifying the template to generate the template classification keyword table; and
generating the data extraction definition table, which sets a reference keyword and an n-th data extraction range (n being a natural number of 1 or more) for extracting the template data from the classified template.

The method according to claim 10,
wherein the step of extracting the template data extracts the template data either by a text processing method that extracts data by moving the position through line skips and word moves relative to a reference keyword, or by an extension processing method that extracts data by moving the position through reference-line skips and word moves when, after a first keyword is satisfied, the second through n-th keyword values are also satisfied.

The method according to claim 8,
wherein the financial data includes one or more of handwritten instruction data received by e-mail or messenger, a Bloomberg terminal screen, and market instruction data in the form of a large PDF file, and
wherein the step of generating the image data images the handwritten instruction data and the market instruction data using a print-and-scan type P2I engine and images the Bloomberg terminal screen using a screen-capture type S2I engine.