KR101985612B1

KR101985612B1 - Method for manufacturing digital articles of paper-articles

Info

Publication number: KR101985612B1
Application number: KR1020180005598A
Authority: KR
Inventors: 김학선
Original assignee: 김학선
Priority date: 2018-01-16
Filing date: 2018-01-16
Publication date: 2019-06-03

Abstract

The present invention relates to a method for digitizing paper documents, comprising the steps of: obtaining a paper document to be digitized; scanning the obtained paper document to obtain a scanned image of the paper document; dividing the scanned image into a text area and an image area, forming, by using pixels, the text area into unit segments, which are the minimum unit segments connected to each other, character segments formed by the combination of one or more of the unit segments, character row segments formed by the combination of one or more of the character segments, and block segments formed by the combination of one or more of the character row segments, and sequentially forming group segments formed by the combination of one or more of the block segments; converting the contents of the paper document into a text file by extracting characters from the segmented scanned image, and converting the image area into an image file; and generating an extensible markup language (XML) by using the text file. Thus, it is easy to manage and use paper newspapers.

Description

METHOD FOR MANUFACTURING DIGITAL ARTICLES OF PAPER-ARTICLES

The present invention relates to a method of digitizing documents published on paper, and in particular to a method of digitizing paper newspapers.

Recently, with the development of the Internet, newspapers are being published in electronic form, and traditional paper newspapers are falling out of use.

Moreover, before newspapers went electronic, readers depended on paper newspapers, and there is a growing need to digitize that vast body of past information so that it can be used for statistics and analysis.

In addition, no software or tool has been developed that can properly digitize old documents containing many Chinese characters, which makes such documents difficult to digitize.

Furthermore, in the past even Korean text was written vertically as well as horizontally, and horizontal text was written either from left to right or from right to left, so digitizing such documents requires precise inspection and analysis.

However, no research and development has been carried out in this area.

The present invention aims to solve the above-mentioned problems and other problems. Another object is to provide a method of digitizing paper documents.

According to one aspect of the present invention, to achieve the above or other objects, a method of digitizing a paper document may be provided, comprising: obtaining a paper document to be digitized; scanning the obtained paper document to obtain a scanned image of the paper document; a segmentation step of dividing the scanned image into a text area and an image area and, for the text area, sequentially forming unit segments, each being a minimum-unit segment of mutually connected pixels, character segments each formed by combining one or more unit segments, character row segments each formed by combining one or more character segments, block segments each formed by combining one or more character row segments, and group segments each formed by combining one or more block segments; converting the content of the paper document into a text file by extracting characters from the segmented scanned image, and converting the image area into an image file; and generating Extensible Markup Language (XML) using the text file.

According to one aspect of the present invention, segmenting into unit segments may include: obtaining, within a predetermined rectangular area, a pixel cluster that contains the pixels adjacent to a given pixel in its 4-neighborhood (above, below, left, and right); judging the pixel cluster to be valid when its pixel count is at least a preset number, and collecting a plurality of valid pixel clusters into a set of pixel clusters; and judging the set of pixel clusters to be a unit segment when its length is at least a preset length.

According to one aspect of the present invention, segmenting into character segments may include selecting, from the set of unit segments, a first unit segment and a second unit segment adjacent to the first unit segment. Let S₁ be the area of the first unit segment, S₂ the area of the second unit segment, S_I the overlapping area of the first and second unit segments, and S_u the area of the first and second unit segments in their combined state. If at least one of conditions (1) to (4) below is satisfied, the two unit segments may be judged to belong to the same character segment.

$$S_I \ge k_1 S_1 \qquad (1)$$

$$S_I \ge k_1 S_2 \qquad (2)$$

$$S_1 \ge k_2 S_u \qquad (3)$$

$$S_2 \ge k_2 S_u \qquad (4)$$

In the above expressions, k₁ and k₂ are preset constants determined according to the typeface.
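Conditions (1) to (4) can be illustrated with a minimal Python sketch. It rests on assumptions that are not in the patent text: each unit segment is represented by an axis-aligned bounding box, S_u is taken to be the area of the smallest box enclosing both segments, and the default values of `k1` and `k2` are placeholders rather than values given by the source.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box of a unit segment (hypothetical representation)."""
    x0: float
    y0: float
    x1: float
    y1: float


def area(b: Box) -> float:
    return max(0.0, b.x1 - b.x0) * max(0.0, b.y1 - b.y0)


def intersection_area(a: Box, b: Box) -> float:
    """S_I: area shared by the two boxes (0 when they do not overlap)."""
    w = min(a.x1, b.x1) - max(a.x0, b.x0)
    h = min(a.y1, b.y1) - max(a.y0, b.y0)
    return max(0.0, w) * max(0.0, h)


def merged_area(a: Box, b: Box) -> float:
    """S_u, assumed here to be the area of the box enclosing both segments."""
    return area(Box(min(a.x0, b.x0), min(a.y0, b.y0),
                    max(a.x1, b.x1), max(a.y1, b.y1)))


def same_character(a: Box, b: Box, k1: float = 0.5, k2: float = 0.8) -> bool:
    """True when at least one of conditions (1) to (4) holds."""
    s1, s2 = area(a), area(b)
    s_i = intersection_area(a, b)
    s_u = merged_area(a, b)
    return (s_i >= k1 * s1      # (1)
            or s_i >= k1 * s2   # (2)
            or s1 >= k2 * s_u   # (3)
            or s2 >= k2 * s_u)  # (4)
```

With these placeholder constants, a small stroke lying inside a larger one is merged through condition (2), while two equally sized boxes that sit far apart satisfy none of the four conditions.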

According to one aspect of the present invention, segmenting into character row segments may include: obtaining a reference width (w₀) and a reference height (h₀) of the character segments from the sizes of the plurality of character segments; sorting the plurality of character segments along a first direction; forming a character row segment by selecting those character segments whose width and height fall within a preset size range relative to the reference width (w₀) and reference height (h₀); and starting a new character row segment when the gap between character segments that are adjacent in order is at least a preset gap along the first direction, or when a character segment does not overlap the character row segment in width or height along a second direction perpendicular to the first direction.
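The row-building steps above can be sketched as follows for horizontally written text, where the first direction is x and the second direction is y. This is only an illustration under assumptions: the reference size is taken as the median width and height, and `size_tol` and `gap_factor` are hypothetical tuning parameters that the patent leaves as "preset" values.

```python
from statistics import median

# A character segment is a bounding box (x0, y0, x1, y1).


def reference_size(boxes):
    """Reference width w0 and height h0, assumed here to be the medians."""
    w0 = median(b[2] - b[0] for b in boxes)
    h0 = median(b[3] - b[1] for b in boxes)
    return w0, h0


def build_rows(boxes, size_tol=0.5, gap_factor=1.5):
    """Group character boxes into rows along the x direction."""
    w0, h0 = reference_size(boxes)

    def size_ok(b):
        w, h = b[2] - b[0], b[3] - b[1]
        return abs(w - w0) <= size_tol * w0 and abs(h - h0) <= size_tol * h0

    rows = []
    # Sort the size-filtered characters along the first direction (x).
    for b in sorted(filter(size_ok, boxes), key=lambda box: box[0]):
        for row in rows:
            last = row[-1]
            gap = b[0] - last[2]                                 # gap along x
            overlap = min(b[3], last[3]) - max(b[1], last[1])    # overlap along y
            if gap < gap_factor * w0 and overlap > 0:
                row.append(b)
                break
        else:
            # Gap too large, or no overlap in the second direction: new row.
            rows.append([b])
    return rows
```

For vertically written text the same logic would apply with the roles of x and y exchanged.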

According to one aspect of the present invention, segmenting into block segments may include: storing character segments with different formats according to their width and height; repeatedly merging character row segments whose character segments have widths and heights within a preset range and whose lengths are equal within a preset range; during the merging, placing a character row segment that is separated from a neighboring character row segment by at least a preset gap along the second direction into a different block segment; and treating character row segments whose character segments differ in width and height as different blocks.

According to one aspect of the present invention, segmenting into group segments may include: forming into the same group those block segments that, among mutually adjacent block segments, are located below a first block segment containing a first character segment, the first character segment being the character segment with the largest width and height, and that at least partially overlap the first block segment along the second direction; and placing into a different group any block segment that is separated from the block segments of that group by at least a preset gap along the first direction or the second direction, or that intersects a dividing line or a second block segment containing the first character segment.

According to one aspect of the present invention, the method may further include grouping an image segment located between the grouped block segments into the same group.

According to one aspect of the present invention, the method may further include, after the segmentation step, correcting the scanned image by rotating it to the upright position if it was scanned at a tilt.

According to one aspect of the present invention, correcting the scanned image may include: judging a unit segment to be a line segment when its width or height is constant and one of its width and height exceeds the other by more than a preset amount; determining which of a first axis of the scanned image and a second axis perpendicular to the first axis the line segment lies nearer to; calculating the angle between the line segment and whichever of the first and second axes is nearer to it on the scanned image; and rotating the scanned image by the calculated angle so that the line segment coincides with that axis.

According to one aspect of the present invention, the method may further include, when the width and height of a unit segment are larger than a preset size, including that unit segment in an image segment and excluding it from the character row segments.

According to one aspect of the present invention, the paper document may be a paper newspaper.

According to one aspect of the present invention, the unit segments, character segments, character row segments, block segments, and group segments have coordinate information, and a group segment may form a polygon based on the coordinate information.

According to one aspect of the present invention, the group segment may be formed as a closed region bounded by a non-overlapping path traced along the outermost edges of the adjacent block segments.

According to one aspect of the present invention, the text file may include a title, an article category, a body, and a byline; in generating the Extensible Markup Language (XML) using the text file, the title, article category, body, and byline are each delimited by their own delimiters; and the method may further include generating NewsML (News Markup Language) by combining the XML with basic environment information.
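As a rough illustration of delimiting the four fields with their own tags, the sketch below builds a small XML fragment with Python's standard `xml.etree.ElementTree` module. The element names (`article`, `title`, `category`, `body`, `byline`) are hypothetical and are not taken from the NewsML schema or from the patent; a real NewsML document would wrap such content in the structure that standard defines.

```python
import xml.etree.ElementTree as ET


def article_to_xml(title: str, category: str, body: str, byline: str) -> str:
    """Serialize one article as XML, one element per field (hypothetical tags)."""
    root = ET.Element("article")
    for tag, text in (("title", title), ("category", category),
                      ("body", body), ("byline", byline)):
        ET.SubElement(root, tag).text = text
    # encoding="unicode" returns a str; special characters are escaped.
    return ET.tostring(root, encoding="unicode")
```

Building the tree through the library rather than by string concatenation means characters such as `&` or `<` in OCR output are escaped automatically.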

According to one aspect of the present invention, the method may include: converting the scanned image into a PDF file; generating the coordinates of the group segments on the PDF image; and selecting and registering, from among the files in the NewsML, the text file corresponding to the coordinates of the generated group segment.

According to one aspect of the present invention, the conversion into the text file may be performed by an optical character recognition (OCR) program.

The effects of the method of digitizing paper documents, in particular paper newspapers, according to the present invention are as follows.

According to at least one embodiment of the present invention, a paper document published on paper, such as a paper newspaper, can be digitized so that its content can be presented on the web.

According to at least one embodiment of the present invention, storing the text file under a file name related to the publication date makes paper newspapers easy to manage and use.

According to at least one embodiment of the present invention, storing the text and image areas of the paper newspaper together with coordinate information makes it easier to view the details of an item when it is clicked on the web.

Further scope of applicability of the present invention will become apparent from the detailed description below. However, since various changes and modifications within the spirit and scope of the invention will be clear to those skilled in the art, the detailed description and specific examples, such as the preferred embodiments of the invention, should be understood as given by way of illustration only.

FIG. 1 is a flowchart of a method of digitizing a paper newspaper related to the present invention.
FIGS. 2A to 2E are diagrams for explaining rotation of an image related to the present invention.
FIGS. 3A and 3B are scanned images of paper newspapers related to the present invention.
FIG. 4 is a flowchart related to tilt correction of a paper newspaper according to an embodiment of the present invention.
FIGS. 5A and 5B are diagrams for explaining segmentation according to an embodiment of the present invention.
FIG. 6 is an example for explaining a method of character row segmentation according to an embodiment of the present invention.
FIG. 7 is a scanned image of a block-segmented paper newspaper related to the present invention.
FIG. 8 is a scanned image of a paper newspaper containing photographs and advertisements related to the present invention.
FIG. 9 is a conceptual diagram of polygon processing of a group segment of a paper newspaper related to the present invention.
FIG. 10 is a diagram for explaining group segmentation related to the present invention.
FIG. 11 is a diagram for explaining cropping related to the present invention.

Hereinafter, the embodiments disclosed in this specification are described in detail with reference to the accompanying drawings. Identical or similar components are given the same reference numerals regardless of the figure, and redundant descriptions of them are omitted. The suffix "part" used for components in the following description is assigned or used interchangeably only for ease of drafting the specification and does not by itself carry a distinct meaning or role. In describing the embodiments disclosed herein, detailed descriptions of related known art are omitted where they are judged to obscure the gist of the embodiments. The accompanying drawings are provided only to aid understanding of the embodiments disclosed herein; they do not limit the technical idea disclosed in this specification, which should be understood to include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention.

Terms including ordinals, such as first and second, may be used to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another.

A singular expression includes the plural unless the context clearly indicates otherwise.

In the present application, terms such as "comprise" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

FIG. 1 is a flowchart of a method of digitizing a paper document related to the present invention. A method of digitizing a paper document according to an embodiment of the present invention is described below with reference to FIG. 1.

One embodiment of the present invention relates in particular to a method of digitizing paper newspapers among paper documents, and the description below focuses on that method. However, the following is not limited to digitizing paper newspapers and may also be applied to digitizing periodicals, magazines, or books such as monographs.

First, to digitize a paper document, the paper document to be digitized is obtained (S110), and the obtained paper document is scanned to convert it into an image (S120). Here, imaging means obtaining a scanned image by saving it as an image file with an extension such as jpg, png, bmp, gif, or tiff. The type of image file is not particularly limited, and image files with extensions other than those listed are of course also possible. If an already electronic document exists, for example a document in PDF format, the PDF document may be digitized instead. In that case no separate scanning is needed, the steps of obtaining the scanned image (S120) and correcting the scanned image (S130) may be omitted, and the subsequent steps are the same.

The method of digitizing a paper document according to an embodiment of the present invention may further include correcting the scanned image (S130). FIGS. 2A to 2E are diagrams for explaining rotation of a scanned image related to the present invention. FIG. 2A shows the scanned image before correction, and FIG. 2B shows it after correction.

For example, if the scanned image was captured at a tilt in the step of obtaining the scanned image, rotating it to the upright position allows the text to be recognized more accurately in the later text file conversion step (S150).

The method of correcting the scanned image is described below with reference to FIGS. 2A and 2B. Among the unit segments in the scanned image, one whose width or height is constant and in which one of the width and height exceeds the other by more than a preset amount is judged to be a line segment. It is then determined which of a first axis of the scanned image and a second axis perpendicular to the first axis the line segment lies nearer to, the angle between the line segment and that axis is calculated, and the scanned image is rotated by the calculated angle so that the line segment coincides with that axis.

Referring to FIGS. 2A and 2B, let the reference axes be X1 and Y1, let the horizontal and vertical axes of the scanned image be X2 and Y2, and let O be the intersection of the X1, X2, Y1, and Y2 axes. As shown in FIG. 2A, the X1 axis and the X2 axis may not coincide. In that case the error equals the angle formed by the X1 axis, the X2 axis, and O, so the scanned image is rotated by that angle.

The degree of mismatch between the X1 and X2 axes can be calculated from the content of the scanned image. For example, after identifying the regions of the scanned image that contain characters, connecting the positions of the first character of each line produces a line segment (L1), and the scanned image can be corrected by making this line segment (L1) upright. Alternatively, a horizontal or vertical line segment (L2) in the scanned image may be recognized and the image rotated so that the recognized line segment (L2) becomes horizontal or vertical. The line segment (L2) may be a dividing line separating adjacent regions.

That is, for paper newspapers and periodicals, the positions of the first character of each line are recognized in order, and connecting them yields a line segment (L1). This line segment is normally vertical, so if its angle deviates from the vertical, the image is rotated by the difference to bring it upright. A scanned image may end up slightly tilted when, for example, the operator feeds the paper document in slightly askew, or a crumpled paper document unfolds only partially during scanning. The center of rotation may be the center (O) of the paper document or the center (O) of the virtual space of the program that rotates the scanned image.
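One way to turn the first-character positions into a tilt angle is to fit a straight line through them. The sketch below uses an ordinary least-squares fit of x against y, which is an assumption of this example; the patent only says that the positions are connected into a line segment (L1). The input is a list of hypothetical `(x, y)` coordinates of the first character of each line.

```python
import math


def left_edge_skew(points):
    """Angle in degrees between the fitted first-character line and the vertical.

    `points` holds the (x, y) position of the first character of each text
    line. The line x = a*y + b is fitted by least squares, and atan(a) is
    the deviation of that line from the vertical axis.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    if syy == 0:
        raise ValueError("need first characters from at least two different lines")
    return math.degrees(math.atan(sxy / syy))
```

The sign of the result depends on the coordinate convention (image coordinates usually have y pointing down), so the direction of the corrective rotation should be checked against the imaging library in use.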

A paper document may also contain a fairly long horizontal or vertical line segment (L2). In that case the horizontal or vertical line segment is recognized and the image is rotated by its deviation from the horizontal or vertical. For example, a thin or thick line drawn horizontally on a paper document should have an angle of 0°; if it has some error, the image need only be rotated by that error. This may be called the tilt correction (fix orientation) step.

In other words, there are various ways to find the tilt of an image of a paper document, but paper newspapers allow a specific method. A newspaper generally has a header showing the date (year, month, day) and page information (see the line segment inside the red shape in FIG. 3B), with a line segment connecting two points. There may also be an advertisement boundary line marking an advertisement area (see the line segment inside the red shape in FIG. 3B), and a dividing line between articles (see the line segment inside the red shape in FIG. 3A) is always present in both horizontally and vertically set newspapers. The header line segment, the line segment marking the advertisement area, or the dividing line between articles may correspond to the Y2 axis of FIG. 2A.

Here, the tilt can be corrected by finding the line segment (Y2) and calculating the angle it forms with the Y1 axis. The angle can be obtained by constructing the right triangle formed by the Y1 axis, the Y2 axis, and O and applying trigonometric formulas to that right triangle.
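The right-triangle calculation can be written out as a short sketch. If the detected line segment runs from `p0` to `p1`, its horizontal offset `dx` and vertical offset `dy` are the two legs of the triangle, and the angle from the vertical is `atan(dx / dy)`. The function names and the point representation are assumptions made for this illustration.

```python
import math


def skew_from_vertical(p0, p1):
    """Signed angle in degrees between segment p0 -> p1 and the vertical axis."""
    dx = p1[0] - p0[0]
    dy = p1[1] - p0[1]
    if dy < 0:
        # Orient the segment so that it always points toward increasing y.
        dx, dy = -dx, -dy
    return math.degrees(math.atan2(dx, dy))


def rotate_point(p, center, degrees):
    """Rotate point p about center by the given angle (standard 2-D rotation)."""
    t = math.radians(degrees)
    x, y = p[0] - center[0], p[1] - center[1]
    return (center[0] + x * math.cos(t) - y * math.sin(t),
            center[1] + x * math.sin(t) + y * math.cos(t))
```

Rotating the end point of the segment about its start point by the measured angle brings it onto the vertical axis, which is the condition the correction step aims for; an image library would apply the same rotation to every pixel about the chosen center O.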

Meanwhile, among the unit segments described later, a segment of constant thickness whose horizontal length is at least a preset multiple (for example, 20 times) of its vertical length, or conversely whose vertical length is at least a preset multiple (for example, 20 times) of its horizontal length, can serve as the reference line segment for tilt correction.
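The aspect-ratio criterion can be sketched as a simple predicate on the bounding box of a unit segment. The factor of 20 is the example value from the text; the function name is hypothetical, and the check for constant thickness, which needs the actual pixels rather than the bounding box, is left out of this sketch.

```python
def is_reference_line(width: float, height: float, ratio: float = 20.0) -> bool:
    """True when one side of the bounding box is at least `ratio` times the other."""
    if width <= 0 or height <= 0:
        return False
    return width >= ratio * height or height >= ratio * width
```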

FIGS. 2C to 2E relate to image rotation according to an embodiment of the present invention. FIG. 2C shows the scanned image before tilt correction, FIG. 2D shows the image with the tilt corrected and the blank area removed, and FIG. 2E shows the image after the blank area has been removed and a margin has been set (added).

Once the tilt is corrected, the blank area outside the newspaper area is removed. This is done by starting from the left and extending the area to obtain one large frame segment. For example, in FIG. 2C, an imaginary vertical line is moved rightward from the left edge and the area up to the moment it touches the image of the paper document is stored; an imaginary vertical line is likewise moved leftward from the right edge and the area up to the moment it touches the image of the paper document is stored. This gives the horizontal extent of the scanned image of the paper document. The vertical extent can be obtained in a similar way. That is, the paper newspaper is cropped automatically (AutoTrim) by excluding the frame segment from the actual scan area of the scanned image. This corresponds to cutting away the grid-patterned area in FIG. 2C. Because the blank area is cut away after the scanned image has been rotated and corrected, the result is the same as rotating the actual paper document area with the white background shown in FIG. 2C removed.
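The sweep from each edge toward the content can be sketched on a binarized image. In this illustration the image is assumed to be a non-empty list of equal-length rows in which a truthy value marks a pixel belonging to the paper document; the function names are hypothetical.

```python
def content_bounds(img):
    """Return (left, top, right, bottom) of the area that contains content.

    Equivalent to moving a line inward from each edge until it first
    touches a content pixel. Returns None for an image without content.
    """
    rows = [y for y, row in enumerate(img) if any(row)]
    if not rows:
        return None
    cols = [x for x in range(len(img[0])) if any(row[x] for row in img)]
    return cols[0], rows[0], cols[-1], rows[-1]


def auto_trim(img):
    """Crop the image to its content bounds (empty list if there is no content)."""
    bounds = content_bounds(img)
    if bounds is None:
        return []
    left, top, right, bottom = bounds
    return [row[left:right + 1] for row in img[top:bottom + 1]]
```

A real scan would first need thresholding so that the scanner background and the paper can be told apart; that step is outside this sketch.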

A margin is then set. This step is performed because a paper newspaper from which the blank space has been removed has less blank space than the newspaper format as actually distributed and therefore does not look good.

The margin (the border (frame) portion in FIG. 2E) is set by adding blank space of a specified size to the background on the top, bottom, left, and right. That is, an empty frame is created whose size is the frame size after blank-space removal plus the margin size, and the margin is set by placing the paper newspaper in the empty frame.
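Placing the trimmed page inside a larger empty frame can be sketched as follows. The list-of-rows page representation, the `fill` value for blank background, and the function name are assumptions made for illustration.

```python
from typing import List


def add_margin(page: List[List[int]], top: int, bottom: int,
               left: int, right: int, fill: int = 0) -> List[List[int]]:
    """Return a new frame with blank margins of the given sizes around `page`."""
    if not page:
        return []
    new_width = left + len(page[0]) + right
    blank_row = [fill] * new_width
    framed = [blank_row[:] for _ in range(top)]        # top margin
    for row in page:
        framed.append([fill] * left + list(row) + [fill] * right)
    framed.extend(blank_row[:] for _ in range(bottom))  # bottom margin
    return framed
```

The output frame is `top + bottom` rows taller and `left + right` columns wider than the input, which matches the "frame size plus margin size" description.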

Thereafter, the image file is converted into a PDF file (S133). The file name is made to include information such as the newspaper company code, year, month, day, edition, page, and section, so that the contents of the file can be understood intuitively from the file name alone.
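One possible naming scheme is sketched below. The text lists the fields but not their order, separators, or widths, so the pattern, the function name, and the sample values are all assumptions for illustration only.

```python
from datetime import date


def build_pdf_name(paper_code: str, issued: date, edition: str,
                   page: int, section: str) -> str:
    """Join the descriptive fields into a single file name (hypothetical pattern)."""
    return f"{paper_code}_{issued:%Y%m%d}_{edition}_{page:02d}_{section}.pdf"


# Example with made-up values.
name = build_pdf_name("ABC", date(1965, 3, 7), "10", 3, "A")
```

Putting the date in year-month-day order keeps files for one newspaper sorted chronologically in a directory listing.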

Meanwhile, FIG. 4 is a flowchart relating to tilt correction according to an embodiment of the present invention, and illustrates the tilt correction and PDF conversion step (S130) of FIG. 1 in more detail.

The following description refers to FIG. 4.

The method of correcting the tilt of the scanned image described above becomes possible after the connected-point unit segments, the minimum units connected to one another, have been obtained. The connected-point unit segmentation (S131) according to an embodiment of the present invention is the process of obtaining these unit segments. Within a given rectangular area, a pixel cluster is obtained that includes the adjacent pixels located in the four neighboring positions (above, below, left, and right) of an arbitrary pixel. Each black pixel cluster is examined, and clusters whose pixel count is at least a predetermined number (k) are judged to be valid pixel clusters. A number of such valid pixel clusters are obtained, and when the length of a set of pixel clusters is at least a predetermined length, that set is judged to be a unit segment.

That is, for the arbitrary pixel obtained, the valid pixels among its adjacent pixels (the eight pixels of a 3×3 window excluding the arbitrary pixel at the center) are identified, and a pixel cluster composed of valid pixel clusters is counted as a stroke only when its length is valid. At this stage, the outline information (framework) of the actual character recognition area of the paper newspaper can be obtained from the unit segments.

In other words, to obtain the unit segments, pixels are recorded while moving rightward from the upper-left area of the paper newspaper, unit segments are obtained around pixel clusters having a valid size, and pixel clusters small enough to be noise are removed, thereby yielding the valid unit segments.
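The unit segmentation described above resembles connected-component labeling. The following is a minimal sketch under stated assumptions: the image is a list of rows in which 1 marks a black pixel, connectivity is the 4-neighborhood, and only the pixel-count threshold `k` is applied (the separate length threshold is omitted). The function name and box layout are hypothetical.

```python
from collections import deque
from typing import List, Tuple

# Inclusive pixel coordinates: (left, top, right, bottom).
Box = Tuple[int, int, int, int]


def unit_segments(binary: List[List[int]], k: int = 3) -> List[Box]:
    """Return bounding boxes of 4-connected black clusters with at least k pixels."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes: List[Box] = []
    for y in range(height):          # scan top to bottom
        for x in range(width):       # and left to right
            if binary[y][x] != 1 or seen[y][x]:
                continue
            queue = deque([(x, y)])
            seen[y][x] = True
            count = 0
            left = right = x
            top = bottom = y
            while queue:             # breadth-first flood fill of one cluster
                cx, cy = queue.popleft()
                count += 1
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and binary[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            if count >= k:           # smaller clusters are treated as noise
                boxes.append((left, top, right, bottom))
    return boxes
```

Clusters are reported in the order their first pixel is met during the left-to-right, top-to-bottom scan, and isolated specks below `k` pixels are dropped.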

Tilt correction requires selecting a reference line segment. In an embodiment of the present invention, the reference line segment for tilt correction is selected from among the unit segments, which are the minimum-unit segments. Whether adjacent pixels are connected to form a single segment is judged by the distance between adjacent pixels having a valid size: if the distance between adjacent pixels is greater than a predetermined distance, they are judged to be separate. In most such cases they are adjacently formed text. At this point only the unit segments have been formed, and the character segments have not yet been formed. That is, in the tilt and blank-space correction (S132), it is preferable to find a line segment among the unit segments that can serve as a reference, correct the tilt, and only then form the character segments, character line segments, block segments, and group segments. However, the invention is not necessarily limited to this; the tilt may also be corrected after the unit segments, character segments, character line segments, block segments, and group segments have all been formed.

Thereafter, the image file is converted into a PDF file (S133). This conversion only needs to be performed before the cropping described later, and need not necessarily be performed before the segmentation step (S140).

In an embodiment of the present invention, the scanned image is divided into a text area and an image area, and a segmentation step is performed on the text area, using pixels, that sequentially forms the following: unit segments, which are the minimum-unit segments connected to one another; character segments, each formed by combining one or more unit segments; character line segments, each formed by combining one or more character segments; block segments, each formed by combining one or more character line segments; and group segments, each formed by combining one or more block segments. By segmenting from small segments to progressively larger ones in this way, position information (coordinate information) for the text and image areas in the paper document can be obtained.

Segmentation according to an embodiment of the present invention is described below.

FIGS. 5A and 5B illustrate segmentation according to an embodiment of the present invention. The method can be applied to character segments, character line segments, block segments, and group segments, and in particular to the segmentation stages from the character segment onward. Since unit segmentation has been described above, character segmentation is described below with reference to FIGS. 5A and 5B.

In the method of segmenting into character segments according to an embodiment of the present invention, a first unit segment (the blue square in FIG. 5) and a second unit segment adjacent to it (the red square in FIG. 5) are selected from the set of unit segments. Let S₁ be the area of the first unit segment, S₂ the area of the second unit segment, S_I the overlap area of the first and second unit segments, and S_u the area of the first and second unit segments combined. The two are judged to form a character segment when Expressions (1) to (4) below are satisfied.

$$S_1 \ge k_2 S_u \tag{3}$$

$$S_2 \ge k_2 S_u \tag{4}$$

That is, in an embodiment of the present invention, character segmentation is based on computing the overlap area between adjacent segments, as described more specifically below with reference to FIG. 5B.

Two arbitrary segments (A1, A2) are selected from the set of unit segments, and their areas S₁ and S₂, the overlap area S_I between the two segments, and the area S_u of the two segments combined are obtained. For example, in the Hangul character '전', the 'ㅈ' unit segment and the 'ㅓ' unit segment (A1, A2) overlap in area, so they can be merged into the single character segment '저'.

If any one of conditions (1) to (4) is satisfied, the two segments are merged into one segment. Here, k₁ was set to 0.1 for printed type and 0.3 for handwriting, and k₂ was set to 0.95 regardless of whether the text is printed or handwritten. For ideal printed type k₁ could be set to 0, but in practice segments belonging to different characters may overlap slightly owing to noise, tilt, and other effects, so it was set to 0.1.

Because the values of k₁ and k₂ may vary with the typeface, the above values are merely examples and are not limiting. Through this process, the 'ㅈ' unit segment (A1) and the 'ㅓ' unit segment (A2) are merged, in view of their overlap area, into the single character segment '저'. Further, merging this '저' character segment (A1, A2) with the 'ㄴ' unit segment (A3), whose area overlaps it, completes the '전' character segment (A4). The character segment (A4) is formed by combining one or more unit segments, and in FIG. 5B it is represented by a rectangular area (A4) that contains the areas of the three unit segments (A1, A2, A3). That is, a character segment can be defined as a rectangular area containing the areas of a number of adjacent unit segments. The areas of the character line segments, block segments, and group segments described later can likewise be represented as rectangular areas.
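The overlap-based merging can be sketched as below. Several points are assumptions rather than statements from the text: Expressions (1) and (2) are not reproduced here, so the sketch assumes they compare the overlap area S_I with k₁ times each segment's area; S_u is taken to be the area of the rectangle enclosing both segments; boxes are `(left, top, right, bottom)` with exclusive right and bottom edges; and all function names and sample coordinates are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def area(b: Box) -> int:
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])


def overlap_area(a: Box, b: Box) -> int:
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0


def enclosing_box(a: Box, b: Box) -> Box:
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def should_merge(a: Box, b: Box, k1: float = 0.1, k2: float = 0.95) -> bool:
    s1, s2 = area(a), area(b)
    s_i = overlap_area(a, b)
    s_u = area(enclosing_box(a, b))  # assumed meaning of S_u
    # Assumed form of conditions (1) and (2): overlap is at least k1 of a segment.
    overlap_ok = s_i > 0 and (s_i >= k1 * s1 or s_i >= k1 * s2)
    # Conditions (3) and (4): one segment nearly fills the combined area.
    return overlap_ok or s1 >= k2 * s_u or s2 >= k2 * s_u


def merge_characters(boxes: List[Box], k1: float = 0.1, k2: float = 0.95) -> List[Box]:
    """Merge boxes pairwise until no pair satisfies any condition."""
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if should_merge(merged[i], merged[j], k1, k2):
                    merged[i] = enclosing_box(merged[i], merged[j])
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged
```

With three overlapping boxes standing in for 'ㅈ', 'ㅓ', and 'ㄴ', the first two merge first and the third then joins them, mirroring the '저' then '전' sequence described above; a distant box is left alone.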

A character line segment made up of character segments can be obtained by averaging the sizes of a number of character segments to obtain a reference width (w₀) and a reference height (h₀) for the character segments, and aligning the character segments in a first direction (for example, the horizontal direction).

FIG. 6 is an example illustrating the character line segmentation method according to an embodiment of the present invention; the following description refers to FIG. 6.

Character segments whose width and height fall within a predetermined size range defined relative to the reference width (w₀) and reference height (h₀) are selected to form a character line segment. When the spacing between sequentially adjacent character segments along the first direction is at least a predetermined interval (d), or when their widths or heights do not overlap along a second direction perpendicular to the first direction, a new character line segment is started. FIG. 6 takes horizontal writing as an example; for vertical writing, the spacing between character segments should be measured in the vertical rather than the horizontal direction. That is, if the first direction is horizontal, d is the horizontal distance between the current character segment and the last line of the previous line-segment set, and if the first direction is vertical, d is the vertical distance between the current character segment and the last character segment of the previous line-segment set. The character segments may also form word segments, each forming one word. In that case, the spacing between adjacent character segments can be judged against a multiple of the width or height of the reference character.

The character line segmentation method according to an embodiment of the present invention builds character lines from the segment set obtained in the preceding step, which improves segmentation accuracy and shortens segmentation time.

In general, the width of a character line is similar to the character size, and segments lying on one line overlap heavily when viewed along the direction of the line. If effects such as noise and tilt are ignored, segments belonging to different lines can be regarded as never overlapping. Character lines can therefore be built by examining how much the character segments overlap along the line direction.

That is, the reference width (w₀) and reference height (h₀) of the character segments are obtained, and only segments whose width is, for example, at least 0.4 w₀ and at most 1.2 w₀ are selected, so that unneeded segments (noise) do not affect line separation.

From the segment set thus obtained, the first character segment is taken as the first line segment to begin the line-segment set. Then, starting from the second character segment, a segment that satisfies the following condition is placed in the line-segment set as a new character line; otherwise it is merged with the last line of the line-segment set. The character is compared with the line segment, and if their widths do not overlap it is regarded as a new line and a new line segment is created and added. If they do overlap, the segment is merged into the last line when, for example, the overlapping width is within 24% of the reference width.
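The line-building procedure can be sketched for horizontal writing as follows. This is a simplified sketch under stated assumptions: character boxes arrive in reading order, a character joins the last line when its vertical extent overlaps that line and the horizontal gap is below `max_gap` (standing in for the interval d), and the 24% overlap rule is not modeled. Function names and the box layout are hypothetical.

```python
from statistics import mean
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def reference_size(chars: List[Box]) -> Tuple[float, float]:
    """Average width and height of the character boxes (w0, h0)."""
    w0 = mean(r - l for l, t, r, b in chars)
    h0 = mean(b - t for l, t, r, b in chars)
    return w0, h0


def build_lines(chars: List[Box], max_gap: float) -> List[Box]:
    """Group character boxes, given in reading order, into line boxes."""
    if not chars:
        return []
    w0, _ = reference_size(chars)
    # Drop boxes far from the reference width so noise does not split lines.
    kept = [c for c in chars if 0.4 * w0 <= c[2] - c[0] <= 1.2 * w0]
    lines: List[Box] = []
    for l, t, r, b in kept:
        if lines:
            ll, lt, lr, lb = lines[-1]
            overlaps_vertically = min(lb, b) - max(lt, t) > 0
            gap = l - lr
            if overlaps_vertically and gap < max_gap:
                # Extend the last line to cover this character.
                lines[-1] = (min(ll, l), min(lt, t), max(lr, r), max(lb, b))
                continue
        lines.append((l, t, r, b))  # start a new line
    return lines
```

For vertical writing the same idea would apply with the roles of the horizontal and vertical axes exchanged.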

In the method of segmenting into block segments formed from the character line segments, the character segments are first stored in different formats according to their reference width and reference height. Character line segments whose reference width and reference height fall within a predetermined range and which have the same length are then merged repeatedly. During this merging, a character line segment that is separated from its neighboring character line segment by at least a predetermined interval along the second direction forms a different block segment, and character line segments whose character segments have a different reference width and reference height are treated as different blocks.

More specifically, the reference width (w₀) and reference height (h₀) of the body-text character segments have already been obtained, and the line spacing can also be calculated, so a line within the line-level segments that contains a character string larger than the reference size is classified as a title block segment. The sets of lines made up of reference-size characters are compared to find a regular pattern of column gaps; in horizontal writing, a column gap is taken to be a region, within a fixed size (one third of the reference width), in which nothing overlaps vertically.

FIG. 7 is a scanned image of a block-segmented paper newspaper relating to the present invention, and FIG. 8 is a scanned image of a paper newspaper containing photographs and advertisements relating to the present invention; the following description refers to FIGS. 7 and 8. When the next line segment is to be joined into a block with the current line segment, the line segments must have the same length within a certain range. As line segments of equal length are merged and the block grows line by line, a new block is started when a character line segment or block longer than the current line length is encountered.

When lines larger than the reference width (w₀) and reference height (h₀) of the body text occur consecutively and do not overlap other lines, they form one block; this mainly concerns title areas. Merging the lines that repeat with a consistent line length at the reference width (w₀) and reference height (h₀) yields the body text. Once merging at the line-segment level is complete in this manner, most of the blocks are complete.
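Merging line segments of similar length into blocks can be sketched as below. The sketch assumes line boxes arrive top to bottom within one column, and it uses two illustrative parameters that are not taken from the text: `length_tol` for "the same length within a certain range" and `max_line_gap` for the predetermined interval between lines. The character-size comparison between lines is not modeled.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def build_blocks(lines: List[Box], length_tol: int, max_line_gap: int) -> List[Box]:
    """Merge consecutive line boxes of similar length into block boxes."""
    blocks: List[Box] = []
    for left, top, right, bottom in lines:
        if blocks:
            b_left, b_top, b_right, b_bottom = blocks[-1]
            same_length = abs((right - left) - (b_right - b_left)) <= length_tol
            close_enough = top - b_bottom < max_line_gap
            if same_length and close_enough:
                # Grow the current block downward to include this line.
                blocks[-1] = (min(b_left, left), b_top,
                              max(b_right, right), max(b_bottom, bottom))
                continue
        blocks.append((left, top, right, bottom))  # start a new block
    return blocks
```

A line that is much longer than the current block, such as a headline spanning several columns, or a line after a large vertical gap, starts a new block.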

The areas of the blocked segments can be distinguished by character size: headlines and subheadings are judged against predetermined character-size criteria, and body text is judged against the reference character size. Most title sizes in paper newspapers are already defined by each newspaper company. That is, segments can be classified by character size into headline segments, subheading segments, body-text segments, and so on.

Meanwhile, a unit segment whose width and height are larger than predetermined sizes is included in the image segments and excluded from the character line segments. For example, unit segments whose width and height are at least 5 times (500%) the reference character size are extracted, and when the overlapping portion of such unit segments is at least 500% of the reference character size, they are included in the image segment and removed from the character line segments. The image area can be segmented in this way; alternatively, an area in which pixels of colors other than black are formed can be segmented as an image area. This is because text is printed in black, whereas advertisements and photographs mostly contain colors other than black. In addition, a unit segment that forms a rectangular box and is larger than the reference size by a given amount may be judged to be an advertisement or photograph segment regardless of whether it contains text.
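The size-based separation of image segments from text segments can be sketched as follows. Only the "at least 5 times the reference character size in both width and height" rule is modeled; the overlap and color criteria are left out, and the function name and box layout are assumptions.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def split_text_and_image(boxes: List[Box], w0: float, h0: float,
                         factor: float = 5.0) -> Tuple[List[Box], List[Box]]:
    """Return (text_boxes, image_boxes) using the reference size (w0, h0)."""
    text: List[Box] = []
    image: List[Box] = []
    for box in boxes:
        left, top, right, bottom = box
        is_large = (right - left) >= factor * w0 and (bottom - top) >= factor * h0
        (image if is_large else text).append(box)
    return text, image
```

A box that is wide but only one character tall, such as a text line, stays on the text side because both dimensions must exceed the threshold.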

In general, the column advertisements of a paper newspaper are managed by the newspaper's advertising department, and their positions are always fixed. That is, they run from 1 column to 15 columns (a full-page advertisement) counted from the bottom, and for irregular-column advertisements the position differs between odd and even pages. A column advertisement generally has a dividing line, running from left to right or from top to bottom, that separates the newspaper content from the advertisement, or the advertisement is enclosed in a box. The same applies to irregular-column advertisements. For paper newspapers, this pattern is used to judge a segment of the photograph area to be an advertisement.

Thereafter, the block segments that make up the same article need to be gathered into one group. For example, the title, the article body, the byline, and so on are bundled into one group to form a group segment. This is referred to as group segmentation.

In the method of segmenting into group segments according to an embodiment of the present invention, the starting point is a first block segment containing a first character segment, namely the character segment with the largest width and height among mutually adjacent block segments. Block segments that are located below the first block segment and at least partly overlap it along the second direction are grouped into the same group. A block segment that is separated from the block segments of that group by at least a predetermined interval along the first or second direction, or that intersects a dividing line or a second block segment containing the first character segment, is grouped into a different group.

This is because, in paper newspapers, the headline with the largest character size is located at the top, with the subheading or body text located beneath it.

To bundle blocks into the same group, the first block segment is first compared with the rectangle of the next adjacent block segment, and if they intersect, the new segment is added. For a horizontally written newspaper, the blocks below a headline are all grouped into one article until the next headline or an advertisement separator line appears. Text boxes are grouped following the flow from left to right, and a text box that has no title box of its own and lies in an area not overlapping another headline area is joined to the same group. If the resulting group is not a rectangle, its area is generated as a polygon.

Vertical newspapers, written in vertical script, mostly produce polygons. They too are organized around the headline and are divided by column rules and separator lines, and the text flows from top to bottom and from right to left. Because a vertical newspaper consists of, for example, 15 columns from top to bottom, when a text box meets another title block segment, separator-line segment, or photograph block segment on its left, the text block at the right of the next column down is sought.

Meanwhile, in an embodiment of the present invention, the unit segments, character segments, character line segments, block segments, and group segments have coordinate information, and a group segment forms a polygon from this coordinate information. More specifically, the group segment is determined as the closed region formed by a non-repeating path that travels along the outline of the adjacent block segments. That is, blocks that make up the same article must be placed within one group segment. Group segments are usually rectangular, but are not limited to this and may be polygons of various shapes.

FIG. 9 is a conceptual diagram of the method of converting a group segment into a polygon according to an embodiment of the present invention, and FIG. 10 is a flowchart of the method of converting a group segment into a polygon according to an embodiment of the present invention.

The method of converting a group segment into a polygon according to an embodiment of the present invention is described below with reference to FIGS. 9 and 10.

First, a set is formed from the block segments. Starting from the first block segment, each is compared with the next block segment, and if the two are identical or one is 100% contained in the first block segment, the smaller segment is removed. The segments are then expanded by the horizontal and vertical snap sizes, and the set is replaced with the set of new block segments. Here, the snap size means the horizontal column gap and the vertical line gap of the article. In other words, in a horizontally written newspaper, the horizontal separation between adjacent block segments is called the column gap; it is the horizontal separation between block segments that lie below a title segment and form the same group segment. The area of each block segment is expanded on the basis of this column gap. As a result, the block segment comes to overlap the adjacent block segment and forms the same group segment with it.
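The two preparatory steps, removing contained blocks and expanding by the snap size, can be sketched as follows. The sketch assumes each block is grown outward on every side by the full snap size and is not clamped to the page; the text does not state either detail. Function names and the box layout are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def drop_contained(blocks: List[Box]) -> List[Box]:
    """Remove blocks fully inside another block; keep the first of exact duplicates."""
    kept: List[Box] = []
    for i, a in enumerate(blocks):
        contained = False
        for j, b in enumerate(blocks):
            if i == j:
                continue
            inside = b[0] <= a[0] and b[1] <= a[1] and a[2] <= b[2] and a[3] <= b[3]
            # A strict sub-block is dropped; of two equal blocks the later one is dropped.
            if inside and (a != b or j < i):
                contained = True
                break
        if not contained:
            kept.append(a)
    return kept


def expand_blocks(blocks: List[Box], snap_x: int, snap_y: int) -> List[Box]:
    """Grow every block by the horizontal and vertical snap sizes."""
    return [(l - snap_x, t - snap_y, r + snap_x, b + snap_y) for l, t, r, b in blocks]
```

After expansion, blocks separated by no more than roughly twice the snap size overlap, which is what allows neighboring blocks of one article to be joined.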

This is described more specifically with reference to FIG. 9.

Starting from the first block segment (B1), it is compared with the following segments (B2, B3), and intersecting rectangles are added. The information for each corner of the block segment rectangles is generated to form an array. Points in the array that lie inside the group segment (marked x) are deleted. Starting from the first position (B11), all points are compared to find and add the next coordinate. A point that has been added is not compared again. If any points remain, they are processed as a new separate polygon.

That is, the block segments (B1, B2, B3) that are adjacent and overlap in at least part of their area are first laid out; the position information of the overlapped area (the points marked x) is ignored, and the position information of the newly formed parts (the points marked ⊙) is added. Then, starting from the first block segment (B1), the position information that gives the largest area along the outline is found. This is obtained by moving in the direction of the arrows in FIG. 9, and the area of the group segment is given by the line segments connecting the outermost points.
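Finding which points belong to the outline of overlapping rectangles can be sketched with a quadrant test. This swaps in a different technique from the arrow-following traversal of FIG. 9 and is only an illustration: it returns the outline vertices unordered, and it assumes integer pixel coordinates so that a half-pixel probe is safe. Function names are hypothetical.

```python
from itertools import product
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), integer coordinates


def covered(rects: List[Box], x: float, y: float) -> bool:
    """True if the point lies strictly inside at least one rectangle."""
    return any(l < x < r and t < y < b for l, t, r, b in rects)


def outline_vertices(rects: List[Box]) -> List[Tuple[int, int]]:
    """Return the corner points of the outline of the union of the rectangles."""
    xs = sorted({v for l, t, r, b in rects for v in (l, r)})
    ys = sorted({v for l, t, r, b in rects for v in (t, b)})
    offsets = ((-0.5, -0.5), (0.5, -0.5), (-0.5, 0.5), (0.5, 0.5))
    vertices = []
    for x, y in product(xs, ys):
        # Probe the four quadrants around the candidate point.
        quads = [covered(rects, x + dx, y + dy) for dx, dy in offsets]
        n = sum(quads)
        diagonal = n == 2 and quads[0] == quads[3]
        # 1 or 3 covered quadrants: a convex or concave corner.
        # 2 covered on a diagonal: two regions touching at a point.
        if n in (1, 3) or diagonal:
            vertices.append((x, y))
    return vertices
```

Points with all four quadrants covered correspond to the deleted interior points (marked x), points with three covered quadrants correspond to the newly formed concave corners (marked ⊙), and points lying in the middle of a straight edge are skipped.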

Referring to FIGS. 10A to 10D, FIG. 10A shows the state after block segmentation into headline, subheading, and body text; FIG. 10B shows the state in which adjacent block segments of FIG. 10A have been formed into larger block segments; FIG. 10C shows the state in which the text segments, excluding the image segment, have been grouped; and FIG. 10D shows the image segment and the text segments formed into one group segment.

More specifically, in FIG. 10A a number of block segments are laid out, partitioned into headline, subheading, body-text, and image segments. At this point the headline segment and subheading segment have not yet been grouped. The headline segment's area is then expanded by a multiple of the headline segment's height, which is the snap size, thereby grouping the headline segment with the subheading segment. The body-text segments are likewise expanded by a multiple of the column gap (not necessarily an integer multiple), grouping them into body-text segments covering a larger area. Through this process, adjacent block segments are grouped as shown in FIG. 10B. Further, as shown in FIG. 10C, the text segments located below the headline segment that form the same article are grouped. This results from the overlap between the expanded headline segment and the expanded areas of the body-text segments. The text segments are grouped in this way. Next, as shown in FIG. 10D, the image segment's area is expanded so that it overlaps the text segments, completing the group segment.
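Grouping by overlap of expanded blocks can be sketched with a union-find structure. This is a simplification: it applies one snap size to every block, whereas the text uses a multiple of the headline height for headlines and a multiple of the column gap for body text. Function names, the box layout, and the sample coordinates are assumptions.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def intersects(a: Box, b: Box) -> bool:
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def group_blocks(blocks: List[Box], snap_x: int, snap_y: int) -> List[List[int]]:
    """Return lists of block indices whose expanded areas overlap, directly or in a chain."""
    grown = [(l - snap_x, t - snap_y, r + snap_x, b + snap_y) for l, t, r, b in blocks]
    parent = list(range(len(blocks)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(grown)):
        for j in range(i + 1, len(grown)):
            if intersects(grown[i], grown[j]):
                parent[find(j)] = find(i)

    groups: Dict[int, List[int]] = {}
    for i in range(len(blocks)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

With a headline above two body-text columns and an unrelated block farther down the page, the first three fall into one group because their expanded areas overlap in a chain, and the distant block forms its own group.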

The image segments may include photo segments and advertisement segments, where a photo segment is a photo area and an advertisement segment is an advertisement area. Alternatively, the image segments may be left unsegmented and converted into image files during the text conversion step (S150).

In the segmentation step, the segments are recognized merely as images and not yet as characters. A text segment (including character, character row, block, and group segments) can therefore be understood as any segment other than an image segment.

The text segments must then be converted into text. In one embodiment of the present invention, the content of the scanned image, more specifically the various text segments, is read, and the characters recorded in the paper document (newspaper) are extracted and converted into a text file (S150). The text file has, for example, a txt extension.

The conversion into the text file may be performed by a character recognition program, for example an optical character recognition (OCR) program, and the type of program is not limited.

In one embodiment of the present invention, text is extracted from the scanned image by a character recognition program, but extracting text from old documents is not easy, particularly for Chinese characters (Hanja, 漢字). To address this, in one embodiment of the present invention the character area may be divided into segments of a predetermined size, and the characters located in each generated segment may be recognized and stored as text.

The method may further include a step of modifying a segment by specifying information to be added to the generated segment. In this step, samples of similar characters may be collected to derive a representative glyph, and the system may be trained to convert those characters to the representative glyph so that recurring characters are recognized without error. Corrected characters for previously learned error characters may also be applied to modify the segment.

The characters extracted from the paper document may then go through a text proofreading step (S160) that corrects spelling, spacing, punctuation, and the like. From this point on, the content of the paper document can be recognized as characters.

The proofreading step includes recommending replacement words when an error word that differs from the words in a preset dictionary is found, and selecting a recommended word to replace the error word. That is, a word that does not match any word in the preset dictionary (module) is detected and judged to be an error word, and the error word is replaced with an appropriate word.

When the error word appears again, it is automatically replaced with the recommended word. The error word does not have to be replaced with a recommended word, however: if no candidate word exists to replace it, a new word is added, and later occurrences continue to be corrected with that new word. This allows a new word to be added when the word is not in the preset dictionary and no suitable replacement exists, so that later occurrences are replaced with the same word. In this way, in one embodiment of the present invention, characters (words) identified as the same error go through the same correction.
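The recommend, select, and reapply cycle described above can be sketched as follows. This is only an illustration under stated assumptions: the class name `Proofreader` is hypothetical, and candidate words are recommended with the standard-library `difflib.get_close_matches`, which stands in for whatever recommendation logic the dictionary module actually uses.

```python
import difflib


class Proofreader:
    """Dictionary-based proofreading with a memory of earlier corrections."""

    def __init__(self, dictionary):
        self.dictionary = set(dictionary)
        self.learned = {}  # error word -> replacement chosen earlier

    def suggest(self, word, n=3):
        # Recommend replacement candidates for a word not in the dictionary.
        return difflib.get_close_matches(word, sorted(self.dictionary), n=n)

    def learn(self, error_word, replacement):
        # Remember the chosen replacement; a replacement that is not yet in
        # the dictionary is added as a new word.
        self.learned[error_word] = replacement
        self.dictionary.add(replacement)

    def correct(self, words):
        # Reapply learned corrections; unknown words without a learned
        # correction are left unchanged for later review.
        return [w if w in self.dictionary else self.learned.get(w, w)
                for w in words]
```

A typical use would call `suggest` on an unknown word, let an operator pick or type a replacement, record it with `learn`, and then run `correct` over later text so the same error receives the same correction.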

In one embodiment of the present invention, an inspection module that checks words for errors is used to find the error words. The inspection module may be one or more of the following: a dictionary module in which a large number of words are stored; a check module that checks words against the dictionary module to determine whether they are erroneous; a base-form restoration module that analyzes morphemes and the like to restore the base form of a word; a Hanja-to-Hangul conversion module that converts Chinese characters into Hangul; a morphological analysis module that analyzes the morphemes of a word; a part-of-speech information management module that manages part-of-speech information for words; and a rule encoder module that applies separate rules to words or phrases with frequently made spelling mistakes, loanword notation errors, input errors, or spacing errors (words wrongly separated or wrongly joined).

Once the text file conversion (S150) is complete, the text file is converted into Extensible Markup Language (XML) (S170), and the resulting XML file is stored.

The Extensible Markup Language (hereinafter "XML") is a language for electronic commerce, web portals, content services, and information processing applications implemented on the Internet. XML is a general-purpose language for structuring documents for transmission over the World Wide Web (WWW). XML is designed to describe content rather than screen output, and it allows users to define new tags to describe structure. Because XML defines data with the data structure, the content, and the output format kept separate, it offers advantages that follow from structured data, such as reuse of data structures, flexibility of output format, and the ability to search the data structure. The basic unit of XML is the element. Users can define elements freely, and an element may contain text, other elements, or a mixture of text and elements. Owing to this flexibility in expressing information, XML has become a standard for representing information and exchanging electronic data on the Internet. Since XML is a well-known language, a detailed description is omitted below.

In one embodiment of the present invention, the paper document may be a periodical, and the following description focuses on the case where the periodical is a paper newspaper. However, the scope of the present invention is not limited to this case and also covers paper documents that are standalone books.

Meanwhile, in one embodiment of the present invention, the text files and image files (segments) are stored under file names that include the publication date, edition, and page of the paper newspaper. Because the file name is tied to the publication date, the file name itself indicates what the file contains, which makes the files easier to use and manage. More specifically, the file names can be extracted to determine readily whether any scanned image is missing or duplicated, and the number of pages published on a given date can be checked easily.

In addition, in one embodiment of the present invention, files may be named using the path from the upper folder (or directory) down to the lower folder (or directory). That is, the name of a file in a given folder may include the names of the folders along that folder's path. Naming files along the path makes it easy to locate a particular file and to check whether a file is stored in the correct location.

As described above, storing the text files of a paper newspaper under file names tied to the publication date, edition, and page number makes it easy to check whether the text file corresponding to a paper document is missing. That is, by comparing the text files against the information already held about the paper documents, the pages can be checked in bulk to determine whether a page is missing or is a full-page advertisement. The information already held may be, for example, the scanned images or, where the scanned images have been converted to PDF, the PDF files.
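The bulk check described above can be sketched as a comparison of file-name stems. The sketch assumes that each scan (or PDF) file and its text file share the same stem, which the description does not state explicitly; the function name `audit_pages` is illustrative.

```python
from collections import Counter
from pathlib import PurePath


def audit_pages(scan_files, text_files):
    """Compare scan/PDF file names with text file names by stem.

    Returns (missing, duplicated): stems that have a scan but no text file,
    and stems that occur more than once among the text files.
    """
    scan_stems = [PurePath(p).stem for p in scan_files]
    text_stems = [PurePath(p).stem for p in text_files]
    missing = sorted(set(scan_stems) - set(text_stems))
    duplicated = sorted(s for s, c in Counter(text_stems).items() if c > 1)
    return missing, duplicated
```

A stem reported as missing is only a candidate: the page may genuinely lack a text file, or it may be a full-page advertisement with no article text.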

The step of entering articles is described in more detail below.

First, the content of the newspaper articles is saved as one file per page. When the newspaper has multiple editions, the file is named in the form newspaper(2)_publicationdate(8)_edition(1)_page(2).txt; when there is only one edition, the edition part is omitted and the form newspaper(2)_publicationdate(8)_page(2).txt is used. The number in parentheses is the number of characters in each part. For example, the text files may be named GJ_19890102_01.txt and GJ_19890102_2_01.txt. The character counts in parentheses are only examples.
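As a minimal sketch, the naming convention above can be built and parsed as follows. The field widths (2, 8, 1, 2) follow the example in the text, which itself calls them only examples; the names `PageFile`, `parse_name`, and `build_name`, and the restriction of the newspaper code to two letters, are assumptions made for illustration.

```python
import re
from typing import NamedTuple, Optional


class PageFile(NamedTuple):
    paper: str              # two-letter newspaper code, e.g. "GJ"
    date: str               # publication date as YYYYMMDD
    edition: Optional[str]  # single digit, or None when there is one edition
    page: str               # two-digit page number


# Edition is optional: paper_date_page.txt or paper_date_edition_page.txt
NAME_PATTERN = re.compile(r"^([A-Za-z]{2})_(\d{8})(?:_(\d))?_(\d{2})\.txt$")


def parse_name(name: str) -> PageFile:
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    return PageFile(*match.groups())


def build_name(paper: str, date: str, page: int, edition: Optional[int] = None) -> str:
    parts = [paper, date]
    if edition is not None:
        parts.append(str(edition))
    parts.append(f"{page:02d}")
    return "_".join(parts) + ".txt"
```

Parsing the names back into fields is what makes the later checks possible, such as counting the pages published on a given date.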

The text file is created per page. The text file contains a title, an article classification, a body, and a byline, each marked by its own delimiter. For example, every article may be entered as @@@title, ###article classification, $$$body, %%%reporter name, where @@@, ###, $$$, and %%% are the delimiters for the title, article classification, body, and reporter name (or byline), respectively. Even when the author is not a reporter, for example when the article ends with a designation such as hospital director or professor, the name is marked with %%%. Here, the byline refers to the person who wrote the piece, such as a reporter or contributor.
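A minimal parsing sketch for this delimiter format follows. It assumes that each delimiter appears at the start of a line, that a new @@@ title starts a new article, and that undelimited lines continue the preceding field; none of these layout details are specified in the text, and the field keys are illustrative.

```python
FIELDS = {"@@@": "title", "###": "category", "$$$": "body", "%%%": "byline"}


def _new_article():
    return {"title": "", "category": "", "body": "", "byline": ""}


def parse_articles(text):
    """Split one page's text file into a list of article dictionaries."""
    articles = []
    current = None
    field = None
    for line in text.splitlines():
        marker = line[:3]
        if marker in FIELDS:
            field = FIELDS[marker]
            # A title delimiter (or any delimiter before the first title)
            # opens a new article.
            if field == "title" or current is None:
                current = _new_article()
                articles.append(current)
            current[field] = line[3:].strip()
        elif current is not None and field is not None and line.strip():
            # Continuation line: append to the field currently being read.
            current[field] = (current[field] + "\n" + line.strip()).strip()
    return articles
```

Each returned dictionary then carries the four fields separately, which is the form the later classification and byline checks work from.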

A text file written in this way makes the content easy to classify. Because predetermined content follows each delimiter, statistical analysis is straightforward. For example, the article classification may be determined from words that appear repeatedly in the body, or the words that appeared frequently in paper newspapers during a particular period may be analyzed to identify the issues of that era.
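The frequency analysis mentioned above can be sketched with a simple word count. Splitting on whitespace is an assumption made for brevity; Korean text would normally be passed through a morphological analyzer first, and the function name `top_terms` and the stop-word parameter are illustrative.

```python
from collections import Counter


def top_terms(bodies, stopwords=frozenset(), n=5):
    """Return the n most frequent words across a collection of article bodies."""
    counts = Counter()
    for body in bodies:
        for token in body.split():
            # Strip surrounding punctuation and normalize case.
            token = token.strip(".,!?\"'()[]").lower()
            if token and token not in stopwords:
                counts[token] += 1
    return counts.most_common(n)
```

Running this over the bodies of all articles from a chosen period gives a rough picture of which words dominated that period.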

In addition, in one embodiment of the present invention, classification information may be built in advance; for example, the article classification may be determined by frequency of occurrence after character filtering. Similar classification information may also be built in advance for each user, which reduces article classification errors for that user. Here the user can be identified by reporter name or e-mail address. That is, byline information is extracted from the entered articles and analyzed so that articles are classified by reporter name, e-mail address, or reader contributor, and this information can then be supplied when later articles are classified.

Byline information is extracted from the big-data article collection, which comprises many text files, and analyzed to store reporter names, e-mail addresses, and reader contributors; when erroneous characters are found later, they can be corrected using the stored bylines. Because the same reporters' names appear frequently in periodicals such as newspapers, a misrecognized reporter name can be corrected.
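The byline correction described above can be sketched as follows. The sketch assumes that a name seen at least `min_count` times is trustworthy and that a misrecognized name is close in spelling to a stored one; the threshold values, the function names, and the use of the standard-library `difflib` matcher are illustrative assumptions, not the disclosed method.

```python
import difflib
from collections import Counter


def build_byline_table(bylines, min_count=2):
    """Collect bylines that occur often enough to be treated as known names."""
    counts = Counter(b.strip() for b in bylines if b.strip())
    return {name for name, count in counts.items() if count >= min_count}


def correct_byline(byline, known, cutoff=0.8):
    """Replace a byline with the closest known name, if one is close enough."""
    if byline in known:
        return byline
    matches = difflib.get_close_matches(byline, sorted(known), n=1, cutoff=cutoff)
    return matches[0] if matches else byline
```

A byline with no sufficiently close match is returned unchanged, so a new reporter's name is not forced onto an existing one.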

In one embodiment of the present invention, for a paper newspaper, the text file with its title, article classification, body, and byline may be converted into NewsML (S170). For example, the text, the working XML, and the like may be combined with basic environment information and converted into NewsML: tags are extracted from items such as the title, classification, and byline, and a table is then built. More specifically, the title and/or the classification (class) are extracted and checked, and a byline check list is generated. The generated list is checked for errors, and if there are none the process moves on to the table-building step. In the table-building step, a table is created that can correct the erroneous reporter names or e-mail addresses found during content review, so that errors occurring later are corrected automatically.
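Combining article fields with basic environment information into XML can be sketched with the standard-library `xml.etree.ElementTree`. The element and attribute names below are hypothetical placeholders and do not follow the actual NewsML schema; the sketch only shows the general shape of the conversion.

```python
import xml.etree.ElementTree as ET


def article_to_xml(article, environment):
    """Serialize one article plus environment information as an XML string.

    `article` maps field names (title, category, byline, body) to text;
    `environment` maps attribute names (e.g. paper, date, page) to text.
    Element names are placeholders, not NewsML elements.
    """
    root = ET.Element("article")
    for key, value in environment.items():
        root.set(key, value)
    for field in ("title", "category", "byline", "body"):
        child = ET.SubElement(root, field)
        child.text = article.get(field, "")
    return ET.tostring(root, encoding="unicode")
```

The serializer escapes characters such as `&` automatically, so titles and bodies can be passed through as plain text.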

NewsML (News Markup Language) is a standard format, based on Extensible Markup Language (XML), for defining news. It is an XML-based standard for managing the life cycle of all forms of multimedia news, from production through storage to delivery.

Afterward, once the NewsML and the PDF files of the scanned images are finalized, various checks are run and lists are generated in order to extract data inventories. More specifically, the following are generated: a NewsML article entry list, a PDF file list, a NewsML+PDF cross list, a full NewsML list, a byline extraction, an extraction of articles without bylines, the NewsML base edition, the PDF base edition, and a monthly report (PDF + text + character count). Here, PDF refers to a file produced by converting the paper newspaper to PDF. These data lists enable statistical analysis and the use of big data.

Meanwhile, in one embodiment of the present invention, page mapping information may be generated for serving the content on the web. To this end, the text and image areas are recognized and stored as coordinate information. That is, coordinate information for the text areas (segments) on the page is generated, and article images or PDFs cropped according to those coordinates are produced, giving an experience like reading the paper newspaper on the web. Here, cropping means extracting the paper document as an image.

FIG. 11 is a view for explaining the cropping of a paper newspaper in connection with the present invention. Referring to FIG. 11, the scanned image is first converted into a PDF file, the coordinates of the group segments are generated on the PDF image, and the text file corresponding to the coordinates of each generated group segment is then selected from the files in the NewsML and registered.

More specifically, the digital coordinates are linked to the NewsML articles, and coordinate information is generated from the image to be cropped, such as a PDF or JPG. The article is selected from the article list on the right side of FIG. 11, and the corresponding block segment is then selected with the mouse on the left-hand screen. When a mapping is selected in this way, the mapped article number is registered in the list, and when all mappings are finished the page is marked complete.

Because the digital page being processed has already been divided into block segments, the operator only needs to select them. That is, since block and group segmentation has already been performed, selecting a particular article on the digital page displays the corresponding area of the paper newspaper as is.

Meanwhile, a newspaper publisher sometimes already holds PDF files, and in that case unwanted line breaks may be introduced when text is copied. When the article area of a paper newspaper is copied as text from a PDF file, the line breaks are detected in the clipboard content before pasting, the clipboard content is replaced, and the replaced text is pasted into the input tool. For example, the line breaks may be cleaned up using the sentence ending “~다.” as the reference. Correcting automatically inserted line breaks in this way improves readability for the reader.
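The line-break cleanup can be sketched as a plain text transformation. The sketch assumes that wrapped lines are rejoined without an inserted space and that a line break is kept only after a sentence ending in “다.”; clipboard access is left out, and the function name `reflow` is illustrative.

```python
def reflow(text, sentence_end="다."):
    """Join hard-wrapped lines, keeping a break only after a sentence ending."""
    sentences = []
    buffer = ""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        buffer += line  # rejoin the wrapped fragment without adding a space
        if buffer.endswith(sentence_end):
            sentences.append(buffer)
            buffer = ""
    if buffer:
        # Trailing text that does not end with the sentence ending is kept.
        sentences.append(buffer)
    return "\n".join(sentences)
```

Whether a space should be restored at each join depends on where the PDF wrapped the line, so a production version would need a rule for that; this sketch simply concatenates.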

The above detailed description should not be construed as limiting in any respect and should be considered illustrative. The scope of the present invention should be determined by a reasonable interpretation of the appended claims, and all changes within the equivalent scope of the present invention are included in the scope of the present invention.

Claims

Obtaining a paper document to be digitized;
Scanning the acquired paper document to obtain a scanned image of the paper document;
A segmentation step of dividing the scanned image into a text area and an image area and, in the text area, sequentially forming: unit segments, each being a minimum unit of mutually connected pixels; character segments, each formed by combining one or more of the unit segments; character row segments, each formed by combining one or more of the character segments; block segments, each formed by combining one or more of the character row segments; and group segments, each formed by combining one or more of the block segments;
Converting the contents of the paper document into a text file by extracting characters from the segmented scanned image, and converting the image area into an image file; And
And generating an Extensible Markup Language (XML) using the text file,
Wherein segmenting into the character line segments comprises:
Obtaining a reference width (w ₀ ) and a reference height (h ₀ ) of the character segment according to a size of the plurality of character segments;
Aligning the plurality of character segments in a first direction;
Constructing a character row segment by selecting character segments within a predetermined size range in consideration of the width and height of the character segments considering the reference width w ₀ and the reference height h ₀ ; And
Forming a new character row segment when the spacing between adjacent character segments in the aligned sequence is greater than or equal to a predetermined interval along the first direction, or when the width or height of the character segment does not overlap the character row segment along a second direction perpendicular to the first direction.

The method according to claim 1,
Wherein segmenting into unit segments comprises:
Obtaining a pixel block including neighboring pixels located in the four neighborhoods (upper, lower, left, and right) of a given pixel within a predetermined rectangular area;
Determining the pixel block to be a valid pixel block when it contains a predetermined number of pixels or more, and obtaining a plurality of the valid pixel blocks to form a set of pixel blocks; And
Determining the set of pixel blocks to be a unit segment if the length of the set of pixel blocks is greater than or equal to a predetermined length.

3. The method of claim 2,
Wherein segmenting into the character segments comprises:
Selecting a first unit segment and a second unit segment adjacent to the first unit segment in the set of unit segments including the unit segments;
Determining whether the first unit segment and the second unit segment belong to the same character segment according to expressions (1) to (4) below, where $S_1$ is the area of the first unit segment, $S_2$ is the area of the second unit segment, $S_I$ is the area of overlap between the first unit segment and the second unit segment, and $S_u$ is the area of the two unit segments combined.
$S_I \ge k_1 S_1$ ------------------- (1)
$S_I \ge k_1 S_2$ ------------------- (2)
$S_1$ ? $k_2 S_u$ ------------------- (3)
$S_2$ ? $k_2 S_u$ ------------------- (4)
In the above expressions, $k_1$ and $k_2$ are preset constants determined according to the typeface.

delete

The method according to claim 1,
Wherein segmenting into the block segments comprises:
Storing character segments in different formats according to the width and height of the character segments;
Repeating combining character string segments having the same length within a predetermined range while the width and height of the character segments are within a predetermined range;
The step of combining the character line segments and the character line segments spaced apart from the neighboring character line segments by more than a predetermined interval along the second direction form another block segment; And
And processing the character line segments having different widths and heights of the character segments into different blocks.

6. The method of claim 5,
Wherein segmenting into the group segments comprises:
Forming into the same group a block segment that is located below a first block segment and at least partially overlaps the first block segment along the second direction, the first block segment being the one that includes a first character segment having the largest character-segment width and height among the adjacent block segments; And
A second block segment or dividing line spaced apart from the block segment formed in the same group by a predetermined interval or along the first direction or the second direction or including the first character segment among the block segments formed in the same group, And grouping the intersecting block segments into different groups.

The method according to claim 6,
Grouping the image segments located between the grouped block segments into the same group.

8. The method of claim 7,
After the segmenting step,
Further comprising the step of correcting the scanned image by rotating the scanned image to the correct position when the scanned image is scanned in a skewed manner.

9. The method of claim 8,
Wherein the modifying the scan image comprises:
Determining one of the unit segments as a segment if the width or the height is constant and either one of the width and height is longer than a predetermined length;
Determining whether the line segment is formed adjacent to a first axis of the scan image and a second axis perpendicular to the first axis;
Calculating an angle between the line segment and an axis adjacent to the line segment on the scan image of the first axis and the second axis; And
And rotating the scan image by the calculated angle to match the line segment and the adjacent axis.

8. The method of claim 7,
If the width and the height of the unit segment are greater than a predetermined size, excluding the character string segment from the image segment.

11. The method of claim 10,
Wherein the paper document is a paper newspaper.

12. The method of claim 11,
Wherein the unit segment, the character segment, the character row segment, the block segment, and the group segment have coordinate information, and the group segment forms a polygon according to the coordinate information.

13. The method of claim 12,
Wherein the group segment is formed as a closed area formed by a path that does not overlap while moving along the outermost part of the adjacent block segment.

14. The method of claim 13,
Wherein the text file includes a title, an article classification, a body text, and a byline, the title, the article classification, the body text, and the byline being separated by respective delimiters, and
Wherein the step of generating an Extensible Markup Language (XML) using the text file comprises generating News Markup Language (NewsML) by combining the Extensible Markup Language and basic environment information.

15. The method of claim 14,
Converting the scan image into a PDF file;
Generating coordinates of the group segment on the PDF image; And
And selecting and registering a text file corresponding to the coordinates of the generated group segment among the files in the NewsML.

The method according to claim 1,
Wherein the conversion into the text file is performed by an optical character recognition (OCR) program.