KR20220168787A

KR20220168787A - Method to extract units of Manchu characters and system

Info

Publication number: KR20220168787A
Application number: KR1020210078719A
Authority: KR
Inventors: 이충호
Original assignee: 한밭대학교 산학협력단
Priority date: 2021-06-17
Filing date: 2021-06-17
Publication date: 2022-12-26

Abstract

The present invention relates to a method of extracting Manchu characters. According to one embodiment of the present invention, the method comprises: a binarization step of binarizing document data including at least one character string; a first projection step of deriving pixel counts by projecting the binarized document data in a first direction; a character string extraction step of deriving a first point at which the pixel count derived in the first projection step meets a first criterion, and extracting a character string by dividing the document data at the first point; a second projection step of deriving pixel counts by projecting the extracted character string in a second direction; a word extraction step of deriving a second point at which the pixel count derived in the second projection step meets a second criterion, and extracting words by dividing the character string at the second point; a third projection step of deriving pixel counts by projecting the extracted words in the second direction; and a character extraction step of deriving a third point at which the pixel count derived in the third projection step meets a third criterion, and extracting characters by dividing the word at the third point. According to the present invention, Manchu characters are extracted efficiently and with relatively few extraction errors.

Description

Manchu character extraction method and system for performing the same {Method to extract units of Manchu characters and system}

The present invention relates to a method for extracting Manchu characters.

Recently, technologies have come into use that recognize text from image data generated by scanning or photographing a document, or that recognize text in real time while a camera is capturing. Korean Patent Publication No. 10-2017-0032347, a prior art document, discloses a character recognition technology that recognizes characters by detecting character boundaries based on a predetermined width value.

Such conventional character recognition technology is applicable when characters are written at regular intervals or separated by spaces. The prior art document likewise binarizes the document data and then takes the points where the pixel value is 0 as boundaries.

Manchu is written vertically, and the characters within a word are joined without spaces. A preprocessing stage is therefore required before recognition, to separate the character regions and to separate the units that make up each character. With conventional technologies such as the prior art document above, it is impossible to extract individual characters from a Manchu document written continuously without spaces.

Korean Patent Publication No. 10-2017-0032347

An object of the present invention is to provide a Manchu character extraction method that can extract each character unit from document data written in Manchu.

A further object is for the extraction of Manchu characters to be efficient and to produce few errors.

A Manchu character extraction method according to an embodiment of the present invention includes: a binarization step of binarizing document data including at least one character string; a first projection step of deriving pixel counts by projecting the binarized document data in a first direction; a character string extraction step of deriving a first point at which the pixel count derived in the first projection step meets a first criterion, and extracting a character string by dividing at the first point; a second projection step of deriving pixel counts by projecting the character string extracted in the character string extraction step in a second direction; a word extraction step of deriving a second point at which the pixel count derived in the second projection step meets a second criterion, and extracting a word by dividing at the second point; a third projection step of deriving pixel counts by projecting the word extracted in the word extraction step in the second direction; and a character extraction step of deriving a third point at which the pixel count derived in the third projection step meets a third criterion, and extracting a character by dividing at the third point.

The first point may be a point where the pixel count is 0.

The second point may be a point where the pixel count is 0.

The third point may be a point where the pixel count exceeds 0 and is less than or equal to a specific value.

The character extraction step may include: identifying a central axis region of the Manchu script; and excluding a point from the third point when the pixel count in the central axis region is 0.

A Manchu character extraction system according to an embodiment of the present invention includes a document data input unit and a calculation unit.

The document data input unit receives document data including at least one character string, and the control unit performs the binarization step, the first projection step, the character string extraction step, the second projection step, the word extraction step, the third projection step, and the character extraction step.

The Manchu character extraction method according to an embodiment of the present invention can extract each character unit from document data written in Manchu.

FIG. 1 shows Manchu document data.
FIG. 2 shows the pixel values obtained when the Manchu document data of FIG. 1 is binarized and then projected in the x-axis direction.
FIG. 3 shows a character string extracted from FIG. 1.
FIG. 4 shows a word extracted from FIG. 3.
FIG. 5 shows characters extracted from FIG. 4.
FIG. 6 illustrates a Manchu character extraction method according to an embodiment of the present invention.
FIG. 7 illustrates a Manchu character extraction system according to an embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention are described with reference to the accompanying drawings. The embodiments may be modified into various other forms, and the scope of the present invention is not limited to the embodiments described below. The embodiments are provided to explain the present invention more completely to those of ordinary skill in the art. Accordingly, the shapes and sizes of elements in the drawings may be exaggerated for clarity, and elements denoted by the same reference numerals in the drawings are the same elements. The same reference numerals are also used throughout the drawings for parts having similar functions and actions. In addition, throughout the specification, stating that a part "includes" a component means that it may further include other components, not that other components are excluded, unless specifically stated otherwise.

FIG. 6 illustrates a Manchu character extraction method according to an embodiment of the present invention.

Referring to FIG. 6, a Manchu character extraction method according to an embodiment of the present invention includes: a binarization step of binarizing document data including at least one character string; a first projection step of deriving pixel counts by projecting the binarized document data in a first direction; a character string extraction step of deriving a first point at which the pixel count derived in the first projection step meets a first criterion, and extracting a character string by dividing at the first point; a second projection step of deriving pixel counts by projecting the character string extracted in the character string extraction step in a second direction; a word extraction step of deriving a second point at which the pixel count derived in the second projection step meets a second criterion, and extracting a word by dividing at the second point; a third projection step of deriving pixel counts by projecting the word extracted in the word extraction step in the second direction; and a character extraction step of deriving a third point at which the pixel count derived in the third projection step meets a third criterion, and extracting a character by dividing at the third point.

Methods according to embodiments of the present invention may be performed by a Manchu character extraction system. FIG. 7 shows a Manchu character extraction system according to an embodiment of the present invention. Referring to FIG. 7, the system includes a document data input unit and a calculation unit. The system may be a computer, notebook computer, portable terminal, server, or similar device ordinarily used to store information and perform computation.

The document data input unit may receive document data containing Manchu from an external source. The document data includes at least one character string in Manchu and may be generated by scanning a Manchu document or photographing it with a camera. Alternatively, it may be document data that is transmitted in real time while a camera is capturing and is stored temporarily or permanently in a memory chip of a computer. The document data input unit is not particularly limited as long as it can receive data.

The calculation unit processes the received document data and performs computation on it. These operations may be carried out by a processor or semiconductor chip of a computer executing a program stored on a computer-readable storage medium. The calculation unit includes at least one processor or semiconductor chip and may be of the kind that ordinarily executes software programs and processes information to provide various information to a user.

Binarization means representing image data obtained by scanning or by photographing with a camera by dividing it according to a fixed criterion. For example, picture or document data may be classified as 'black' or 'white' and represented as bit data of '0' and '1', or '1' and '0', respectively. Binarization can be performed by any binarization program commonly used and known in the art and is not particularly limited. After binarization, the number of pixels at each point or in each region can be counted.

In the embodiment of the present invention, the image was inverted so that the character regions become white and the background becomes black, giving the character regions gray level 255 and the background gray level 0. Salt-and-pepper noise was then removed by erosion and dilation.
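The binarization, inversion, and noise-removal operations described above can be sketched as follows. This is only an illustration in plain Python: the threshold of 128, the 3x3 neighbourhood, and the function names `binarize_inverted` and `remove_specks` are assumptions, not values or names given in the specification. The sketch applies erosion followed by dilation, which removes isolated foreground specks; filling small holes inside strokes would need the reverse order.

```python
def binarize_inverted(gray, threshold=128):
    """Map dark ink (below the assumed threshold) to 255 and background to 0."""
    return [[255 if px < threshold else 0 for px in row] for row in gray]


def _morph(img, pick):
    """Apply `pick` (min = erosion, max = dilation) over each 3x3 window.

    Windows are clipped at the image border.
    """
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [
                img[yy][xx]
                for yy in range(max(0, y - 1), min(h, y + 2))
                for xx in range(max(0, x - 1), min(w, x + 2))
            ]
            row.append(pick(window))
        out.append(row)
    return out


def remove_specks(binary):
    """Erosion followed by dilation (a morphological opening)."""
    return _morph(_morph(binary, min), max)
```

A practical implementation would more likely rely on an image-processing library; the pure-Python version is used here only to keep the sketch self-contained.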

The first projection step and the character string extraction step count the pixels of the binarized document data sequentially along one direction and extract character strings based on the pixel counts. Referring to FIG. 1, because Manchu is written vertically, projecting in the horizontal x-axis direction yields the first points that separate the character strings.

FIG. 2 is a graph of the cumulative pixel counts obtained by binarizing the document data of FIG. 1 and projecting it in the x-axis direction. Referring to FIG. 2, points where the pixel count is 0 appear at regular intervals, and these points coincide with the boundaries of the character strings. More specifically, the vertical separation points in FIG. 2 are 6, 44, 78, 112, 146, 180, 214, 248, 281, 316, 350, 383, 417, and 453, so the document can be separated into 13 columns in total.
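As a quick check of the figures quoted for FIG. 2, the 14 separation points can be paired into column ranges. The snippet assumes that each consecutive pair of points delimits one column; the specification lists the points but does not state this pairing explicitly.

```python
# Separation points along the x-axis as listed for FIG. 2.
boundaries = [6, 44, 78, 112, 146, 180, 214, 248, 281, 316, 350, 383, 417, 453]

# Assumption: consecutive points (left, right) delimit one text column.
columns = list(zip(boundaries, boundaries[1:]))
widths = [right - left for left, right in columns]

print(len(columns))  # 13
print(widths)
```

Fourteen points give 13 ranges, which matches the 13 columns stated in the text.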

In this step, points where the pixel count is 0 are determined to be first points, and each character string is extracted by dividing the document at these points. FIG. 3(a) shows one of the character strings extracted in this way.
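A minimal sketch of the first projection step and the character string extraction step follows. It assumes the binarized image is a list of rows holding 0 for background and a non-zero value for ink, and it interprets the projection as a per-column count of ink pixels along the x-axis. The names `column_profile`, `split_on_zero`, and `extract_columns` are hypothetical.

```python
def column_profile(binary):
    """Count foreground (non-zero) pixels in each column x."""
    return [sum(1 for row in binary if row[x]) for x in range(len(binary[0]))]


def split_on_zero(profile):
    """Return (start, end) index pairs of runs where the count is non-zero.

    Positions with a count of 0 play the role of the first points.
    """
    runs, start = [], None
    for i, count in enumerate(profile):
        if count > 0 and start is None:
            start = i                      # a text column begins
        elif count == 0 and start is not None:
            runs.append((start, i))        # a zero-count gap closes it
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def extract_columns(binary):
    """Cut the image into vertical strips, one per character string."""
    return [
        [row[a:b] for row in binary]
        for a, b in split_on_zero(column_profile(binary))
    ]
```

Each returned strip can then be passed to the word extraction step.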

The second projection step and the word extraction step count the pixels of each character string sequentially along one direction and extract words based on the pixel counts. Referring to FIG. 3(a), a character string in Manchu is made up of several distinct words. Because each word is written vertically, projecting in the vertical y-axis direction yields the second points that separate the words.

FIG. 3(b) marks the second points on FIG. 3(a). When FIG. 3(a) is projected in the y-axis direction, a point where the pixel count is 0 appears between each pair of words, and these points coincide with the word boundaries.

In this step, points where the pixel count is 0 are determined to be second points, and each word is extracted by dividing the character string at these points. FIG. 4(a) shows one of the words extracted in this way.
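The word extraction step can be sketched in the same spirit, this time counting ink pixels per row of a single column strip. The data layout (a list of rows, 0 for background) and the names `row_profile` and `extract_words` are assumptions made for illustration.

```python
from itertools import groupby


def row_profile(strip):
    """Count foreground (non-zero) pixels in each row y of a column strip."""
    return [sum(1 for px in row if px) for row in strip]


def extract_words(strip):
    """Split a column strip into words at rows whose pixel count is 0."""
    words, y = [], 0
    # Group consecutive rows by whether they contain any ink.
    for has_ink, group in groupby(row_profile(strip), key=lambda c: c > 0):
        n = len(list(group))
        if has_ink:
            words.append(strip[y:y + n])
        y += n
    return words
```

Runs of empty rows of any height are treated as a single gap, so wide and narrow word spacing are handled the same way.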

The third projection step and the character extraction step count the pixels of each word sequentially along one direction and extract characters based on the pixel counts. Referring to FIG. 4(a), a word in Manchu consists of several characters joined in sequence. Because the characters are written vertically, projecting in the vertical y-axis direction yields the third points that separate the characters. Unlike in the preceding steps, however, the characters within a Manchu word are written joined together, and a central axis runs through the middle of the word. The third point therefore cannot be derived simply from points where the pixel count is 0.

In this step, the third point may be a point where the pixel count exceeds 0 and is less than or equal to a specific value. That is, upper and lower bounds on the pixel count are set in advance, and a position where the pixel count obtained by projecting the word in the y-axis direction falls within those bounds can be derived as a third point.
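The criterion for the third point can be illustrated as a simple range test on the per-row pixel count. The upper bound is a parameter here because the specification does not give a value. Collapsing adjacent candidate rows into a single cut row is an added assumption and is not described in the specification; both function names are hypothetical.

```python
def third_point_candidates(word, upper):
    """Rows whose foreground pixel count is greater than 0 and at most `upper`."""
    return [
        y for y, row in enumerate(word)
        if 0 < sum(1 for px in row if px) <= upper
    ]


def collapse_runs(rows):
    """Assumed post-processing: reduce each run of adjacent rows to its middle row."""
    cuts, run = [], []
    for y in rows:
        if run and y != run[-1] + 1:
            cuts.append(run[len(run) // 2])
            run = []
        run.append(y)
    if run:
        cuts.append(run[len(run) // 2])
    return cuts
```

With `upper` set close to the stroke width of the central axis, rows where only the axis is present become candidates, while rows crossed by the body of a character are skipped.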

FIG. 4(b) marks the points where the pixel count falls within the set range, derived as third points, and FIGS. 5(a) to 5(d) show the individual characters extracted by the method described above. More specifically, the vertical separation points in FIG. 4(b) are 7, 13, 20, and 30, which allows separation into a total of four columns.

Referring to the lower part of the word shown in FIG. 4(a), the pixel count may fall within the set range even in a region that has no central axis. To address this, the character extraction step may include a step of excluding a point from the third points when the pixel count in the central axis region of the Manchu script is 0 or falls outside a certain range (the pixel count range used as the criterion for determining the third point).

The step of identifying the central axis region may include receiving the position of the central axis from a user, or deriving the coordinates of the central axis by computation based on the overall width of the word and the pixel counts. With this approach, when the pixel count on the central axis is 0 or outside the set range, the row can be excluded from the third points even if its pixel count falls within the range used as the criterion for the third point, so characters can be extracted more accurately.
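The central-axis check can be sketched as a filter over candidate rows. The specification leaves the axis computation open, so `estimate_axis` below uses an assumed heuristic (the column containing the most ink pixels), and the `half_width` band around the axis is likewise an assumption. Only the zero-count exclusion is shown; the range-based exclusion would be an additional comparison on the same band.

```python
def estimate_axis(word):
    """Assumed heuristic: take the column with the most foreground pixels."""
    width = len(word[0])
    counts = [sum(1 for row in word if row[x]) for x in range(width)]
    return counts.index(max(counts))


def filter_by_axis(word, candidates, axis_x, half_width=1):
    """Drop candidate rows that have no foreground pixel in the axis band."""
    lo = max(0, axis_x - half_width)
    hi = min(len(word[0]), axis_x + half_width + 1)
    return [y for y in candidates if any(word[y][lo:hi])]
```

For example, a thin tail stroke that lies away from the axis has a low pixel count but no ink in the axis band, so it is not treated as a character boundary. The axis position could equally be supplied by a user, as the text allows.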

The present invention is not limited by the embodiments described above or by the accompanying drawings, but is intended to be limited by the appended claims. Accordingly, various substitutions, modifications, and changes may be made by those of ordinary skill in the art without departing from the technical spirit of the present invention as set forth in the claims, and these also fall within the scope of the present invention.

10: Manchu document data

Claims

a binarization step of binarizing document data including at least one character string;
a first projection step of deriving the number of pixels by projecting the binarized document data in a first direction;
a character string extraction step of deriving a first point at which the number of pixels derived in the first projection step meets a first criterion, and extracting a character string by dividing at the first point;
a second projection step of deriving the number of pixels by projecting the character string extracted in the character string extraction step in a second direction;
a word extraction step of deriving a second point at which the number of pixels derived in the second projection step meets a second criterion, and extracting a word by dividing at the second point;
a third projection step of deriving the number of pixels by projecting the word extracted in the word extraction step in the second direction; and
a character extraction step of deriving a third point at which the number of pixels derived in the third projection step meets a third criterion, and extracting a character by dividing at the third point;
Manchu character extraction method.

According to claim 1,
The first point is a point where the number of pixels is 0,
Manchu character extraction method.

According to claim 1,
The second point is a point where the number of pixels is 0,
Manchu character extraction method.

According to claim 1,
The third point is a point where the number of pixels exceeds 0 and is less than or equal to a specific value,
Manchu character extraction method.

According to claim 1,
The character extraction step comprises:
identifying the central axis region of the Manchu script; and
excluding a point from the third point when the number of pixels in the central axis region of the Manchu script is 0;
Manchu character extraction method.

Including a document data input unit and a calculation unit,
The document data input unit performs a step of receiving document data including at least one character string;
The control unit performs the binarization step, the first projection step, the string extraction step, the second projection step, the word extraction step, the third projection step and the letter extraction step,
Manchu character extraction system.

According to claim 6,
In the binarization step, the control unit binarizes the document input from the document data input unit.
Manchu character extraction system.

According to claim 6,
In the first projection step, the control unit projects the binarized document data in a first direction to derive the number of pixels.
Manchu character extraction system.

According to claim 6,
In the character string extraction step, the control unit derives a first point at which the number of pixels derived in the first projection step meets a first criterion, and extracts a character string by dividing at the first point,
Manchu character extraction system.

According to claim 6,
In the second projection step, the control unit derives the number of pixels by projecting the character string extracted in the character string extraction step in a second direction,
Manchu character extraction system.

According to claim 6,
In the word extraction step, the control unit derives a second point at which the number of pixels derived in the second projection step meets a second criterion, and extracts a word by dividing at the second point,
Manchu character extraction system.

According to claim 6,
In the third projection step, the control unit derives the number of pixels by projecting the word extracted in the word extraction step in a second direction,
Manchu character extraction system.

According to claim 6,
In the character extraction step, the control unit derives a third point at which the number of pixels derived in the third projection step meets a third criterion, and extracts a character by dividing at the third point,
Manchu character extraction system.

According to claim 13,
The third point is a point where the number of pixels exceeds 0 and is less than or equal to a specific value,
Manchu character extraction system.

According to claim 13,
The character extraction step comprises:
identifying the central axis region of the Manchu script; and
excluding a point from the third point when the number of pixels in the central axis region of the Manchu script is 0;
Manchu character extraction system.

A program stored on a computer-readable storage medium which, when executed by a processor, performs the method of claim 1.