KR20030018519A

KR20030018519A - The Easy Chinese Character Input and Correction Method using Image Retrieval Technologies

Info

Publication number: KR20030018519A
Application number: KR1020010052674A
Authority: KR
Inventors: 정경택
Original assignee: 서치캐스트 주식회사
Priority date: 2001-08-30
Filing date: 2001-08-30
Publication date: 2003-03-06

Abstract

PURPOSE: A method for inputting and correcting Chinese characters is provided that lets users enter Chinese characters easily, without specialist knowledge such as classical Chinese literature, by extracting feature information from images and searching by similarity comparison of that information. CONSTITUTION: A scanning device (11) inputs the data of an old document. An image duplication checking device (12) measures image similarity to avoid duplication among the scanned data. An image correction device (13) removes noise from a scanned image so that information can be extracted accurately, and aligns the page horizontally and vertically. A segmentation device (14) cuts out the characters one by one from the text in a scanned page image and creates character image information. An image storing device (15) stores the image of each extracted character in storage. An image indexing device (16) extracts and stores feature information representing each stored character image. An image searching device (17) lists characters in order of similarity to a given character and displays the similarity information. An automatic code assigning device (18) automatically assigns a character code to characters in the search result that fall within a given similarity range. A manual code assigning device (19) assigns a code value to multiple characters designated by the user in the search result. A code value correction device (20) checks characters that already have a character code for errors: given a code value, it displays the corresponding character images and code values so that wrong code values can be corrected.

Description

Easy Chinese Character Input and Correction Method using Image Retrieval Technologies

The present invention relates to a method that automates the work of entering Chinese-character-centered Korean studies materials, such as old documents, into a computer such as a PC as electronic files in order to digitize them. It does so by applying image retrieval technology that uses the feature information of images, and it also makes the data easier to enter.

In addition, the correction method for the Chinese character input results does not compare the original text with the entered text character by character. Instead, it gathers and compares all the segmented character images that have already been entered under the same Chinese character code, so that incorrectly entered characters are easy to find.

Existing automated input methods have tried to use optical character recognition (OCR) to greatly reduce the manual work. In practice, however, because of the technical limits of current OCR technology and the limited processing capability of OCR algorithms, they have not produced good results for Chinese character materials.

Because Chinese character materials were all written by hand, the handwriting of the same character differs from writer to writer, and because the materials have been stored and handed down for decades or centuries, many parts of the original text are damaged. Accurate recognition by OCR is therefore difficult. Moreover, Chinese characters come in a variety of script styles, which are very hard to handle with OCR.

Besides a way to enter large volumes of classical Chinese texts automatically, an efficient proofreading method is needed to check whether the content was entered correctly and to fix it. Until now, the only method has been to check each character one by one by hand and eye.

Accordingly, an object of the present invention is to provide a method that allows Chinese characters to be entered easily, without specialist knowledge such as classical Chinese literature, by means of a retrieval technique based on extracting feature information from images and comparing their similarity.

Another object of the present invention is to provide a method that reduces the errors that even experts frequently make in recognition or reading, by applying image processing to gather identical and similar characters so that they can be compared at a glance.

Another object of the present invention is to provide a method that drastically reduces input effort. The conventional Chinese character code input method enters one character at a time, so the input man-hours grow in proportion to the number of characters, whereas the present technique assigns a code value simultaneously to some or all of the characters selected from a search result.

Another object of the present invention is to provide an automatic code input method that can substitute for OCR, by automatically assigning code values to characters whose similarity value, as measured by image similarity comparison, is at or below a certain threshold.

FIG. 1 is a diagram showing the functional flow of the image-based input correction system;

FIG. 2 is a diagram showing the results produced by the image duplication checking device;

FIG. 3 is a diagram showing the vertical spectrum analysis of the image correction device;

FIG. 4 is a diagram showing how the image segmentation device separates lines from an image and then separates individual characters;

FIG. 5 is a diagram showing an example of the index information generated by the image indexing device;

FIG. 6 is a diagram showing an example of a user interface screen for the image retrieval device;

FIG. 7 is a diagram showing an example of a user interface screen of the automatic code assigning device;

FIG. 8 is a diagram showing an example of a user interface screen of the manual code assigning device;

FIG. 9 is a diagram showing an example of a user interface screen of the code value correction device;

FIG. 10 is a diagram showing one preferred example in which all of these devices are implemented and operated within a computer;

FIG. 11 is a graph of actual Chinese character frequency measurements, showing the effect of the image processing.

According to the present invention for achieving the above objects, an image-retrieval-based Chinese character input and correction system is provided. The system includes: a scanning device for inputting the data of the old document to be entered; an image duplication checking device that measures image similarity to avoid duplication among the many scanned materials; an image correction device that removes noise and aligns the page horizontally and vertically so that information can be extracted accurately from the scanned image; a segmentation device that generates character image information by cutting out characters one by one from the sentences or groups of characters contained in a scanned page image; a storage device for storing the image of each extracted character; an image indexing device that extracts and keeps feature information representing each image of the stored characters; an image retrieval device that lists characters in order starting from the one most similar to a given character (image) and shows the similarity information; an automatic code assigning device that automatically assigns a character code to characters in the search result that fall within a given range of similarity values; a manual code assigning device that assigns code values to multiple characters designated by the user in the search result; and a correction device that, in order to check for errors among characters whose code values have already been assigned, takes a code value as input and displays the corresponding character images and code values so that incorrect code values can be corrected.

Preferably, the scanning device for inputting the old document data stores the data in the file system of the storage device, one page of the old document at a time and in a black-and-white representation.

More preferably, the files of scanned image information may be managed directly in the internal memory of the image-based Chinese character input and correction system, so that the user can process them within the system without separate storage, management, and transfer steps.

Preferably, the duplication checking device for eliminating duplicates among the scanned data makes its decision by similarity measurement based on image retrieval technology, so that large volumes of image data can be managed and processed efficiently. When several operators carry out the scanning, imprecise process management can lead to the same image being stored more than once; the device extracts such duplicates from the database or other store so that only duplicate-free image data is kept.

Here, various generally known algorithms can be applied as the image retrieval technique. As a preferred implementation, the device can be built using a similarity measurement algorithm based on edge component extraction, or a Region Shape algorithm.

Preferably, the image correction device removes noise and aligns the page horizontally and vertically so that characters can be segmented accurately from an image scanned one page at a time. The purpose of the device is to keep the various kinds of damage caused by long storage and neglect of old books from adversely affecting automatic image segmentation.

To level a scanned page image, the horizontal spectrum is analyzed; this is the sum, at each x coordinate of the image, of the pixel values (0 or 1) along the y axis. Because the characters are written in vertical columns, positions containing characters show high spectrum values, while the gap between one column and the next is empty, so the spectrum there takes its lowest values. If the scanned image is tilted, the spectrum along the x axis looks like a sine curve with little variation, or is flat. If the image is not tilted and is exactly level, the spectrum reaches both its highest and its lowest values.

Once this is done, the scanned page image is aligned horizontally and vertically. The image correction process is then completed by removing, manually or automatically, noise in areas unrelated to the characters.
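The leveling criterion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the page is a list of rows of 0/1 pixels, tilt is modeled by a simple row shear, and the names `column_profile`, `profile_contrast`, `shear_rows`, and `best_correction` are hypothetical. The idea is that the candidate correction that produces the largest peak-to-valley range in the per-column sums is taken as the one that levels the page.

```python
from typing import List, Sequence

Image = List[List[int]]  # rows of pixels, 1 = ink, 0 = background


def column_profile(image: Image) -> List[int]:
    """For each x coordinate, sum the pixel values along the y axis."""
    return [sum(row[x] for row in image) for x in range(len(image[0]))]


def profile_contrast(profile: Sequence[int]) -> int:
    """Peak-to-valley range; largest when the text columns are upright."""
    return max(profile) - min(profile)


def shear_rows(image: Image, shift_per_row: float) -> Image:
    """Hypothetical tilt model: shift row y sideways by round(y * shift_per_row)."""
    width = len(image[0])
    sheared = []
    for y, row in enumerate(image):
        shift = round(y * shift_per_row)
        new_row = [0] * width
        for x, value in enumerate(row):
            if value and 0 <= x + shift < width:
                new_row[x + shift] = 1
        sheared.append(new_row)
    return sheared


def best_correction(image: Image, candidates: Sequence[float]) -> float:
    """Pick the candidate shear that gives the highest profile contrast."""
    return max(candidates,
               key=lambda s: profile_contrast(column_profile(shear_rows(image, s))))


# Two upright text columns (x = 2..4 and x = 9..11) on a 12 x 24 page.
upright = [[1 if 2 <= x <= 4 or 9 <= x <= 11 else 0 for x in range(24)]
           for _ in range(12)]
tilted = shear_rows(upright, 0.5)

print(profile_contrast(column_profile(upright)))  # 12
print(profile_contrast(column_profile(tilted)))   # smaller than 12
print(best_correction(tilted, [-1.0, -0.5, -0.25, 0.0, 0.25, 0.5]))  # -0.5
```

A real scan would need true rotation and a finer search over angles; the shear is used here only to keep the sketch short.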

Preferably, the segmentation device, which generates the character image information contained in a scanned page image, cuts out the characters one by one from sentences or groups of characters. Along with each separated character image it carries the position information needed to restore the character's position on the original page.

More preferably, character segmentation is performed automatically, but manual work is sometimes needed, so a function is also provided for manually correcting segmentation errors produced by the automatic processing.

Preferably, the storage device for the separated character images stores each image in the computer's memory or in a file, or, when a large number of characters must be processed together, in a database or the like.

As described above, position information is stored along with each character so that its original position can be handled, and so that the order or meaning does not change when the original text is restored.
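One way to picture the stored record is shown below. The sketch assumes, purely for illustration, that each character image is kept with its page number, column index, and row index in reading order; the `GlyphRecord` fields, the `restore_text` helper, and the placeholder symbol are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class GlyphRecord:
    page: int                          # page the character was cut from
    column: int                        # column index in reading order
    row: int                           # position within the column
    box: Tuple[int, int, int, int]     # (x0, y0, x1, y1) on the page image
    code: Optional[str] = None         # character code, once assigned


def restore_text(records: List[GlyphRecord], placeholder: str = "□") -> str:
    """Rebuild the text in original order; uncoded glyphs show a placeholder."""
    ordered = sorted(records, key=lambda r: (r.page, r.column, r.row))
    return "".join(r.code if r.code is not None else placeholder for r in ordered)


# Records arrive in arbitrary order, for example after batch code assignment.
records = [
    GlyphRecord(page=1, column=1, row=1, box=(0, 30, 20, 50), code="黃"),
    GlyphRecord(page=1, column=0, row=0, box=(30, 0, 50, 20), code="天"),
    GlyphRecord(page=1, column=1, row=0, box=(0, 0, 20, 20)),
    GlyphRecord(page=1, column=0, row=1, box=(30, 30, 50, 50), code="地"),
]
print(restore_text(records))  # 天地□黃
```

Because the order is derived from the stored position rather than from the order of code assignment, batch or out-of-order coding does not disturb the restored text.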

Preferably, the image indexing device extracts, from the stored characters, feature information representing each character image and generates index information that makes similarity comparison possible. The index information should use features that let the same character be recognized as such even when its script style or shape differs slightly, while clearly distinguishing different characters. Among the possible implementations, a similarity measurement algorithm based on edge component extraction or a Region Shape algorithm can be applied to good effect.

More preferably, the index information produced by such an algorithm is multidimensional, expressed for example as vertical component x%, horizontal component y%, slash (/) diagonal component z%, and backslash (\) diagonal component w%. How similar two characters are is judged by how closely they agree in each of these components.

The index information determined in this way for each character image is also stored in database storage or in a file, and together forms the index information for all characters.
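A four-component index of the kind described above can be approximated as follows. This is a simplified stand-in, not the edge-extraction or Region Shape algorithm named in the text: it assumes a binary character image and counts pairs of adjacent ink pixels along each of the four directions, then reports each count as a percentage. The function name `direction_features` and the pair-counting rule are assumptions made for this sketch.

```python
from typing import List

Image = List[List[int]]  # rows of pixels, 1 = ink, 0 = background

# (dy, dx) step to the neighbouring pixel for each stroke direction,
# with y growing downward as in image coordinates.
DIRECTIONS = {
    "vertical": (1, 0),
    "horizontal": (0, 1),
    "slash": (1, -1),      # "/" : one step down moves one step left
    "backslash": (1, 1),   # "\" : one step down moves one step right
}


def direction_features(image: Image) -> List[float]:
    """Return [vertical %, horizontal %, slash %, backslash %]."""
    height, width = len(image), len(image[0])
    counts = dict.fromkeys(DIRECTIONS, 0)
    for y in range(height):
        for x in range(width):
            if not image[y][x]:
                continue
            for name, (dy, dx) in DIRECTIONS.items():
                ny, nx = y + dy, x + dx
                if 0 <= ny < height and 0 <= nx < width and image[ny][nx]:
                    counts[name] += 1
    total = sum(counts.values())
    return [100.0 * counts[name] / total if total else 0.0 for name in DIRECTIONS]


# A single horizontal stroke and a single "\" stroke on a 5 x 5 grid.
horizontal_bar = [[1 if y == 2 else 0 for _ in range(5)] for y in range(5)]
backslash_bar = [[1 if x == y else 0 for x in range(5)] for y in range(5)]

print(direction_features(horizontal_bar))  # [0.0, 100.0, 0.0, 0.0]
print(direction_features(backslash_bar))   # [0.0, 0.0, 0.0, 100.0]
```

Real strokes are several pixels thick, so every direction would receive some weight; the point of the sketch is only the shape of the resulting vector.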

Preferably, the image retrieval device that shows the similarity information returns, as an ordered result, the character images judged most similar to the character (image) presented by the user. Similarity is determined using the index information created by the image indexing device.

More preferably, various algorithms can be applied to determine similarity; a vector space model or the like is commonly used. To present the outcome, the characters are shown in order starting from the one judged most similar, and the measured similarity can be shown alongside to help the user decide.
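As a rough illustration of a vector space model, the sketch below ranks stored index vectors by cosine similarity to a query vector. Cosine similarity is one common choice and is an assumption here; the text does not say which measure is used. The glyph identifiers and the `rank_by_similarity` helper are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_similarity(query: Sequence[float],
                       index: Dict[str, Sequence[float]]) -> List[Tuple[float, str]]:
    """Return (similarity, glyph_id) pairs, most similar first."""
    scored = [(cosine_similarity(query, vector), glyph_id)
              for glyph_id, vector in index.items()]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Index vectors: [vertical %, horizontal %, slash %, backslash %].
index = {
    "g1": [50.0, 50.0, 0.0, 0.0],
    "g2": [0.0, 0.0, 50.0, 50.0],
    "g3": [60.0, 40.0, 0.0, 0.0],
}
ranked = rank_by_similarity([50.0, 50.0, 0.0, 0.0], index)
for similarity, glyph_id in ranked:
    print(f"{glyph_id}: {similarity:.3f}")
```

Returning the score together with the identifier matches the description's suggestion that the measured similarity be shown next to each result.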

Preferably, the automatic code assigning device, which operates on the search results, automatically assigns a character code to the character images that fall within a certain range defined by a threshold similarity value determined and entered in advance by the system user or administrator. This can be regarded as a new method capable of providing the same function as conventional optical character recognition (OCR).

More preferably, the user can adjust the threshold similarity value while using the system, according to the condition of the characters or the performance of the system, to obtain better results.
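Threshold-based assignment might look like the following sketch. It assumes a score where higher means more similar, so glyphs at or above the threshold receive the code; if the measure were a distance, as the phrase "at or below a certain threshold" elsewhere in the text may suggest, the comparison would be reversed. The function `auto_assign` and the data layout are hypothetical.

```python
from typing import Dict, List, Tuple


def auto_assign(ranked: List[Tuple[float, str]],
                code: str,
                threshold: float,
                assigned: Dict[str, str]) -> List[str]:
    """Give `code` to every uncoded glyph whose similarity reaches `threshold`.

    Glyphs that already have a code are left alone. The ids of the glyphs
    that stay uncoded are returned for manual review.
    """
    remaining = []
    for similarity, glyph_id in ranked:
        if glyph_id in assigned:
            continue
        if similarity >= threshold:
            assigned[glyph_id] = code
        else:
            remaining.append(glyph_id)
    return remaining


# Search results for a query image of the character 天.
ranked = [(0.99, "a"), (0.95, "b"), (0.80, "c"), (0.40, "d")]
assigned = {"b": "地"}  # "b" was coded earlier and must not be overwritten
left_over = auto_assign(ranked, "天", threshold=0.9, assigned=assigned)

print(assigned)   # {'b': '地', 'a': '天'}
print(left_over)  # ['c', 'd']
```

Raising or lowering `threshold` trades the number of automatically coded characters against the risk of miscoding, which is the adjustment the paragraph above leaves to the user.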

Preferably, the manual code assigning device provides a way to assign a code value to multiple characters that the user selects by hand from the search results. This is useful because other characters may be mixed in among the desired characters in the search results, and the desired characters sometimes appear far apart in the similarity order. The manual selection operations offered to the user may include selecting several items at once, deselecting several items at once, selecting items one at a time, and deselecting items one at a time.

Preferably, the correction device for fixing wrong code values checks for errors among characters whose character codes have already been assigned: when a character code value is entered, it displays the images of all characters that were given that same code. Because the user is looking at a set of supposedly identical characters, a wrong one is easy to spot and can be corrected immediately.

More preferably, wrong characters can be corrected easily using the same range of selection operations: selecting several characters at once, deselecting several at once, selecting one at a time, and deselecting one at a time.
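The correction workflow amounts to a lookup by code followed by a batch reassignment. The sketch below assumes the assignments are held in a plain mapping from glyph identifier to code; `glyphs_with_code` and `correct_codes` are hypothetical helper names, and the display of the images themselves is left out.

```python
from typing import Dict, Iterable, List


def glyphs_with_code(assigned: Dict[str, str], code: str) -> List[str]:
    """List every glyph that currently carries `code`, for side-by-side review."""
    return sorted(glyph_id for glyph_id, c in assigned.items() if c == code)


def correct_codes(assigned: Dict[str, str],
                  selected: Iterable[str],
                  new_code: str) -> None:
    """Reassign the glyphs the reviewer picked out as wrong."""
    for glyph_id in selected:
        assigned[glyph_id] = new_code


# Glyph "g3" is really 末 but was coded as the similar-looking 未.
assigned = {"g1": "未", "g2": "未", "g3": "未", "g4": "天"}

print(glyphs_with_code(assigned, "未"))  # ['g1', 'g2', 'g3']
correct_codes(assigned, ["g3"], "末")
print(glyphs_with_code(assigned, "未"))  # ['g1', 'g2']
```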

FIG. 1 shows a preferred example of an image-retrieval-based Chinese character input and correction system according to an embodiment of the present invention. The system broadly consists of a scanning device (11), an image duplication checking device (12), an image correction device (13), a segmentation device (14), an image storage device (15), an image indexing device (16), an image retrieval device (17), an automatic code assigning device (18), a manual code assigning device (19), and a code value correction device (20).

FIG. 2 shows an example of results actually produced by an implementation of the image duplication checking device (12); items 21 through 25 are scans of the same image, grouped together. The images are the results of several people each scanning the same material, so the content is the same but the position of the image differs from scan to scan.

FIG. 3 illustrates the function of the image correction device (13), showing an example in which tilt is measured, preferably, by vertical spectrum analysis. Measuring the vertical spectrum of image 31 gives a result like 32; such a flat spectrum indicates that the image is skewed. Measuring the vertical spectrum of image 33 gives a result like 34. The spectrum values obtained in this way can therefore be used to judge whether a scanned image is correctly aligned vertically and horizontally.

FIG. 4 shows the process by which the segmentation device (14) first separates the lines in an image and then separates the individual characters. Item 41 shows the vertical and horizontal spectra (projections) described for FIG. 3; the segmentation device first separates lines, as in 42 and 43, and then separates individual characters, as in 44 and 45.
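The two-stage separation can be sketched with projection profiles: columns are found where the per-column sum is nonzero, and within each column characters are found where the per-row sum is nonzero. This is a minimal illustration, assuming a clean, level, binary page and assuming that columns are read from right to left; `runs` and `segment_page` are hypothetical names.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]  # rows of pixels, 1 = ink, 0 = background
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end coordinates exclusive


def runs(profile: Sequence[int]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges where the profile is nonzero."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def segment_page(image: Image) -> List[Box]:
    """Split a page into character boxes: columns first, then characters."""
    height, width = len(image), len(image[0])
    column_profile = [sum(image[y][x] for y in range(height)) for x in range(width)]
    boxes = []
    # Assumption: vertical text is read from the rightmost column leftward.
    for x0, x1 in reversed(runs(column_profile)):
        row_profile = [sum(image[y][x0:x1]) for y in range(height)]
        for y0, y1 in runs(row_profile):
            boxes.append((x0, y0, x1, y1))
    return boxes


page = [
    [1, 1, 0, 0, 1, 1, 0],
    [1, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1, 0],
    [1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
print(segment_page(page))
# [(4, 0, 6, 2), (4, 3, 6, 4), (0, 0, 2, 2), (0, 3, 2, 5)]
```

A character with an internal horizontal gap, such as 二, would be split in two by this naive rule, which is one reason the description also provides for manual correction of segmentation errors.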

FIG. 5 shows an example of a preferred way for the image indexing device (16) to generate index information for character images. Characters such as 51 and 53 produce index information such as 52 and 54. As the example shows, the horizontal component of 51 (the value shown in 52) is larger than the horizontal component of 53 (the value shown in 54).

FIG. 6 shows a user screen of a preferred image input correction system; 61 is the image selected by the user, and the images (characters) judged similar by the search function appear as results in area 63.

FIG. 7 is the screen on which the automatic code assigning device (18) assigns codes automatically. Preferably, the images selected as a group by a preset similarity threshold are given a code value automatically, and for the remaining characters not handled by the automatic assignment, the characters (71) that the user selects by hand can be given a code in a batch.

FIG. 8 is a preferred user interface screen of the manual code assigning device (19). Characters (81) that already have a code value are outside the scope of this device; it lets users directly enter code values for characters (82) that have not been given any code.

FIG. 9 is a preferred user interface screen of the code value correction device (20). The user enters a character in window 91 using input by phonetic value, the usual way of typing Chinese characters, and searches for characters assigned that code value; the results appear in window 94. Miscoded characters such as 92 and 93 are then easy to spot by eye, and their codes are easy to correct.

FIG. 10 presents a preferred example of these devices configured within a computer system. All the devices are loaded from the hard disk (114) into main memory (113) and interact with the user through the monitor (111); the image storage, indexing, and retrieval devices store the information they need on the hard disk and read it back as required. The image scanning device is connected to an external scanner (112) through a communication port, and the CPU (115) controls all of this processing.

Classical texts written in Chinese characters have long been difficult not only to decipher but also to build into data collections, except for experts. The difficulty in processing such texts is that the variety of characters, script styles, and variant characters makes them hard to read, and digitizing the material for research and storage requires an input process. Input, in turn, demands many man-hours, and similar-looking or easily confused characters prevent accurate entry. Because the input is done by ordinary data-entry operators rather than experts, a proofreading step is unavoidable. Proofreading is hard because the entered text must always be compared against the original by eye, and because anyone who is not an expert can compare the characters only as pictures, not as text.

The object of the present invention is to solve these problems in the processing of classical Chinese texts. The difficulty of reading is sidestepped by treating the Chinese characters as images; the characters are handled through image indexing and retrieval; and a method for entering the retrieved images in a batch is devised. This gives non-experts a way to do the work easily and reduces the workload of experts so that they can concentrate on more specialized research.

The present invention applies image processing not only to input but also to provide an effective method for proofreading and collating the entered data, thereby contributing to a more complete digitization of classical Chinese texts.

The usefulness of the technique is illustrated by the graph in FIG. 11: processing only the roughly 1,900 most frequent Chinese characters covers more than 95% of all the material. It follows that non-experts using the technique of the present invention can easily handle a large part of the work of processing classical Chinese texts.
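The kind of coverage figure quoted above can be computed from any corpus by ranking characters by frequency and accumulating their counts. The sketch below shows the calculation on a ten-character toy string; the string is illustrative data invented for this example, not the measurement behind FIG. 11, and the function names are hypothetical.

```python
from collections import Counter


def coverage_of_top(text: str, top_n: int) -> float:
    """Fraction of the text covered by its `top_n` most frequent characters."""
    if not text:
        return 0.0
    counts = Counter(text)
    covered = sum(n for _, n in counts.most_common(top_n))
    return covered / len(text)


def chars_needed(text: str, target: float) -> int:
    """Smallest number of top-ranked characters whose coverage reaches `target`."""
    counts = Counter(text)
    covered = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / len(text) >= target:
            return rank
    return len(counts)


sample = "之之之之不不不也也人"  # toy corpus: 4 + 3 + 2 + 1 characters

print(coverage_of_top(sample, 2))  # 0.7
print(chars_needed(sample, 0.9))   # 3
```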

Claims

An image-based input correction system comprising:

a scanning device for inputting the data of an old document to be entered;

an image duplication checking device that measures image similarity to avoid duplication among the many scanned materials;

an image correction device that removes noise and aligns the page horizontally and vertically so that information can be extracted accurately from the scanned image;

a segmentation device that generates character image information by cutting out characters one by one from the sentences or groups of characters contained in a scanned page image;

an image storage device for storing the image of each extracted character;

an image indexing device that extracts and stores feature information representing each image of the stored characters;

an image retrieval device that lists characters in order starting from the one most similar to a given character (image) and shows the similarity information;

an automatic code assigning device that automatically assigns a character code to characters in the search result that fall within a given range of similarity values;

a manual code assigning device that assigns a code value to a plurality of characters designated by the user in the search result; and

a correction device that, in order to check for errors among characters whose character codes have already been assigned, takes a code value as input and displays the corresponding character images and code values so that incorrect code values can be corrected.

The system of claim 1, wherein the image duplication checking device checks for image duplication using image index information computed for each whole page.

The system of claim 1, wherein the image correction device analyzes the vertical and horizontal spectra of each whole-page image to check whether correction is needed and to level the image.

The system of claim 1, wherein the image segmentation device analyzes the vertical and horizontal spectra of each whole-page image to cut out the character components contained in the image and create character images in single-character units.

The system of claim 1, wherein the image storage device stores the single-character image information generated by the segmentation device in a database or the like, for indexing and searching.

The system of claim 1, wherein the image indexing device extracts index information by analyzing the vertical, horizontal, and diagonal components of every stored single-character image.

The system of claim 1, wherein the image retrieval device calculates similarity by vector space measurement based on the index information of each character, and ranks the results starting from the most similar.

The system of claim 1, wherein the automatic code assigning device automatically assigns a code value to the character images collectively selected by a preset similarity threshold, and allows a code to be assigned in a batch to characters that the user manually selects from among the remaining characters not handled by the automatic assignment.

The system of claim 1, wherein the manual code assigning device lets the user directly enter code values for characters that have not yet been given a code, excluding characters to which a code value has already been assigned automatically.

The system of claim 1, wherein the code value correction device accepts a character entered in a character input window by phonetic-value Chinese character input, a common Chinese character input method, searches for characters assigned the same code value, and displays the results, so that miscoded characters can be identified easily by eye and their codes corrected easily.