KR100361176B1

KR100361176B1 - Method for recognizing written multi-character

Info

Publication number: KR100361176B1
Application number: KR1019950052889A
Authority: KR
Inventors: 박수근
Original assignee: 엘지전자 주식회사
Priority date: 1995-12-20
Filing date: 1995-12-20
Publication date: 2003-02-19
Also published as: KR970049822A

Abstract

PURPOSE: A method for recognizing handwritten multi-language characters is provided that recognizes characters of one or more languages by learning morphemes (semantic units), building a database from them, finding the matching morpheme in the database when a writer writes a character, judging the positional relationship between the morphemes, and combining them. CONSTITUTION: Duplicated points and points lying within a predetermined distance are removed from the input points of a cell, which is a set of input strokes comprising at least one stroke. Regions are separated according to the composition of the strokes and the cumulative angle formed by the movement of the points. Feature values that represent the characteristics of each region are extracted. A dissimilarity is calculated by comparing the extracted feature values with the feature values of reference morphemes in a database. The reference morpheme with the lowest dissimilarity is set as the recognized morpheme. Candidate characters are formed by trying all combinations of the strokes of the recognized morphemes. The character composed of morphemes with the lowest total dissimilarity is set as the recognized character.

Description

Cursive Multi-Character Recognition Method

The present invention relates to a method for recognizing handwritten characters and, in particular, to a handwritten multi-character recognition method that recognizes the language-dependent and language-independent parts of handwritten characters separately, so that characters of more than one kind, namely Korean, English, Japanese, and Chinese characters, numerals, and symbols, can be recognized at the same time.

One general field of character recognition is online character recognition, in which the movement of the strokes is extracted as features while the writer writes, and the character is recognized from those features.

However, such methods aim at recognizing the characters of a single language only. Because each language has its own characteristics, satisfactory recognition results cannot be obtained when characters of more than one language are written together.

Accordingly, in view of this problem, an object of the present invention is to make it possible to recognize characters of more than one language by learning semantic units (the smallest units that carry meaning), that is, the language-dependent parts, and building a database from them; finding the matching semantic unit in the database when a writer writes; judging whether the positional relationship between the semantic units is correct; and then combining them. The present invention having this object is described in detail below.

As shown in FIG. 1, the handwritten multi-language recognition method of the present invention comprises: a first process of distance filtering, which removes duplicated points and points lying within a certain distance from the input points of a cell, a cell being a set of input strokes consisting of one or more strokes; a second process of separating regions according to the composition of the strokes and the cumulative angle formed as the points move, and then extracting feature values that capture the characteristics of each region; a third process of calculating the dissimilarity between the feature values extracted in the second process and the feature values of reference semantic units stored in advance in a database, and taking the reference semantic unit with the lowest dissimilarity as the recognized semantic unit; and a fourth process of forming candidate characters by trying every possible way in which the strokes of the semantic units recognized in the third process can compose a character, and taking as the recognized character the one composed of semantic units whose summed dissimilarity is lowest.

The present invention configured in this way is described in detail with reference to FIGS. 1 and 2.

A cell, which is a set of input strokes consisting of one or more strokes, is the unit that is compared with the smallest semantic units of the learning database; feature extraction and recognition are performed on a per-cell basis.

Therefore, the cell first undergoes distance filtering, which removes duplicated points and points lying within a certain distance. The distance value used as the filtering threshold may vary with the resolution and the stroke size.

In general, a larger distance value is used when the resolution is high and the strokes are large.

This distance filtering has the effect of sampling the input points of the cell at regular intervals regardless of the writer's writing speed, so that variations caused by writing speed are absorbed. It also reduces the number of input points and therefore the amount of computation.
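The distance filtering described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `distance_filter` and the parameter `min_dist` are assumptions, and the threshold value would in practice depend on resolution and stroke size.

```python
import math

def distance_filter(points, min_dist):
    """Drop duplicated points and points closer than min_dist to the last kept point.

    points   : list of (x, y) tuples in input order (hypothetical representation)
    min_dist : filtering threshold; an assumed parameter, tuned per device
    """
    if not points:
        return []
    kept = [points[0]]
    for x, y in points[1:]:
        px, py = kept[-1]
        # A duplicate has distance 0, so it is removed by the same test.
        if math.hypot(x - px, y - py) >= min_dist:
            kept.append((x, y))
    return kept

raw = [(0, 0), (0, 0), (1, 0), (3, 0), (3.5, 0), (6, 0)]
filtered = distance_filter(raw, 2)
```

Because points are kept only when the pen has travelled at least `min_dist`, slow and fast writing of the same shape yield a similar number of points.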

Next, the distance-filtered data undergoes region separation and feature extraction. Feature extraction has the greatest influence on the recognition rate, so it is important to extract only the features that best represent the character.

If only the direction of the stroke is considered in feature extraction, variations in length within the stroke cannot be absorbed. Conversely, if absolute distance is extracted as a feature, variations in size and variations caused by rotation of the stroke's baseline cannot be absorbed.

Feature extraction therefore commonly uses methods that compensate for these drawbacks, and the present invention likewise uses both stroke-direction features and absolute-distance features.

First, regions are separated by splitting the strokes according to the composition of the strokes and the cumulative angle formed as the points move. Characters generally consist of vertical segments, horizontal segments, and circular segments, so if these three kinds of segments could be distinguished exactly, characters in any handwriting could be recognized.

Unfortunately, while characters written close to print style can be recognized satisfactorily, the recognition rate for cursively written characters inevitably falls as the cursiveness increases.

Therefore, the present invention separates regions as far as is possible in order to preserve the characteristics of the character, and characters so cursive that their original characteristics cannot be extracted are handled in the semantic unit recognition process.
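One way to picture region separation by cumulative angle is the sketch below. The source gives no threshold or exact splitting rule, so `split_by_cumulative_angle` and `max_turn_deg` are hypothetical: the stroke is cut whenever the accumulated turning angle since the last cut exceeds the assumed threshold.

```python
import math

def split_by_cumulative_angle(points, max_turn_deg=60.0):
    """Split one stroke into regions where the accumulated turn exceeds a threshold.

    max_turn_deg is an assumed value; the patent does not state one.
    """
    if len(points) < 3:
        return [list(points)]
    regions = []
    current = [points[0], points[1]]
    prev_dir = math.atan2(points[1][1] - points[0][1], points[1][0] - points[0][0])
    accumulated = 0.0
    for p in points[2:]:
        q = current[-1]
        direction = math.atan2(p[1] - q[1], p[0] - q[0])
        # Signed turn between consecutive segments, wrapped to [-180, 180).
        turn = math.degrees((direction - prev_dir + math.pi) % (2 * math.pi) - math.pi)
        accumulated += turn
        if abs(accumulated) > max_turn_deg:
            regions.append(current)   # close the region at the corner point
            current = [q, p]          # the corner point starts the next region
            accumulated = 0.0
        else:
            current.append(p)
        prev_dir = direction
    regions.append(current)
    return regions

# An L-shaped stroke: a horizontal run followed by a vertical run.
l_shape = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
l_regions = split_by_cumulative_angle(l_shape)
```

With the assumed 60 degree threshold the L-shaped stroke splits into a horizontal region and a vertical region, while a straight stroke stays in one region.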

Turning to feature extraction: if characters were recognized from the region-separation data alone, fine variations within each region could not be detected, and no improvement in the recognition rate could be expected. Several feature values that capture the characteristics of each region are therefore defined. The feature values representing one region are as follows.

First, consider extracting information on the movement of the stroke within a region. Since the shape of the stroke within the region must be encoded, each region is divided into eight equal parts regardless of its length, and the difference between the angle at each point and the angle formed by the start and end of the region is used as a feature value.

In this way, region separation distinguishes the rough shape of the character, and the eight-way division of each region captures its fine shape.
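The eight-way division can be illustrated with the sketch below. It rests on one interpretation that the source does not spell out: the region is resampled into eight parts of equal arc length, and "the angle at each point" is taken to be the direction of each part. The names `resample` and `movement_features` are hypothetical, and the region is assumed to have at least two points and non-zero length.

```python
import math

def resample(points, n_parts=8):
    """Return n_parts + 1 points spaced equally along the polyline's arc length."""
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cum[-1]
    out, j = [], 0
    for k in range(n_parts + 1):
        target = total * k / n_parts
        while j < len(points) - 2 and cum[j + 1] < target:
            j += 1
        seg = cum[j + 1] - cum[j]
        t = 0.0 if seg == 0 else (target - cum[j]) / seg
        out.append((points[j][0] + t * (points[j + 1][0] - points[j][0]),
                    points[j][1] + t * (points[j + 1][1] - points[j][1])))
    return out

def movement_features(points, n_parts=8):
    """Angle of each part minus the start-to-end angle of the region, in degrees."""
    pts = resample(points, n_parts)
    base = math.atan2(pts[-1][1] - pts[0][1], pts[-1][0] - pts[0][0])
    feats = []
    for a, b in zip(pts, pts[1:]):
        d = math.atan2(b[1] - a[1], b[0] - a[0])
        feats.append(math.degrees((d - base + math.pi) % (2 * math.pi) - math.pi))
    return feats

straight = movement_features([(0, 0), (8, 0)])
bent = movement_features([(0, 0), (4, 0), (4, 4)])
```

A straight region yields eight values near zero, while a bent region yields values that deviate from the start-to-end direction, which is the fine shape information the text refers to. Because the division is by proportion rather than by absolute length, the features do not depend on the size of the region.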

Next, consider extracting information on the size of a region. The size and position of each separated region within the cell are used as feature values: the x and y values of the region's top-left corner and the x and y values of its bottom-right corner.

Finally, consider extracting information on the start and end points of the stroke within a region. These values characterize the direction in which the stroke moves within the region and indicate the start point and end point of the stroke in the region.
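The size and endpoint features can be sketched together as below. The function name `box_and_endpoints` and the dictionary keys are hypothetical, and the sketch assumes screen coordinates in which y grows downward, so the top-left corner has the smallest x and y.

```python
def box_and_endpoints(points):
    """Bounding box corners plus stroke start and end point for one region.

    Assumes y grows downward (typical tablet or screen coordinates).
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return {
        "top_left": (min(xs), min(ys)),
        "bottom_right": (max(xs), max(ys)),
        "start": points[0],   # where the pen entered the region
        "end": points[-1],    # where the pen left the region
    }

info = box_and_endpoints([(5, 9), (3, 4), (7, 2)])
```

The bounding box alone cannot tell a stroke drawn left to right from the same stroke drawn right to left; the start and end points carry that direction information.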

If a cell consists of several strokes and each stroke consists of several regions, the feature values are assigned in the order in which the strokes were input.

For example, if the first stroke is divided into three regions with feature values F11, F12, and F13, and the second stroke is divided into two regions with feature values F21 and F22, the feature values of the whole two-stroke cell can be expressed as five feature values, as in Equation (1) below.

F = {F11, F12, F13, F21, F22}    Equation (1)

The data that has passed through feature extraction then undergoes a normalization process that allows the shape of the handwritten character to be reproduced while minimizing memory usage. The feature values obtained so far for each region of a cell, namely the information on the stroke start and end points and on the region size, can differ completely depending on where the character is written, and the values are also large, so computing with them would demand too much memory.

Accordingly, so that the shape of the handwritten character can be reproduced while minimizing memory usage, the diagonal length of the cell is set to 15, and each feature value is scaled in proportion to the size of the cell.

The reason is that representing values from 0 to 15 requires only 4 bits of memory. In addition, this normalization absorbs small, negligible variations in the features.
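The normalization step can be illustrated as follows. This is a sketch under assumptions: the name `normalize_cell` is hypothetical, coordinates are shifted so the cell's bounding box starts at the origin, and rounding to the nearest integer is one possible quantization (the source does not say how values are rounded).

```python
import math

def normalize_cell(points, diag=15):
    """Scale cell coordinates so the bounding-box diagonal equals diag.

    With diag = 15 every coordinate falls in 0..15 and fits in 4 bits.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x0, y0 = min(xs), min(ys)
    d = math.hypot(max(xs) - x0, max(ys) - y0)
    if d == 0:
        return [(0, 0) for _ in points]   # degenerate cell: a single point
    s = diag / d
    return [(round((x - x0) * s), round((y - y0) * s)) for x, y in points]

# A cell written far from the origin: width 90, height 120, diagonal 150.
normalized = normalize_cell([(100, 200), (190, 320)])
```

Writing position no longer matters after the shift, and the rounding discards variations smaller than one normalized unit.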

Next is the semantic unit recognition process, which calculates the dissimilarity between the normalized feature values and the feature values of the reference semantic units stored in advance in the database, and takes the reference semantic unit with the lowest dissimilarity as the recognized semantic unit.

The feature values of the reference semantic units are stored in multiple databases, one for each number of regions.

Recognizing a semantic unit means comparing the feature values of each semantic unit in the database with those of the extracted cell. First, from the multiple databases separated by number of feature regions, those whose region count differs from that of the extracted cell by between -2 and +2 are selected.

This step corresponds to what is commonly called coarse classification in character recognition.

In this selection, the database with the same number of regions is searched first, then the database with one fewer region, then the one with two fewer, then the one with one more, and finally the one with two more. The reason is that if the handwriting is neat and the regions were separated properly, the region count is most likely to equal that of the reference database. Next, if there is one hook (hooking: noise entered at the start or end of a stroke through imperfection of the input device or a slip by the writer, and one of the largest causes of misrecognition in online character recognition), the reference database with one fewer region is the most likely match, followed by the reference database with two fewer regions.

The databases with one or two more regions, by contrast, would match only if handwriting containing hooks had been learned into the reference database, which is unlikely, so they are searched last.
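The search order described above is small enough to write down directly. The function name `database_search_order` is hypothetical, and dropping region counts below 1 is an assumption, since a database for zero regions would be meaningless.

```python
def database_search_order(n_regions):
    """Region counts of the databases to search, most likely match first."""
    # Same count, then one fewer, two fewer, one more, two more.
    offsets = (0, -1, -2, 1, 2)
    return [n_regions + d for d in offsets if n_regions + d >= 1]

order_for_three = database_search_order(3)
order_for_one = database_search_order(1)
```

For a cell with three regions this gives the databases for 3, 2, 1, 4, and 5 regions, in that order.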

Once a database has been selected in this way, the semantic unit is recognized by comparing the feature values of the cell with those of every semantic unit in the database. The algorithm is as follows.

First, several variables used in the formulas are defined.

n: the number of regions in the input pattern

m: the number of regions in the comparison pattern

(symbol not reproduced): the i-th region of an input pattern with n regions

(symbol not reproduced): the j-th region of a comparison pattern with m regions

(symbol not reproduced): the i-th region of the input pattern and the j-th region of the comparison pattern

(symbol not reproduced): the dissimilarity between the relation of the i-th and (i+1)-th regions of the input pattern and the relation of the j-th and (j+1)-th regions of the comparison pattern

Using the variables defined above, consider first the case where the number of regions of the segmented cell equals that of the selected database (n = m). The total dissimilarity, Penalty, is expressed by Equation (2) below.

Here, Equation (2) is not always evaluated for all n terms: if the values of Dist1 and Dist2 exceed a certain level, that is, if the regions are judged to differ, the remaining terms are not calculated.
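The early stop can be illustrated with the sketch below. Equation (2) itself is not reproduced in this text, so the form used here is an assumption: the penalty is taken as the sum over regions of a per-region term `dist1` and a neighbor-relation term `dist2`, both supplied by the caller, and evaluation stops as soon as either term exceeds a hypothetical `cutoff`.

```python
def total_penalty(inp, ref, dist1, dist2, cutoff):
    """Sum dissimilarities over matching regions, stopping early on a clear mismatch.

    inp, ref     : region feature lists of equal length (the n = m case)
    dist1, dist2 : caller-supplied dissimilarity functions (assumed forms)
    cutoff       : assumed threshold above which regions are judged different
    Returns None when the patterns are judged different.
    """
    if len(inp) != len(ref):
        raise ValueError("this sketch covers only the n = m case")
    total = 0.0
    for i in range(len(inp)):
        d1 = dist1(inp[i], ref[i])
        # The relation term needs a following region; the last region has none.
        d2 = dist2(inp[i], inp[i + 1], ref[i], ref[i + 1]) if i + 1 < len(inp) else 0.0
        if d1 > cutoff or d2 > cutoff:
            return None   # skip the remaining terms
        total += d1 + d2
    return total

# Toy one-dimensional "features" purely for illustration.
toy_dist1 = lambda a, b: abs(a - b)
toy_dist2 = lambda a1, a2, b1, b2: abs((a2 - a1) - (b2 - b1))

close = total_penalty([1, 2, 3], [1, 2, 4], toy_dist1, toy_dist2, cutoff=5)
far = total_penalty([1, 2, 3], [1, 9, 4], toy_dist1, toy_dist2, cutoff=5)
```

Stopping early matters because every cell is compared against every reference semantic unit in the selected database, and most of those comparisons are clear mismatches.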

Next, consider the case where the number of regions of the segmented cell differs from that of the selected database (n ≠ m). The same character can yield different region counts in two cases: when a hook splits off a region, and when a region is split because of a misjudgment of vertical and horizontal components.

Therefore, if the region counts are judged to differ because of a hook, the hook must be removed before the regions are compared; if the difference is due to a misjudgment of vertical and horizontal components, the two split regions must be merged back into one before the regions are compared.

First, consider the hook case. If the differences in size and in direction between the two regions being compared are above a certain level, while the differences in size and direction between the next pair of regions to be compared are below a certain level, the region currently being compared is regarded as a hook, and the dissimilarity due to the hook is added to the total dissimilarity.

In the case of a split caused by misjudging vertical and horizontal components, a region that should have been one was divided into two because the angle change between two points exceeded a certain level. So if both regions are straight-line components, and the direction angle formed by the start point and end point of the two regions together differs from the direction angle of the compared region by less than a certain amount, the regions are judged to have been split. The two regions are merged, the dissimilarity between regions is then computed, and the strokes of the semantic unit are recognized.
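The merge test for wrongly split regions can be sketched as below. The region representation (a dictionary with `start`, `end`, and `straight` keys), the function names, and the `max_diff_deg` threshold are all hypothetical; the source only says the difference must be below "a certain amount".

```python
import math

def direction_deg(start, end):
    """Direction of the line from start to end, in degrees."""
    return math.degrees(math.atan2(end[1] - start[1], end[0] - start[0]))

def angle_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)

def should_merge(first, second, ref_dir_deg, max_diff_deg=20.0):
    """Decide whether two adjacent input regions were one region split in error."""
    if not (first["straight"] and second["straight"]):
        return False
    # Direction from the start of the first region to the end of the second.
    merged_dir = direction_deg(first["start"], second["end"])
    return angle_diff(merged_dir, ref_dir_deg) <= max_diff_deg

a = {"start": (0, 0), "end": (5, 1), "straight": True}
b = {"start": (5, 1), "end": (10, 0), "straight": True}
merge_horizontal = should_merge(a, b, ref_dir_deg=0.0)
merge_vertical = should_merge(a, b, ref_dir_deg=90.0)
```

Here a slightly wobbly horizontal line that was cut in two is merged when the reference region is horizontal, and left alone when the reference region is vertical.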

This dissimilarity is calculated by Equation (3) below.

Here, p is the number of times regions in the two patterns are actually compared after the hook regions have been removed from the n-region input pattern and the m-region comparison pattern and the split regions have been merged, that is, the number of regions after correction. It satisfies p < min(n, m).

Finally, the semantic unit combination process forms candidate characters by trying every possible way in which the strokes of the recognized semantic units can compose a character, and takes as the recognized character the one composed of semantic units whose summed dissimilarity is lowest. It works as follows.

Every possible way in which the strokes recognized by the semantic unit recognition process can compose a character is tried in order to build trees, from which the recognized character and the candidate characters are produced.

For example, as shown in FIG. 2, to recognize the input character '가', which consists of three strokes, stroke 1 alone is first made into a cell and recognized to build the first tree; next, strokes 1 and 2 are made into a cell to build the second tree; and the process continues in this order.

If the recognition of a node yields more than one result, new trees are created.

If recognition of the semantic unit at a node fails, or if the positional relationship between the current node and the previous node is wrong, the node is not expanded further and no recognized character is produced.

In this way trees are built for all cases, and candidate characters are recognized from the trees that reach the last stroke.

As a result, when the trees produce several candidate characters, the one composed of semantic units with the lowest dissimilarity among the recognition results is taken as the finally recognized character.
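The combination process can be sketched as a recursive enumeration of stroke groupings. This is an illustration under assumptions, not the patent's tree structure: `combine`, `recognize`, and `position_ok` are hypothetical names, the recognizer is a caller-supplied function returning `(label, dissimilarity)` pairs (empty when recognition fails), and the positional check sees only labels.

```python
def combine(strokes, recognize, position_ok=lambda prev, cur: True):
    """Enumerate every grouping of consecutive strokes into cells.

    Returns (total_dissimilarity, [labels]) pairs, lowest total first.
    Branches where recognition or the positional check fails are dropped.
    """
    results = []

    def expand(start, units, total):
        if start == len(strokes):          # reached the last stroke
            results.append((total, units))
            return
        for end in range(start + 1, len(strokes) + 1):
            cell = strokes[start:end]
            for label, dis in recognize(cell):   # several results: several branches
                if units and not position_ok(units[-1], label):
                    continue
                expand(end, units + [label], total + dis)

    expand(0, [], 0.0)
    return sorted(results, key=lambda r: r[0])

# Toy recognizer for a three-stroke input; labels and scores are made up.
table = {
    ("s1",): [("giyeok", 1.0)],
    ("s2", "s3"): [("a", 2.0)],
    ("s1", "s2"): [("kieuk", 4.0)],
    ("s3",): [("eu", 3.0)],
}
candidates = combine(["s1", "s2", "s3"], lambda cell: table.get(tuple(cell), []))
```

The first entry of the sorted list plays the role of the recognized character, and the remaining entries are the candidates kept for correcting misrecognitions.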

Thus, because the present invention recognizes characters in the smallest units that carry meaning, it can recognize characters of multiple languages. And because it retains not just a single recognition result but the recognition result together with candidate results, it provides the ability to correct misrecognitions.

FIG. 1 is a diagram showing the process of the handwritten multi-character recognition method of the present invention.

FIG. 2 is a diagram showing the semantic unit combination process of FIG. 1.

Claims

A handwritten multi-character recognition method comprising: a first process of distance filtering, in which duplicated points and points lying within a certain distance are removed from the input points of a cell, the cell being a set of input strokes consisting of one or more strokes; a second process of separating regions according to the composition of the strokes and the cumulative angle formed as the points move, and then extracting feature values that capture the characteristics of each region; a third process of calculating the dissimilarity between the feature values extracted in the second process and the feature values of reference semantic units stored in advance in a database, and taking the reference semantic unit with the lowest dissimilarity as the recognized semantic unit; and a fourth process of forming candidate characters by trying every possible way in which the strokes of the semantic units recognized in the third process can compose a character, and taking as the recognized character the character composed of semantic units whose summed dissimilarity is lowest.

The method of claim 1, further comprising a normalization process for reproducing a handwritten character form while minimizing a memory capacity after the second process.

The method of claim 1, wherein the feature value extracted in the second process is information about a stroke movement in the region.

The method of claim 1, wherein the feature value extracted in the second process is information about the size and position of a region.

The method of claim 1, wherein the feature value extracted in the second process is information about a start point and an end point of a stroke in a region.

The method of claim 1, wherein the database comprises a plurality of databases organized by number of regions, corresponding to the number of regions separated in the second process.

The method of claim 3, wherein the information about the stroke movement is the difference between the angle of each point and the angle formed by the start and end of the region, obtained by dividing the region into a predetermined number of parts regardless of its length.