KR930023866A

KR930023866A - Method for Extracting Mixed Characters in a Document Recognition Device

Info

Publication number: KR930023866A
Application number: KR1019920009215A
Authority: KR
Inventors: 노희호
Original assignee: 이헌조; 주식회사 금성사
Priority date: 1992-05-28
Filing date: 1992-05-28
Publication date: 1993-12-21

Abstract

The present invention relates to a method for extracting mixed characters in a document recognition device. Its purpose is to distinguish Hangul from Hanja in a mixed-script document, that is, a document in which several kinds of characters appear together.

To this end, the invention comprises: a scanning process of scanning an input document to generate a binary image; a character string extraction process of distinguishing picture regions from character regions in the input image obtained by scanning, and separating character strings from the character regions; an individual character extraction process of extracting individual characters from the separated character strings; a Hangul/Hanja distinction process of distinguishing Hangul from Hanja among the extracted individual characters; and a recognition process of recognizing each distinguished character.

Description

Method for Extracting Mixed Characters in a Document Recognition Device

Because this publication discloses only the essential parts of the application, the full text is not included.

FIG. 2 is a block diagram of the mixed-character extraction system of the document recognition device of the present invention, FIG. 3 is a document recognition signal flow diagram explaining the operation of FIG. 2, and FIG. 4 is a signal flow diagram of the Hangul/Hanja distinction process of FIG. 3.

Claims

A method for extracting mixed characters in a document recognition device, comprising: a scanning process of scanning an input document to generate a binary image; a character string extraction process of distinguishing picture regions from character regions in the input image obtained by the scanning process, and separating character strings from the character regions; an individual character extraction process of extracting individual characters from the separated character strings; a Hangul/Hanja distinction process of distinguishing Hangul from Hanja among the extracted individual characters; and a recognition process of recognizing each distinguished character.
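The claim names the string and character extraction stages but does not say how the separation is done. A minimal sketch, assuming horizontally written text and using projection profiles (a common technique that the source does not confirm), could look like the following. The function names are hypothetical, and the picture/character separation stage is left out.

```python
from typing import List, Tuple

Image = List[List[int]]  # binary image: 1 = ink, 0 = background


def runs(profile: List[int]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges where the profile is non-zero."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def extract_strings(image: Image) -> List[Tuple[int, int]]:
    """Assumed string extraction: consecutive rows containing ink form a line."""
    return runs([sum(row) for row in image])


def extract_characters(image: Image, line: Tuple[int, int]) -> List[Tuple[int, int]]:
    """Assumed character extraction: consecutive inked columns within a line."""
    top, bottom = line
    width = len(image[0]) if image else 0
    column_ink = [sum(image[y][x] for y in range(top, bottom)) for x in range(width)]
    return runs(column_ink)
```

Each returned range is a row band (for strings) or a column band (for characters) that later stages can crop and pass to the Hangul/Hanja distinction process.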

The method of claim 6, wherein the Hangul/Hanja distinction process comprises: a first step (14a) of setting a flag (S) for the individual characters of the input character string; a second step (14b) of searching the set flag (S) and detecting a word (Ws); a third step (14c) of setting an individual character index (i) within the word detected in the second step (14b); a fourth step (14d)(14e) of checking whether the first two individual characters (N_i)(N_{i+1}) of the word are Hanja; a fifth step (14f) of designating the other individual characters as Hanja if an individual character is recognized as Hanja; a sixth step (14g)(14h) of checking, after the fifth step (14f), whether the individual character (i) of the word is the last character (n), and if not, continuing until the last character has been reached; a seventh step (14i) of checking, if the first two individual characters are found not to be Hanja in the fourth step (14d)(14e), whether a symbol is included in the word and, if a symbol is present, checking for a character enclosed by the symbol; an eighth step (14i) of incrementing the individual character flag for the next character string once the last character has been reached in the sixth step (14g) and the seventh step (14i); and a ninth step (14k) of checking the incremented flag (S) and, if the flag indicates the last word of the character string, completing the processing of the extracted individual characters, and otherwise returning to the second step (14Y) to search for the next word.
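The word-level loop in this claim can be sketched as follows. This is an illustration under assumptions, not the patented procedure: `label_word`, `label_string`, and the two predicates are hypothetical names, symbols are assumed to come in opening/closing pairs, and the stub predicates classify text characters by Unicode range, whereas the patent works on character images.

```python
from typing import Callable, List, Sequence


def label_word(word: Sequence, is_hanja: Callable, is_symbol: Callable) -> List[str]:
    """Label every character of one word as 'hanja', 'hangul', or 'symbol'."""
    # Fourth and fifth steps: test only the first two characters and,
    # if either is Hanja, designate the whole word as Hanja.
    if any(is_hanja(c) for c in word[:2]):
        return ["hanja"] * len(word)
    # Seventh step (assumed reading): look for characters enclosed by symbols.
    labels = ["symbol" if is_symbol(c) else "hangul" for c in word]
    marks = [i for i, c in enumerate(word) if is_symbol(c)]
    for start, end in zip(marks[0::2], marks[1::2]):
        inner = list(range(start + 1, end))
        if any(is_hanja(word[i]) for i in inner[:2]):
            for i in inner:
                labels[i] = "hanja"
    return labels


def label_string(words: Sequence[Sequence], is_hanja: Callable, is_symbol: Callable):
    """Second, eighth, and ninth steps: visit each word of the string in turn."""
    return [label_word(w, is_hanja, is_symbol) for w in words]


# Stand-in predicates for demonstration only (not part of the source).
def demo_is_hanja(c: str) -> bool:
    return "\u4e00" <= c <= "\u9fff"


def demo_is_symbol(c: str) -> bool:
    return c in "()"
```

Testing only two characters per word keeps the number of expensive shape tests low, which appears to be the point of the word-level propagation described in the claim.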

The method of claim 2, wherein the fourth step (14d)(14e) comprises: a step of obtaining the number of subpatterns of the input character and determining the character to be Hanja if the number of subpatterns is greater than or equal to a predetermined value; an eleventh step (140b) of examining the straight-line components at the upper and left ends of the peripheral region of the character and determining the character to be Hanja if a horizontal line similar in length to the character width or a vertical line similar in length to the character height exists; a twelfth step (140c) of examining the frequency of contact changes at the left, right, and bottom portions of the character and determining the character to be Hanja if the frequency is greater than or equal to a predetermined value; a thirteenth step (140d) of examining the position of the smallest subpattern of the character and recognizing the character as Hanja if that subpattern lies at the upper left, upper right, lower left, or lower right; a fourteenth step (140e) of examining whether a vertical line exists in the central part of the character and determining the character to be Hanja if a vertical line similar in length to the character height exists; a fifteenth step (140f) of examining the position, arrangement, and size of the subpatterns of the character and determining the character to be Hanja if the number of patterns remaining after merging equals a predetermined value; a sixteenth step (140g) of examining the vertical component at the left end of the character and determining the character to be Hanja if a vertical line similar in length to a predetermined portion of the character height exists; and a seventeenth step (140h) of examining the character for a Hanja head component and determining the character to be Hanja if one is present.

The method for extracting mixed characters in a document recognition device, wherein the Hanja check in the fourth step (14d)(14e) examines whether the first two characters of the word detected from the extracted individual characters are Hanja, and, if one character is identified as Hanja, the other characters are also identified as Hanja.

The method of claim 3, wherein the predetermined number of subpatterns in the eleventh step (140a) is five or more.
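The subpattern count can be illustrated with a short sketch. It assumes that a "subpattern" is an 8-connected component of ink pixels, which the source does not state. The threshold of five comes from the claim, and the function names are hypothetical.

```python
from typing import List


def count_subpatterns(glyph: List[List[int]]) -> int:
    """Count 8-connected components of ink (1) pixels with an iterative flood fill."""
    height = len(glyph)
    width = len(glyph[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if not glyph[y][x] or seen[y][x]:
                continue
            count += 1
            seen[y][x] = True
            stack = [(y, x)]
            while stack:
                cy, cx = stack.pop()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and glyph[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((ny, nx))
    return count


def has_many_subpatterns(glyph: List[List[int]], threshold: int = 5) -> bool:
    """Hanja candidate if the glyph has at least `threshold` subpatterns."""
    return count_subpatterns(glyph) >= threshold
```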

The method of claim 3, wherein the examination of straight-line components in the peripheral region of the character in the eleventh step (140a) covers, at the upper end, the area from the top of the character down to one quarter of the character height and, at the left end, the area from the left edge of the character to one third of the character width.
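A rough sketch of this border-line test follows. The search regions (top quarter, left third) come from the claim. The `similarity` ratio that decides when a line is "similar" to the character width or height is an assumption, as are the function names and the use of the longest unbroken run of ink as the line length.

```python
from typing import List, Sequence


def longest_run(values: Sequence[int]) -> int:
    """Length of the longest unbroken run of ink (non-zero) values."""
    best = current = 0
    for value in values:
        current = current + 1 if value else 0
        best = max(best, current)
    return best


def has_border_line(glyph: List[List[int]], similarity: float = 0.9) -> bool:
    """True if a near-full-width line lies in the top quarter of the glyph,
    or a near-full-height line lies in its left third."""
    height, width = len(glyph), len(glyph[0])
    # Upper end: rows from the top down to one quarter of the height.
    for row in glyph[: max(1, height // 4)]:
        if longest_run(row) >= similarity * width:
            return True
    # Left end: columns from the left edge to one third of the width.
    for x in range(max(1, width // 3)):
        column = [glyph[y][x] for y in range(height)]
        if longest_run(column) >= similarity * height:
            return True
    return False
```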

The method of claim 3, wherein the predetermined frequency of contact changes at the left, right, and bottom portions in the twelfth step (140c) is eight or more.
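The source does not define "contact change". One possible reading, offered only as an assumption, is the number of times the glyph switches between touching and not touching its bounding box along the left, right, and bottom edges. The sketch below implements that reading with hypothetical function names and the claim's threshold of eight.

```python
from typing import List, Sequence


def transitions(values: Sequence[int]) -> int:
    """Number of ink/background switches along a sequence of pixels."""
    return sum(1 for a, b in zip(values, values[1:]) if bool(a) != bool(b))


def contact_changes(glyph: List[List[int]]) -> int:
    """Assumed measure: switches along the left, right, and bottom edges
    of a glyph that has already been cropped to its bounding box."""
    left = [row[0] for row in glyph]
    right = [row[-1] for row in glyph]
    bottom = glyph[-1]
    return transitions(left) + transitions(right) + transitions(bottom)


def has_frequent_contact_changes(glyph: List[List[int]], threshold: int = 8) -> bool:
    """Hanja candidate if the edges change contact at least `threshold` times."""
    return contact_changes(glyph) >= threshold
```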

The method of claim 3, wherein the presence or absence of the vertical line in the center of the character in the fourteenth step (140e) is examined within the central one-third region of the character.
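As an illustration, the central vertical-line test might be sketched like this. The central-third search region comes from the claim. The `similarity` ratio used to decide that a stroke is "similar to the height of the character" and the function name are assumptions.

```python
from typing import List


def has_central_vertical_line(glyph: List[List[int]], similarity: float = 0.9) -> bool:
    """True if some column in the central third holds an unbroken ink run
    nearly as long as the glyph is tall."""
    height, width = len(glyph), len(glyph[0])
    start = width // 3
    end = max(start + 1, (2 * width) // 3)
    for x in range(start, end):
        best = current = 0
        for y in range(height):
            current = current + 1 if glyph[y][x] else 0
            best = max(best, current)
        if best >= similarity * height:
            return True
    return False
```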

The method of claim 15, wherein, in the fifteenth step (140f), if the number of subpatterns examined is two, the heights and widths of the two subpatterns are examined, and the character is determined to be Hanja if the height of each subpattern is greater than its width, the difference between the positions of their upper and lower portions is four, and the distance between the two patterns is 5 pixels or less; and, if the number of subpatterns is three or more, two patterns are merged into one pattern.
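The two-subpattern rule can be sketched on bounding boxes. Several points here are assumptions rather than statements from the source: the "difference of four" is read as at most four pixels for both the top edges and the bottom edges, the distance between patterns is read as the horizontal gap, and merging is read as taking the union of two bounding boxes. `Box`, `two_pattern_rule`, and `merge` are hypothetical names.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Bounding box of a subpattern in inclusive pixel coordinates."""
    left: int
    top: int
    right: int
    bottom: int

    @property
    def width(self) -> int:
        return self.right - self.left + 1

    @property
    def height(self) -> int:
        return self.bottom - self.top + 1


def two_pattern_rule(a: Box, b: Box, max_offset: int = 4, max_gap: int = 5) -> bool:
    """Hanja candidate if both boxes are taller than wide, vertically aligned
    within `max_offset` pixels, and separated by at most `max_gap` pixels."""
    first, second = sorted((a, b))  # tuples sort by their left edge first
    tall = a.height > a.width and b.height > b.width
    aligned = (abs(a.top - b.top) <= max_offset
               and abs(a.bottom - b.bottom) <= max_offset)
    gap = second.left - first.right - 1
    return tall and aligned and gap <= max_gap


def merge(a: Box, b: Box) -> Box:
    """Merge two subpatterns into one by taking the union of their boxes."""
    return Box(min(a.left, b.left), min(a.top, b.top),
               max(a.right, b.right), max(a.bottom, b.bottom))
```

When three or more subpatterns are present, `merge` could be applied to a chosen pair before the count is re-evaluated. The source does not say which pair is chosen.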

The method of claim 3, wherein the examination of the vertical component at the left end of the character in the sixteenth step (140g) covers two-thirds of the character height between the top and bottom of the character, and the character is determined to be Hanja if a vertical line similar in length to two-thirds of the character height exists.

The method of claim 3, wherein the examination of the Hanja head component in the seventeenth step (140h) examines the straight-line component at the upper part of the Hanja and the central portion of that straight-line component, and detects the comb there to identify the head component.

※ Note: The disclosure is based on the initial application.