JP5387378B2

JP5387378B2 - Character identification device and character identification method

Info

Publication number: JP5387378B2
Application number: JP2009283960A
Authority: JP
Inventors: 勇大石; 千織村松
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2009-12-15
Filing date: 2009-12-15
Publication date: 2014-01-15
Anticipated expiration: 2029-12-15
Also published as: CN102096828B; JP2011128688A; CN102096828A

Description

The present invention relates to a character identification device and a character identification method.

For example, when computer systems are integrated following a merger of municipalities, characters that the separate computer systems had each processed in their own way must be processed uniformly in the new computer system. In this case, the design stage of the new computer system requires identification work that unifies a plurality of different characters into one character. Identification work is the task in which an operator visually checks a plurality of characters and judges whether they may be treated as the same character.

For example, identification work is needed between characters that are not defined in JIS, in other words, between external characters (gaiji). Identification work is also needed between external characters and characters defined in JIS. Such external characters are used frequently in, for example, personal names and place names.

Such identification work is performed, for example, by printing a list of the characters subject to identification and searching for the characters to be identified while visually checking every character. In doing so, printed-type OCR technology, which recognizes the printed characters by OCR, is used to make identification more efficient.

For an optical character reader, the following has been proposed: a character image is divided into n × n blocks, a feature vector is extracted from each block, and the feature vectors are compared against a dictionary to retrieve a group of candidate characters. It is then determined whether a candidate character can be divided into radicals. If it can, the character image is divided into a plurality of radical portions, the image portion corresponding to each radical is processed to retrieve candidate characters for that radical portion, and the group of kanji having those candidates as radicals is retrieved.

For a character recognition system, it has also been proposed that, when a rejected or misrecognized character in a recognition result is corrected, a new feature quantity is generated by combining the feature quantity of the pattern of the rejected or misrecognized character with the feature quantity in the recognition dictionary that corresponds to the correct character. This new feature quantity either replaces the feature quantity in the recognition dictionary or is added to the recognition dictionary.

Japanese Patent Laid-Open No. 4-205078
Japanese Patent Laid-Open No. 2-186484

When a printed list of characters is checked visually, the work becomes very laborious if there are thousands of external characters subject to identification. Likewise, when two or more computer systems are to be integrated, the work becomes extremely laborious and difficult.

Even when printed-type OCR technology is used, candidate characters are extracted by recognizing the character as a whole, so the accuracy of the candidates is low and the result is no more than reference material of limited use. In other words, when no suitable candidate character is obtained, the operator must in the end also check the character list visually, and that visual check accounts for a large share of the work.

An object of the present invention is to provide a character identification device that can obtain candidate characters for identification with high accuracy.

The disclosed character identification device includes a character recognition storage unit, a glyph element storage unit, an external character storage unit, an external character glyph element storage unit, a candidate character list generation unit, and an OCR recognition unit. The character recognition storage unit stores dot patterns of characters. The glyph element storage unit stores, for each character stored in the character recognition storage unit, an arrangement pattern indicating the arrangement of the radical, glyph element information of the radical including a radical character code representing the radical, and glyph element information of the part including a part character code representing the portion other than the radical. The external character storage unit stores dot patterns of external characters, which are characters not included in the standardized characters represented by character codes representing predetermined characters. The external character glyph element storage unit stores, for each external character stored in the external character storage unit, an arrangement pattern indicating the arrangement of the radical, glyph element information of the radical including a radical character code representing the radical, and glyph element information of the part including a part character code representing the portion other than the radical.

For an external character to be processed that is selected from the external character storage unit, the OCR recognition unit extracts, from the characters stored in the character recognition storage unit, a first candidate character with which that external character may be identified. The extraction is based on the dot patterns of the external characters stored in the external character storage unit and the dot patterns of the characters stored in the character recognition storage unit. For the same external character, the candidate character list generation unit extracts, from the characters stored in the glyph element storage unit, a second candidate character based on the radical glyph element information of the characters stored in the glyph element storage unit and the radical glyph element information of the external characters stored in the external character glyph element storage unit. It likewise extracts, from the characters stored in the glyph element storage unit, a third candidate character based on the part glyph element information of the characters stored in the glyph element storage unit and the part glyph element information of the external characters stored in the external character glyph element storage unit.

According to the disclosed character identification device, candidate characters for identification can be obtained with high accuracy even when there are many external characters subject to identification. This reduces the burden on the operator who performs identification and shortens the time needed to build a new computer system that integrates a plurality of computer systems.

A diagram showing an example of the configuration of the character identification device.
A diagram showing an example of the external character file and the external character glyph element storage file.
A diagram showing an example of the character recognition dictionary and the glyph element dictionary.
A diagram showing an example of character identification.
A diagram showing an example of character identification.
A diagram showing an example of character identification.
A diagram showing an example of character identification.
A diagram showing an example of character identification.
A diagram showing an example of character identification.
A diagram showing the processing flow of character identification.
A diagram showing the processing flow of single-character identification.
A diagram showing the processing flow of single-character identification.
A diagram showing the processing flow of candidate character list generation.
A diagram showing the processing flow of candidate character learning.

FIG. 1 is a diagram showing an example of the configuration of the character identification device 1.

The character identification device 1 includes an external character file 2, an external character glyph element storage file 3, a character code conversion definition list 4, a display unit 5, and a keyboard 6. The character identification device 1 also includes an identification processing unit 11, a character recognition dictionary 12, an OCR candidate character list 13, a glyph element dictionary 14, a radical candidate character list 15, a part candidate character list 16, a display candidate character list 17, and an identification source/identification target character correspondence list 18. The identification processing unit 11 includes an OCR recognition unit 111, a candidate character list generation unit 112, a display candidate character list generation unit 113, and a character information learning unit 114.

In the character identification device 1, the external character file 2 and the external character glyph element storage file 3 form an external character data set that stores the data of the external characters (kanji) to be processed. The characters to be processed may also be characters other than external characters. The external character file 2 and the external character glyph element storage file 3 are prepared in advance. As described later, the external character file 2 and the external character glyph element storage file 3 store data that correspond to each other.

The external character file 2 stores a dot pattern for each external character. An external character is a character that is not included in the standardized characters represented by character codes representing predetermined characters. A character code is a unique number assigned to each character or symbol so that it can be handled by a computer. The character code is, for example, a JIS code. An external character is, for example, a character that cannot be represented by a JIS code. A dot pattern is data that represents a character as a pattern of black pixels by assigning white (= 0) or black (= 1) to each pixel in the display area of the character.
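The dot pattern described above can be pictured with a minimal sketch. The row-of-strings input format, the function names `parse_dot_pattern` and `render`, and the 5 × 5 sample are assumptions made for illustration; the source only states that each pixel is white (= 0) or black (= 1).

```python
def parse_dot_pattern(rows):
    """Convert rows of '0'/'1' characters into a grid of ints (0 = white, 1 = black)."""
    if not rows:
        raise ValueError("a dot pattern needs at least one row")
    width = len(rows[0])
    pattern = []
    for row in rows:
        # Every row must have the same width and contain only 0 or 1.
        if len(row) != width or any(c not in "01" for c in row):
            raise ValueError(f"invalid row: {row!r}")
        pattern.append([int(c) for c in row])
    return pattern


def render(pattern):
    """Draw black pixels as '#' and white pixels as '.' for visual inspection."""
    return "\n".join("".join("#" if p else "." for p in row) for row in pattern)


# Hypothetical 5 x 5 sample (a cross shape), not taken from the patent figures.
sample = parse_dot_pattern(["00100", "00100", "11111", "00100", "00100"])
print(render(sample))
```

A real external character file would presumably use a larger grid per character, but the source does not state the resolution.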

The external character glyph element storage file 3 stores, for each external character stored in the external character file 2, an arrangement pattern, glyph element information of the radical, and glyph element information of the part other than the radical. The arrangement pattern indicates the arrangement of the radical. The glyph element information of the radical includes a radical character code representing the radical. The glyph element information of the part other than the radical includes a part character code representing that part.

Here, the glyph elements of a character are its radical and its components other than the radical (hereinafter simply "part"). A glyph is the shape of the whole character, the shape of the radical, or the shape of a component other than the radical. A radical is a henbō (偏旁), one of the glyph elements constituting a character (in other words, a kanji), that has been designated as a criterion for classifying kanji. A henbō is one of the elements that make up the form of a kanji, obtained by decomposing the kanji into left, right, upper, lower, inner, and outer portions. Radicals are the portions shared among kanji that serve as a guide for ordering kanji, and include, for example, hen (left-side radicals), tsukuri (right-side radicals), and kanmuri (top radicals). The part is the component of a character other than its radical, that is, what remains when the radical is removed from the character. A radical is uniquely specified by a radical character code. A part is uniquely specified by a part character code.

The arrangement pattern indicates the position at which the radical is placed. Each henbō, in other words each radical, is assigned an identification number indicating its arrangement pattern according to the position at which it is placed, for example as follows.

Arrangement pattern "1" represents hen: the radical occupies the left side of a kanji that divides into left and right.
Arrangement pattern "2" represents tsukuri: the radical occupies the right side of a kanji that divides into left and right.
Arrangement pattern "3" represents kanmuri: the radical occupies the upper side of a kanji that divides into top and bottom.
Arrangement pattern "4" represents ashi: the radical occupies the lower side of a kanji that divides into top and bottom.
Arrangement pattern "5" represents tare: the radical hangs down from the top toward the lower left.
Arrangement pattern "6" represents nyō: the radical runs from the left along the bottom, like a hen and an ashi combined.
Arrangement pattern "7" represents kamae: the radical encloses the character from the outside.
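The seven arrangement patterns and the per-character glyph element record can be sketched as follows. The class names `ArrangementPattern` and `GlyphElements` and the code strings `"R-0001"` and `"P-0001"` are hypothetical; the source specifies only the identification numbers 1 to 7 and that radicals and parts are identified by character codes.

```python
from dataclasses import dataclass
from enum import IntEnum


class ArrangementPattern(IntEnum):
    """Identification numbers for radical placement, as listed in the description."""
    HEN = 1      # left side of a left/right kanji
    TSUKURI = 2  # right side of a left/right kanji
    KANMURI = 3  # upper side of a top/bottom kanji
    ASHI = 4     # lower side of a top/bottom kanji
    TARE = 5     # hangs from the top toward the lower left
    NYO = 6      # runs from the left along the bottom
    KAMAE = 7    # encloses the character from the outside


@dataclass(frozen=True)
class GlyphElements:
    """One record of glyph element information for a character (field names assumed)."""
    arrangement: ArrangementPattern
    radical_code: str  # radical character code
    part_code: str     # part character code (the portion other than the radical)


# Hypothetical record: a character whose radical sits on the left side.
example = GlyphElements(ArrangementPattern(1), "R-0001", "P-0001")
print(example.arrangement.name)
```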

In the character identification device 1, the character recognition dictionary 12 and the glyph element dictionary 14 form a character data set that stores the data of the characters (kanji) with which the external characters to be processed are identified. The character recognition dictionary 12 and the glyph element dictionary 14 are prepared in advance. As described later, the glyph element dictionary 14 is updated by learning processing. As described later, the character recognition dictionary 12 and the glyph element dictionary 14 store data that correspond to each other.

The character recognition dictionary 12 stores dot patterns of characters. The characters stored in the character recognition dictionary 12 are standardized characters, in other words, characters represented by JIS codes. The characters stored in the character recognition dictionary 12 may also include both standardized characters and external characters.

The glyph element dictionary 14 stores, for each character stored in the character recognition dictionary 12, an arrangement pattern, glyph element information of the radical, and glyph element information of the part. As described above, the arrangement pattern indicates the arrangement of the radical, the glyph element information of the radical includes a radical character code representing the radical, and the glyph element information of the part includes a part character code representing the portion other than the radical.

The OCR recognition unit 111 reads an external character from the external character file 2 and treats it as the processing target. For this external character, the OCR recognition unit 111 extracts a first candidate character with which the external character may be identified. The first candidate character is extracted from the characters stored in the character recognition dictionary 12, based on the dot pattern of the external character stored in the external character file 2 and the dot patterns of the characters stored in the character recognition dictionary 12.

Specifically, the OCR recognition unit 111 performs OCR processing on the dot pattern of the external character to be processed and the dot patterns of the characters stored in the character recognition dictionary 12. In this way, the OCR recognition unit 111 extracts, from the dot patterns stored in the character recognition dictionary 12, those that resemble the dot pattern of the external character as a whole character.

The OCR recognition unit 111 treats the characters of the extracted dot patterns as first candidate characters for the external character to be processed. One or more first candidate characters are extracted. The OCR recognition unit 111 stores the first candidate characters in the OCR candidate character list 13. The OCR candidate character list 13 is thereby generated for the external character to be processed. The OCR recognition unit 111 notifies the display candidate character list generation unit 113 that the OCR candidate character list 13 has been generated.
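The whole-character comparison can be illustrated with a minimal sketch. The source does not say how the OCR processing scores similarity, so the pixel-agreement measure, the `threshold` value of 0.8, and the function names below are all assumptions chosen only to make the idea concrete.

```python
def pixel_agreement(a, b):
    """Fraction of pixels with the same value in two equally sized dot patterns."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("dot patterns must have the same dimensions")
    total = sum(len(row) for row in a)
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / total


def first_candidates(external_pattern, recognition_dictionary, threshold=0.8):
    """Return dictionary characters whose pattern resembles the external character.

    recognition_dictionary maps a character (or its code) to its dot pattern.
    Results are ordered from most to least similar; the threshold is an assumption.
    """
    scored = [
        (pixel_agreement(external_pattern, pattern), char)
        for char, pattern in recognition_dictionary.items()
    ]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [char for score, char in scored if score >= threshold]


# Hypothetical 3 x 3 patterns standing in for entries of the recognition dictionary.
dictionary = {
    "A": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
    "B": [[1, 1, 1], [0, 1, 0], [0, 1, 1]],
    "C": [[0, 0, 0], [0, 0, 0], [0, 0, 0]],
}
external = [[1, 1, 1], [0, 1, 0], [0, 1, 0]]
print(first_candidates(external, dictionary))
```

A production OCR engine would normally use feature extraction rather than raw pixel agreement; the sketch only shows how one or more first candidates fall out of a whole-character comparison.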

The OCR recognition unit 111 notifies the candidate character list generation unit 112 of the external character to be processed. In response, the candidate character list generation unit 112 refers to the external character glyph element storage file 3 and reads the radical glyph element information and the part glyph element information stored there for that external character.

For the external character to be processed, the candidate character list generation unit 112 extracts a second candidate character with which the external character may be identified. The second candidate character is extracted from the characters stored in the glyph element dictionary 14, based on the radical glyph element information of the characters stored in the glyph element dictionary 14 and the radical glyph element information of the external character stored in the external character glyph element storage file 3.

Specifically, the candidate character list generation unit 112 compares the radical character code in the radical glyph element information of the external character to be processed with the radical character codes in the radical glyph element information of the characters stored in the glyph element dictionary 14. In this way, the candidate character list generation unit 112 extracts, from the characters stored in the glyph element dictionary 14, the characters whose radical character code is the same as that of the external character to be processed.

The candidate character list generation unit 112 treats the extracted characters as second candidate characters for the external character to be processed. One or more second candidate characters are extracted. The candidate character list generation unit 112 stores the second candidate characters in the radical candidate character list 15. The radical candidate character list 15 is thereby generated for the external character to be processed. The candidate character list generation unit 112 notifies the display candidate character list generation unit 113 that the radical candidate character list 15 has been generated.

The candidate character list generation unit 112 also extracts, for the external character to be processed, a third candidate character with which the external character may be identified. The third candidate character is extracted from the characters stored in the glyph element dictionary 14, based on the part glyph element information of the characters stored in the glyph element dictionary 14 and the part glyph element information of the external character stored in the external character glyph element storage file 3.

Specifically, the candidate character list generation unit 112 compares the part character code in the part glyph element information of the external character to be processed with the part character codes in the part glyph element information of the characters stored in the glyph element dictionary 14. In this way, the candidate character list generation unit 112 extracts, from the characters stored in the glyph element dictionary 14, the characters whose part character code is the same as that of the external character to be processed.

The candidate character list generation unit 112 treats the extracted characters as third candidate characters for the external character to be processed. One or more third candidate characters are extracted. The candidate character list generation unit 112 stores the third candidate characters in the part candidate character list 16. The part candidate character list 16 is thereby generated for the external character to be processed. The candidate character list generation unit 112 notifies the display candidate character list generation unit 113 that the part candidate character list 16 has been generated.
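Extraction of the second and third candidate characters is an equality match on codes, which can be sketched as below. The dictionary layout, the field names `radical_code` and `part_code`, and all code values are hypothetical stand-ins for the glyph element dictionary 14 and the external character glyph element storage file 3.

```python
def candidates_by_code(target_code, glyph_dictionary, field):
    """Return dictionary characters whose `field` code equals the target code.

    field is "radical_code" for second candidates or "part_code" for third
    candidates; both names are assumptions made for this sketch.
    """
    return [
        char
        for char, elements in glyph_dictionary.items()
        if elements[field] == target_code
    ]


# Hypothetical glyph element dictionary with three standardized characters.
glyph_dictionary = {
    "JIS-1": {"arrangement": 1, "radical_code": "R-10", "part_code": "P-20"},
    "JIS-2": {"arrangement": 1, "radical_code": "R-10", "part_code": "P-21"},
    "JIS-3": {"arrangement": 3, "radical_code": "R-11", "part_code": "P-20"},
}

# Hypothetical glyph elements read for the external character being processed.
external = {"arrangement": 1, "radical_code": "R-10", "part_code": "P-20"}

radical_candidates = candidates_by_code(
    external["radical_code"], glyph_dictionary, "radical_code")
part_candidates = candidates_by_code(
    external["part_code"], glyph_dictionary, "part_code")
print(radical_candidates, part_candidates)
```

The two result lists correspond to the radical candidate character list 15 and the part candidate character list 16 respectively.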

When notified that the OCR candidate character list 13, the radical candidate character list 15, and the part candidate character list 16 have been generated, the display candidate character list generation unit 113 refers to these lists. If, as a result, a character common to the first, second, and third candidate characters exists, the display candidate character list generation unit 113 notifies the character information learning unit 114 of that character.

When no character is common to the first, second, and third candidate characters, the display candidate character list generation unit 113 generates the display candidate character list 17 from the OCR candidate character list 13, the radical candidate character list 15, and the part candidate character list 16. The display candidate character list 17 is generated by merging these three lists. The display candidate character list generation unit 113 causes the display unit 5 to display the display candidate character list 17, so that the display unit 5 shows the first to third candidate characters. Any output unit capable of outputting the display candidate character list 17 may be used in place of the display unit 5.

Here, the display candidate character list generation unit 113 assigns priorities to the first to third candidate characters according to how many of the lists contain each candidate. For example, a character that appears in any two of the OCR candidate character list 13, the radical candidate character list 15, and the part candidate character list 16 is given a higher priority than the other characters, that is, the characters that appear in only one of the three lists. Differences in priority are indicated by, for example, differences in display color or in the weight of the displayed characters.
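Merging the three lists and ranking candidates by how many lists contain them can be sketched as follows. The function name `build_display_list` and the tie-breaking rule (order of first appearance) are assumptions; the source states only that a candidate found in more lists receives a higher priority.

```python
from collections import Counter


def build_display_list(ocr_list, radical_list, part_list):
    """Merge three candidate lists into (character, list_count) pairs.

    Candidates found in more lists come first. Ties keep the order in which
    the candidates first appear, which is an assumption of this sketch.
    """
    counts = Counter()
    first_seen = {}
    for candidates in (ocr_list, radical_list, part_list):
        counts.update(set(candidates))  # count each list at most once
        for char in candidates:
            first_seen.setdefault(char, len(first_seen))
    return sorted(
        ((char, counts[char]) for char in counts),
        key=lambda item: (-item[1], first_seen[item[0]]),
    )


# Hypothetical candidate lists for one external character.
display_list = build_display_list(["A", "B"], ["B", "C"], ["D"])
print(display_list)
```

The count in each pair could then drive the display color or character weight mentioned above.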

As a result, an operator looking at the display candidate character list 17 on the display unit 5 can easily select the character with which the character being processed is to be identified. For example, in response to the operator's input, the keyboard 6, which is an input device, supplies the character information learning unit 114 with a selection input designating a candidate character displayed on the display unit 5.

The display candidate character list 17 may also be generated and displayed even when a character common to the first, second, and third candidate characters exists. In that case, the common character is given the highest priority.

The character information learning unit 114 determines, from among the first to third candidate characters, the identification target character, that is, the character with which the external character to be processed is identified. The external character to be processed is therefore the identification source character: the character that is treated as the same as another character, in other words, that is identified with another character. The first to third candidate characters are characters with which another character may be identified, in other words, characters that may become the identification target character.

Specifically, as described above, when a character common to the first, second, and third candidate characters exists, the character information learning unit 114 determines that common character as the identification target character in response to the notification from the display candidate character list generation unit 113. When no such common character exists, the character information learning unit 114 determines the identification target character based on the selection input, entered from the keyboard 6, that designates a candidate character displayed on the display unit 5.
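The two decision paths can be sketched as below. The function name `decide_target`, the `ask_operator` callback that stands in for the keyboard 6, and the rule of taking the first common character when several exist are assumptions; the source does not say what happens when more than one character is common to all three lists.

```python
def decide_target(ocr_list, radical_list, part_list, ask_operator):
    """Pick the identification target character for one external character.

    If a character appears in all three lists it is chosen automatically.
    Otherwise the merged candidates are handed to ask_operator, a callback
    standing in for the operator's selection input.
    """
    in_both = set(radical_list) & set(part_list)
    common = [char for char in ocr_list if char in in_both]
    if common:
        # Assumption: with several common characters, take the first in OCR order.
        return common[0]
    merged = list(dict.fromkeys(ocr_list + radical_list + part_list))
    return ask_operator(merged)


# Automatic path: "B" is in all three hypothetical lists.
auto = decide_target(["A", "B"], ["B"], ["B", "C"], ask_operator=lambda c: None)

# Operator path: nothing is common, so the callback chooses (here, the last one).
manual = decide_target(["A"], ["B"], ["C"], ask_operator=lambda c: c[-1])
print(auto, manual)
```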

The character information learning unit 114 treats the external character being processed as the identification source character, generates a character correspondence list 18 that associates the identification source character with the identification destination character, and outputs it as the character code conversion definition list 4. The character correspondence list 18, or the character code conversion definition list 4, is, for example, a list that associates an external character or external character code (the identification source character) with a character or JIS code (the identification destination character).

In addition, treating the external character being processed as the identification source character, the character information learning unit 114 adds the radical glyph element information or the part glyph element information of the identification source character to the radical glyph element information or the part glyph element information of the identification destination character, as learning element information for the identification destination character. As a result, the candidate character list generation unit 112 extracts the second and third candidate characters based on the added radical and part learning element information. Consequently, when the identification process is executed again for an identification source character that has already been identified to an identification destination character, that identification source character is, in effect, recognized as a candidate character.

FIG. 2 is a diagram illustrating an example of the external character file 2 and the external character glyph element storage file 3.

The external character file 2 includes character data 21 and a dot pattern 22 corresponding to the character data 21. The character data 21 and the dot pattern 22 are provided for each external character included in the external character file 2.

The character data 21 includes an identification source character and a storage destination address. The identification source character is a character that is identified to some other character, for example an external character. It may be represented by identification information (an external character code) that uniquely determines the external character. The storage destination address is the address at which the dot pattern 22 of the identification source character is stored. The dot pattern 22 represents the identification source character as a set of dots.

Note that the character "鉱" is in fact a character standardized by the JIS code and is not an external character. In this specification, however, it is used for explanatory purposes both as an example of a standardized character and as an example of an external character. In other words, "鉱" denotes either the standardized character "鉱" or the external character "鉱".

The external character glyph element storage file 3 includes a plurality of pieces of external character glyph element information 31A to 31D, one for each external character included in the external character file 2. Each piece of external character glyph element information 31A to 31D includes the identification source character, an arrangement pattern indicating the placement of the radical of the identification source character, radical glyph element information including a radical character code representing the radical, and part glyph element information including a partial character code representing the part of the identification source character. The identification source character may be represented by identification information (an external character code) that uniquely determines the external character. The external character file 2 and the external character glyph element storage file 3 are associated with each other because they contain the same identification source characters. In FIG. 2, the arrangement pattern is labeled "配置" (arrangement), the radical character code "部首" (radical), and the partial character code "部分" (part).

For example, the external character glyph element information 31A stores, for the identification source character "鉱", the arrangement pattern "1", the radical character code [金], and the partial character code [広]. In this specification, the radical character code of the radical "金" is written [金], and the partial character code of the part "広" is written [広].

FIG. 3 is a diagram illustrating an example of the character recognition dictionary 12 and the glyph element dictionary 14.

The character recognition dictionary 12 includes character data 121 and a dot pattern 122 corresponding to the character data 121. The character data 121 and the dot pattern 122 are provided for each character included in the character recognition dictionary 12. Each character in the character recognition dictionary 12 is a character to which other characters are to be identified (an identification destination character) and therefore, as described later, serves as a candidate for the identification destination character. The characters in the character recognition dictionary 12 are, for example, characters represented by predetermined character codes such as JIS codes, in other words, standardized characters.

Note that the character recognition dictionary 12 may also contain characters that are not standardized, in other words, external characters. The dictionary therefore contains at least the standardized characters and may additionally contain external characters.

The character data 121 includes an identification destination character and a storage destination address. The identification destination character is a character to which other characters are identified, for example a standardized character. It may be represented by identification information (a character code) that uniquely determines it. The storage destination address is the address at which the dot pattern 122 of the identification destination character is stored. The dot pattern 122 represents the identification destination character as a set of dots.

The glyph element dictionary 14 includes a plurality of glyph element structures 141, one for each character included in the character recognition dictionary 12. A glyph element structure 141 includes the identification destination character, the number of learned characters, an arrangement pattern indicating the placement of the radical of the identification destination character, radical glyph element information including a radical character code representing the radical, and part glyph element information including a partial character code representing the part of the identification destination character. The identification destination character may be represented by identification information (a character code) that uniquely determines it. The character recognition dictionary 12 and the glyph element dictionary 14 are associated with each other because they contain the same identification destination characters. In FIG. 3, the arrangement pattern is labeled "配置" (arrangement), the radical character code "部首" (radical), and the partial character code "部分" (part).

In practice, as shown in FIG. 3, the glyph element structure 141 includes a radical row and a part row. The radical row holds, for the radical, the number of learned characters, the arrangement pattern, and a plurality of radical character codes. The number of learned characters for the radical is the number of radical character codes contained in the radical row. The part row holds, for the part, the number of learned characters, the arrangement pattern, and a plurality of partial character codes. The number of learned characters for the part is the number of partial character codes contained in the part row. Within a single glyph element structure 141, the arrangement pattern has the same value throughout.

In the radical row, the radical character codes and arrangement patterns are stored in order from the head in array [0], array [1], and so on, which form the array of learning elements. Array elements in which no radical character code and arrangement pattern are stored are set to "NULL (empty)". Likewise, in the part row, the partial character codes and arrangement patterns are stored in order from the head in array [0], array [1], and so on, and array elements in which no partial character code and arrangement pattern are stored are set to "NULL (empty)".

For example, for the radical "金" of the identification destination character "鉱", the glyph element structure 141 stores "1" as the number of learned characters and stores the arrangement pattern "1" and the radical character code [金] in array [0]. For the part "広" of the identification destination character "鉱", it stores "1" as the number of learned characters and stores the arrangement pattern "1" and the partial character code [広] in array [0].

Here, the arrangement pattern and radical character code stored in array [0] represent the original (default) radical arrangement pattern and radical character code of the identification destination character stored in the glyph element structure 141. Similarly, the arrangement pattern and partial character code stored in array [0] represent the original (default) part arrangement pattern and partial character code of that character. In other words, the arrangement pattern, radical character code, and partial character code in array [0] are the default values of the identification destination character and are stored in advance.

In contrast, the arrangement patterns and radical character codes stored in array [1] onward are acquired by a learning process based on the character identification process, as are the arrangement patterns and partial character codes stored in array [1] onward. In other words, the values stored in array [1] onward represent identification source characters that have been identified to the identification destination character stored in the glyph element structure 141, and they are stored additionally as a result of the learning process.
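The layout of a glyph element structure described above can be modeled roughly as follows. This is a minimal sketch, not the data format of the specification: the class and field names (`ElementRow`, `GlyphElementStructure`, `slots`) are hypothetical, and the fixed-size array with "NULL (empty)" slots is simplified to a variable-length list.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# One array slot: (arrangement pattern, radical or partial character code).
Slot = Tuple[int, str]


@dataclass
class ElementRow:
    """One row (radical row or part row) of a glyph element structure."""
    slots: List[Slot] = field(default_factory=list)

    @property
    def learned_count(self) -> int:
        # The number of learned characters equals the number of stored codes.
        return len(self.slots)

    @property
    def default(self) -> Slot:
        # Array [0] holds the character's original (default) value.
        return self.slots[0]

    @property
    def learned(self) -> List[Slot]:
        # Array [1] onward holds values acquired by the learning process.
        return self.slots[1:]


@dataclass
class GlyphElementStructure:
    destination: str          # identification destination character
    radical_row: ElementRow
    part_row: ElementRow


# The entry for "鉱" before any learning, as in the example of FIG. 3.
kou = GlyphElementStructure("鉱", ElementRow([(1, "金")]), ElementRow([(1, "広")]))
```

Deriving the number of learned characters from the list length keeps it consistent with the stored codes, which matches the statement that the count is the number of codes in the row.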

FIG. 4 is a diagram illustrating an example of character identification.

In the example shown in FIG. 4, as noted above, the character "鉱" is both a character standardized by the JIS code and an external character. The external character "鉱", which is the first identification source character, is assumed to be a variant form of the standardized character "鉱". A variant form here means a character that is written the same way but has a different font (dot pattern). The second to fourth identification source characters are external characters, not characters standardized by the JIS code. FIGS. 5 to 9 below are described using the example of FIG. 4.

For example, the first identification source character (the external character "鉱") is an external character composed of the radical "金" and the part "広". In this case, because the dot patterns are similar, the character "鉱" is extracted as a first candidate character and included in the OCR candidate character list 13. Similarity of dot patterns is described later. Because the radicals match, the character "鉱" is also extracted as a second candidate character and included in the radical candidate character list 15. Because the parts match, the character "鉱" is also extracted as a third candidate character and included in the partial candidate character list 16. As a result, the character "鉱" is included in common in the first to third candidate characters, so the first identification source character is identified to the identification destination character "鉱".

The second identification source character is an external character composed of the radical "金" and the part "廣". In this case, because the dot patterns are not similar, the character "鉱" is not extracted as a first candidate character and is not included in the OCR candidate character list 13. Because the radicals match, the character "鉱" is extracted as a second candidate character and included in the radical candidate character list 15. Because the parts do not match, the character "鉱" is not extracted as a third candidate character and is not included in the partial candidate character list 16. As a result, the character "鉱" is included in the second candidate characters, so the second identification source character is identified to the identification destination character "鉱" in accordance with a selection input to the character information learning unit 114.

The third identification source character is an external character composed of the radical "石" and the part "広". In this case, because the dot patterns are not similar, the character "鉱" is not extracted as a first candidate character and is not included in the OCR candidate character list 13. Because the radicals do not match, the character "鉱" is not extracted as a second candidate character and is not included in the radical candidate character list 15. However, because the parts match, the character "鉱" is extracted as a third candidate character and included in the partial candidate character list 16. As a result, the character "鉱" is included in the third candidate characters, so the third identification source character is identified to the identification destination character "鉱" in accordance with a selection input to the character information learning unit 114.

The fourth identification source character is an external character composed of the radical "石" and the part "廣". In this case, because the dot patterns are not similar, the character "鉱" is not extracted as a first candidate character and is not included in the OCR candidate character list 13. Because the radicals do not match, the character "鉱" is not extracted as a second candidate character and is not included in the radical candidate character list 15. Because the parts do not match, the character "鉱" is not extracted as a third candidate character and is not included in the partial candidate character list 16.

Thus, initially, that is, before the learning process, the character "鉱" is not extracted as any of the first to third candidate characters for the fourth identification source character, so the fourth identification source character does not match the identification destination character "鉱". However, as described later with reference to FIG. 8, the learning process causes the character "鉱" to be included in the second and third candidate characters, so the fourth identification source character is identified to the identification destination character "鉱" in accordance with a selection input to the character information learning unit 114.
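The four cases above, and the way learning changes the outcome for the fourth character, can be traced with a small sketch. The dictionary layout (`entry`), the helper names `matches` and `learn`, and the representation of each identification source character as a (radical, part) pair are assumptions made for illustration; dot-pattern (OCR) matching is left out.

```python
# Hypothetical in-memory form of the dictionary entry for the destination "鉱".
# Index 0 of each list is the default value; later indices are learned values.
entry = {"radicals": ["金"], "parts": ["広"]}


def matches(source, entry):
    """Return (radical match, part match) for a (radical, part) source character."""
    radical, part = source
    return radical in entry["radicals"], part in entry["parts"]


def learn(source, entry):
    """Append the source's radical and part unless they are already stored."""
    radical, part = source
    if radical not in entry["radicals"]:
        entry["radicals"].append(radical)
    if part not in entry["parts"]:
        entry["parts"].append(part)


# First to fourth identification source characters of FIG. 4.
sources = [("金", "広"), ("金", "廣"), ("石", "広"), ("石", "廣")]

before = [matches(s, entry) for s in sources]  # state before any learning
for s in sources[:3]:                          # the first three are identified to "鉱"
    learn(s, entry)
after = matches(sources[3], entry)             # fourth character, after learning
```

Before learning, the fourth character matches neither the radical nor the part of "鉱". After the second character contributes the part "廣" and the third contributes the radical "石", it matches both.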

FIGS. 5 to 9 are diagrams illustrating an example of character identification. FIGS. 5 to 8 show the case where the first to fourth identification source characters described above are identified in this order. FIG. 9 shows the case where, after the first to fourth identification source characters have been identified, the fourth identification source character is identified again.

FIG. 5 shows the identification process for the first identification source character (the external character "鉱"), which is composed of the radical "金" and the part "広".

As described above, the character "鉱" is a character standardized by the JIS code and, as shown in FIG. 3, is stored in a glyph element structure 141 of the glyph element dictionary 14. The external character "鉱", a variant form of the character "鉱", has not yet been identified to any character. It is therefore stored, as an external character subject to the identification process, in the external character glyph element information 31A of the external character glyph element storage file 3, as shown in FIG. 2.

The external character file 2 stores a dot pattern for the first identification source character, which is an external character. The external character glyph element storage file 3 stores the external character glyph element information 31A for the first identification source character stored in the external character file 2. In this case, the external character glyph element information 31A stores, for the first identification source character, the arrangement pattern "1", radical glyph element information including the radical character code representing the radical "金", and part glyph element information including the partial character code representing the part "広".

For example, the OCR recognition unit 111 performs a character recognition process on the first identification source character, composed of the radical "金" and the part "広", using the character recognition dictionary 12. The OCR recognition unit 111 thereby extracts a plurality of characters with similar dot patterns as the first candidate characters for the first identification source character and stores them in the OCR candidate character list 13. The characters extracted by dot pattern similarity include the character "鉱".

Note that two dot patterns are regarded as similar when, for example, at least a predetermined ratio of their pixel values coincide. The ratio can be determined empirically and is set to a relatively small value. Dot pattern similarity may also cover cases where the radicals match or are similar, cases where the parts match or are similar, and so on. As shown in FIGS. 5 to 9, this allows a relatively large number of characters to be extracted as similar characters. The dot patterns of the JIS-standardized character "鉱" and the external character "鉱" differ from each other, but they are similar because the two are variant forms.
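The pixel-ratio criterion described above can be illustrated with the following sketch. The bitmap representation (rows of 0/1 values), the function names, and the default threshold of 0.6 are assumptions for illustration only; the specification says merely that the ratio is determined empirically and is relatively small.

```python
from typing import Sequence

Bitmap = Sequence[Sequence[int]]  # rows of 0/1 pixel values


def match_ratio(a: Bitmap, b: Bitmap) -> float:
    """Fraction of pixel positions whose values coincide in both dot patterns."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("dot patterns must have the same dimensions")
    total = sum(len(row) for row in a)
    if total == 0:
        raise ValueError("dot patterns must not be empty")
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / total


def is_similar(a: Bitmap, b: Bitmap, threshold: float = 0.6) -> bool:
    """Treat two dot patterns as similar when the match ratio reaches the threshold.

    The threshold value is a placeholder, not a value from the specification.
    """
    return match_ratio(a, b) >= threshold
```

A lower threshold admits more characters into the OCR candidate character list, which is consistent with the aim of extracting a relatively large number of similar characters.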

The candidate character list generation unit 112 also compares glyph elements for the first identification source character, specifically for its radical "金", using the glyph element dictionary 14. The candidate character list generation unit 112 thereby extracts a plurality of characters whose radical matches as the second candidate characters for the first identification source character and stores them in the radical candidate character list 15. The characters extracted by radical matching include the character "鉱".

At this time, in the glyph element structure 141 that holds the learning data of the identification destination character "鉱" in the glyph element dictionary 14, the radical "金" of the first identification source character is stored in array [0]. In this case, therefore, the candidate character list generation unit 112 stores the character "鉱" in the "candidate character" field of the radical candidate character list 15, not in the "learning candidate" field.

The candidate character list generation unit 112 likewise compares glyph elements for the first identification source character, specifically for its part "広", using the glyph element dictionary 14. The candidate character list generation unit 112 thereby extracts a plurality of characters whose part matches as the third candidate characters for the first identification source character and stores them in the partial candidate character list 16. The characters extracted by part matching include the character "鉱" (the above constitutes process #51).

At this time, in the glyph element structure 141 that holds the learning data of the identification destination character "鉱" in the glyph element dictionary 14, the part "広" of the first identification source character is stored in array [0]. In this case, therefore, the candidate character list generation unit 112 stores the character "鉱" in the "candidate character" field of the partial candidate character list 16, not in the "learning candidate" field.
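The split between the "candidate character" field and the "learning candidate" field can be sketched as follows. The text states only that a match against array [0] places the character in the "candidate character" field; placing a match against array [1] onward in the "learning candidate" field is an inference made for this sketch. The function name and the dictionary layout (destination character mapped to its list of codes, default first) are likewise hypothetical.

```python
from typing import Dict, List


def extract_candidates(code: str, dictionary: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Split dictionary characters whose code array contains `code` into two fields."""
    result: Dict[str, List[str]] = {"candidate": [], "learning_candidate": []}
    for destination, codes in dictionary.items():
        if code not in codes:
            continue  # no radical (or part) match for this dictionary character
        if codes[0] == code:
            # Matched the default value held in array [0].
            result["candidate"].append(destination)
        else:
            # Assumed: matched a value learned into array [1] onward.
            result["learning_candidate"].append(destination)
    return result
```

The same function serves both the radical row and the part row: pass the radical character code with a radical dictionary, or the partial character code with a part dictionary.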

Thereafter, because the same character "鉱" is included in common in the OCR candidate character list 13, the radical candidate character list 15, and the partial candidate character list 16, the display candidate character list generation unit 113 determines the character "鉱" as the identification destination character (process #52). The identification destination character "鉱" for the identification source character "鉱", which is an external character, is thereby determined.

Accordingly, when the same character "鉱" is included in each of the OCR candidate character list 13, the radical candidate character list 15, and the partial candidate character list 16, the character information learning unit 114 does not perform the learning process that would learn the first identification source character as learning data for the determined identification destination character "鉱" (process #53). In other words, the display candidate character list 17 is not generated and is not displayed on the display unit 5.

Specifically, in this case the radical character code [金] of the identification source character "鉱" is identical to the radical character code [金] stored in array [0] of the glyph element structure 141. The radical character code [金] of the identification source character "鉱" is therefore not added to the glyph element structure 141 as learning data for the identification destination character "鉱". Likewise, the partial character code [広] of the identification source character "鉱" is identical to the partial character code [広] stored in array [0] of the glyph element structure 141, so it is not added to the glyph element structure 141 as learning data for the identification destination character "鉱" either.

Consequently, in this case the first identification source character is not added to the glyph element structure 141 as learning data for the identification destination character "鉱". The glyph element structure 141 of the identification destination character "鉱" therefore still stores in array [0], exactly as before the first identification source character was identified, the arrangement pattern, the radical "金", and the part "広" of the identification destination character "鉱".

FIG. 6 shows the identification process for the second identification source character, which is composed of the radical "金" and the part "廣". As described above, the second identification source character is an external character, not a character standardized by the JIS code.

In this case, the external character glyph element information 31B stores, for the second identification source character, the arrangement pattern "1", radical glyph element information including the radical character code representing the radical "金", and part glyph element information including the partial character code representing the part "廣".

For example, the OCR recognition unit 111 performs a character recognition process on the second identification source character using the character recognition dictionary 12, extracts a plurality of characters with similar dot patterns as the first candidate characters for the second identification source character, and stores them in the OCR candidate character list 13. The candidate character list generation unit 112 compares glyph elements for the radical "金" of the second identification source character using the glyph element dictionary 14, extracts a plurality of characters whose radical matches as the second candidate characters, and stores them in the radical candidate character list 15. The candidate character list generation unit 112 also compares glyph elements for the part "廣" of the second identification source character using the glyph element dictionary 14, extracts a plurality of characters whose part matches as the third candidate characters, and stores them in the partial candidate character list 16 (process #61).

At this time, in the glyph element structure 141, which is the learning data of the identification destination character “鉱” in the glyph element dictionary 14, the radical “金” of the second identification source character is stored in array [0]. The candidate character list generation unit 112 therefore stores the character “鉱” in the “candidate character” field of the radical candidate character list 15, not in the “learning candidate” field.

Thereafter, because no character is common to the OCR candidate character list 13, the radical candidate character list 15, and the partial candidate character list 16, the display candidate character list generation unit 113 performs processing to determine the identification destination character. Specifically, it generates the display candidate character list 17 from these three lists and displays it on the display unit 5.

Seeing this, the operator enters, for example, the character “鉱” from the keyboard 6 as an instruction selecting the identification destination character. The character “鉱” may also be entered by selecting it from the characters displayed in the portion of the display candidate character list 17 that corresponds to the radical candidate character list 15. The same applies to FIGS. 8 and 9. In response to this instruction, the character information learning unit 114 determines the character “鉱” as the identification destination character (process #62). The identification destination character “鉱” for the second identification source character is thereby fixed.

Thereafter, the character information learning unit 114 performs learning processing in which the second identification source character is learned as learning data of the determined identification destination character “鉱” (process #63). In doing so, it adds an array of learned elements to the glyph element structure 141 of the character “鉱” stored in the glyph element dictionary 14.

In this case, the radical character code [金] of the second identification source character is the same as the radical character code [金] stored in array [0] of the glyph element structure 141. The radical character code [金] is therefore not added to the glyph element structure 141 as learning data of the identification destination character “鉱”. The partial character code [廣] of the second identification source character, on the other hand, differs from the partial character code [広] stored in array [0] of the glyph element structure 141. The partial character code [廣] is therefore added to the glyph element structure 141 as learning data of the identification destination character “鉱”.

From the above, as a result of learning the second identification source character, the glyph element structure 141 of the identification destination character “鉱” stores, in the “part” storage field of array [1], the arrangement pattern “1” and the part “廣” of the second identification source character. Because one partial character code has been newly stored in the “part” storage field of array [1], the number of learned characters for “part” becomes “2”. The “radical” storage field of array [1] is set to “NULL (empty)” because there is no radical character code to store, and since no new radical character code is stored there, the number of learned characters for “radical” remains “1”.

As a result of this learning processing, the glyph element “廣” is learned as a glyph element of the identification destination character “鉱”, so the second identification source character is recognized as a candidate for the character “鉱”. Any character having the glyph element “廣” is thereby treated as a candidate for the character “鉱”.

FIG. 7 shows the identification processing for a third identification source character composed of the radical “石” and the part “広”. As described above, the third identification source character is an external character, not a character standardized by the JIS code.

In this case, the external character glyph element information 31C stores, for the third identification source character, the arrangement pattern “1”, radical glyph element information including the radical character code representing the radical “石”, and part glyph element information including the partial character code representing the part “広”.

For example, the OCR recognition unit 111 performs character recognition on the third identification source character using the character recognition dictionary 12, extracts a plurality of characters with similar dot patterns as first candidate characters, and stores them in the OCR candidate character list 13. The candidate character list generation unit 112 compares glyph elements for the radical “石” of the third identification source character using the glyph element dictionary 14, extracts a plurality of characters whose radical matches as second candidate characters, and stores them in the radical candidate character list 15. Likewise, for the part “広” of the third identification source character, it extracts a plurality of characters whose part matches as third candidate characters and stores them in the partial candidate character list 16 (process #71).

At this time, in the glyph element structure 141, which is the learning data of the identification destination character “鉱” in the glyph element dictionary 14, the part “広” of the third identification source character is stored in array [0]. The candidate character list generation unit 112 therefore stores the character “鉱” in the “candidate character” field of the partial candidate character list 16, not in the “learning candidate” field.

Seeing this, the operator enters, for example, the character “鉱” from the keyboard 6 as an instruction selecting the identification destination character. In response to this instruction, the character information learning unit 114 determines the character “鉱” as the identification destination character (process #72). The identification destination character “鉱” for the third identification source character is thereby fixed.

Thereafter, the character information learning unit 114 performs learning processing in which the third identification source character is learned as learning data of the determined identification destination character “鉱” (process #73). In doing so, it adds an array of learned elements to the glyph element structure 141 of the character “鉱” stored in the glyph element dictionary 14.

In this case, the radical character code [石] of the third identification source character differs from the radical character code [金] stored in array [0] of the glyph element structure 141. The radical character code [石] is therefore added to the glyph element structure 141 as learning data of the identification destination character “鉱”. The partial character code [広] of the third identification source character, on the other hand, is the same as the partial character code [広] stored in array [0] of the glyph element structure 141. The partial character code [広] is therefore not added to the glyph element structure 141 as learning data of the identification destination character “鉱”.

From the above, as a result of learning the third identification source character, the glyph element structure 141 of the identification destination character “鉱” stores, in the “radical” storage field of array [1], the arrangement pattern “1” and the radical “石” of the third identification source character. Because one radical character code has been newly stored in the “radical” storage field of array [1], the number of learned characters for “radical” becomes “2”. Because no new partial character code is stored in the “part” storage field of array [1], the number of learned characters for “part” remains “2”.

As a result of this learning processing, the glyph element “石” is learned as a glyph element of the identification destination character “鉱”, so the third identification source character is recognized as a candidate for the character “鉱”. Any character having the glyph element “石” is thereby treated as a candidate for the character “鉱”.

FIG. 8 shows the identification processing for a fourth identification source character composed of the radical “石” and the part “廣”. As described above, the fourth identification source character is an external character, not a character standardized by the JIS code.

In this case, the external character glyph element information 31D stores, for the fourth identification source character, the arrangement pattern “1”, radical glyph element information including the radical character code representing the radical “石”, and part glyph element information including the partial character code representing the part “廣”.

For example, the OCR recognition unit 111 performs character recognition on the fourth identification source character using the character recognition dictionary 12, extracts a plurality of characters with similar dot patterns as first candidate characters, and stores them in the OCR candidate character list 13. The candidate character list generation unit 112 compares glyph elements for the radical “石” of the fourth identification source character using the glyph element dictionary 14, extracts a plurality of characters whose radical matches as second candidate characters, and stores them in the radical candidate character list 15. Likewise, for the part “廣” of the fourth identification source character, it extracts a plurality of characters whose part matches as third candidate characters and stores them in the partial candidate character list 16 (process #81).

At this time, in the glyph element structure 141, which is the learning data of the identification destination character “鉱” in the glyph element dictionary 14, the radical “石” of the fourth identification source character is stored in array [1]. The candidate character list generation unit 112 therefore stores the character “鉱” in the “learning candidate” field of the radical candidate character list 15, not in the “candidate character” field.

Similarly, in the glyph element structure 141, which is the learning data of the identification destination character “鉱” in the glyph element dictionary 14, the part “廣” of the fourth identification source character is stored in array [1]. The candidate character list generation unit 112 therefore stores the character “鉱” in the “learning candidate” field of the partial candidate character list 16, not in the “candidate character” field.

At this time, the character “鉱” is included in both the radical candidate character list 15 and the partial candidate character list 16. The character “鉱” is therefore a high-priority candidate and is displayed, for example, in a color different from that of lower-priority candidates.
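The prioritization described here can be sketched as a count of how many candidate lists contain each character. This is a minimal illustration only: the function name `rank_candidates`, the threshold of two lists for "high priority", and every list member other than “鉱” are assumptions, not details taken from the specification.

```python
from typing import Dict, List, Tuple


def rank_candidates(lists: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """Return (character, number of lists containing it), most widely shared first."""
    counts: Dict[str, int] = {}
    for chars in lists.values():
        for ch in dict.fromkeys(chars):  # de-duplicate within one list, keep order
            counts[ch] = counts.get(ch, 0) + 1
    # sorted() is stable, so characters with equal counts keep first-seen order.
    return sorted(counts.items(), key=lambda item: -item[1])


# Illustrative data only: the members other than "鉱" are made up.
ranked = rank_candidates({
    "ocr": ["礦"],
    "radical": ["砂", "鉱"],
    "part": ["鉱"],
})

# Assumption: a character found in two or more lists is shown as high priority
# (for example, in a different color).
high_priority = [ch for ch, n in ranked if n >= 2]
```

A display routine could then render the characters in `high_priority` in a distinct color and the rest in the default color.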

Seeing this, the operator enters, for example, the character “鉱” from the keyboard 6 as an instruction selecting the identification destination character. In response to this instruction, the character information learning unit 114 determines the character “鉱” as the identification destination character (process #82). The identification destination character “鉱” for the fourth identification source character is thereby fixed.

Thereafter, the character information learning unit 114 performs learning processing in which the fourth identification source character is learned as learning data of the determined identification destination character “鉱” (process #83). In doing so, it attempts to add an array of learned elements to the glyph element structure 141 of the character “鉱” stored in the glyph element dictionary 14.

In this case, the radical character code [石] of the fourth identification source character is the same as the radical character code [石] stored in array [1] of the glyph element structure 141. The radical character code [石] is therefore not added to the glyph element structure 141 as learning data of the identification destination character “鉱”. Likewise, the partial character code [廣] of the fourth identification source character is the same as the partial character code [廣] stored in array [1] of the glyph element structure 141. The partial character code [廣] is therefore not added to the glyph element structure 141 as learning data of the identification destination character “鉱”.

From the above, the learning processing for the fourth identification source character is executed, but no new radical character code or partial character code is stored in any storage field of the glyph element structure 141 of the identification destination character “鉱”. The numbers of learned characters for “radical” and “part” therefore both remain “2”.

As a result of this learning processing, the fourth identification source character is recognized as a candidate for the character “鉱”, because the glyph element “石” has been learned as a glyph element of the identification destination character “鉱”.
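The learning steps for the second through fourth identification source characters can be traced with a small model of the glyph element structure 141. This is a sketch under stated assumptions: the class `GlyphElementStructure`, its two independent lists, and the `rows()` view of array [0], array [1], and so on are hypothetical names chosen for illustration, and the number of learned characters is modeled simply as the length of each list.

```python
from dataclasses import dataclass, field
from itertools import zip_longest
from typing import List, Optional, Tuple

Element = Tuple[str, str]  # (arrangement pattern, element character code)


@dataclass
class GlyphElementStructure:
    """Hypothetical model of glyph element structure 141 for one destination character."""
    radicals: List[Element] = field(default_factory=list)
    parts: List[Element] = field(default_factory=list)

    def learn(self, pattern: str, radical: str, part: str) -> None:
        # An element is added only when an identical one is not already stored.
        if (pattern, radical) not in self.radicals:
            self.radicals.append((pattern, radical))
        if (pattern, part) not in self.parts:
            self.parts.append((pattern, part))

    def rows(self) -> List[Tuple[Optional[Element], Optional[Element]]]:
        # Row i corresponds to array [i]; None plays the role of NULL (empty).
        return list(zip_longest(self.radicals, self.parts))


# Array [0] holds the elements of the destination character "鉱" itself.
kou = GlyphElementStructure(radicals=[("1", "金")], parts=[("1", "広")])

kou.learn("1", "金", "廣")  # second source character: only the part is new
after_second = kou.rows()

kou.learn("1", "石", "広")  # third source character: only the radical is new
kou.learn("1", "石", "廣")  # fourth source character: nothing new is stored
after_fourth = kou.rows()
```

After the second character, array [1] has an empty radical field and the part “廣”; after the third and fourth, both fields of array [1] are filled and both counts stay at 2, matching the walk-through above.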

FIG. 9 shows the identification processing for a new identification source character composed of the radical “金” and the part “廣”. The new identification source character is an external character, not a character standardized by the JIS code. It is assumed to be a variant form of the second identification source character described above, which is composed of the radical “金” and the part “廣”.

In this case, the external character glyph element information 31E stores, for the new identification source character, the arrangement pattern “1”, radical glyph element information including the radical character code representing the radical “金”, and part glyph element information including the partial character code representing the part “廣”.

For example, the OCR recognition unit 111 performs character recognition on the new identification source character using the character recognition dictionary 12, extracts a plurality of characters with similar dot patterns as first candidate characters, and stores them in the OCR candidate character list 13. The candidate character list generation unit 112 compares glyph elements for the radical “金” of the new identification source character using the glyph element dictionary 14, extracts a plurality of characters whose radical matches as second candidate characters, and stores them in the radical candidate character list 15. Likewise, for the part “廣” of the new identification source character, it extracts a plurality of characters whose part matches as third candidate characters and stores them in the partial candidate character list 16 (process #91).

At this time, in the glyph element structure 141, which is the learning data of the identification destination character “鉱” in the glyph element dictionary 14, the radical “金” of the new identification source character is stored in array [0]. The candidate character list generation unit 112 therefore stores the character “鉱” in the “candidate character” field of the radical candidate character list 15, not in the “learning candidate” field.

Thereafter, because the same character “鉱” is included in each of the OCR candidate character list 13, the radical candidate character list 15, and the partial candidate character list 16, the display candidate character list generation unit 113 determines the character “鉱” as the identification destination character (process #92). The identification destination character “鉱” for the new identification source character, which is an external character, is thereby fixed.

Accordingly, the character information learning unit 114 does not perform learning processing in which the new identification source character is learned as learning data of the determined identification destination character “鉱” (process #93), and the display candidate character list 17 is neither generated nor displayed on the display unit 5.

Specifically, in this case, the radical character code [金] of the identification source character is the same as the radical character code [金] stored in array [0] of the glyph element structure 141. The radical character code [金] of the identification source character is therefore not added to the glyph element structure 141 as learning data of the identification destination character “鉱”. Likewise, the partial character code [廣] of the identification source character is the same as the partial character code [廣] stored in array [1] of the glyph element structure 141. The partial character code [廣] is therefore not added to the glyph element structure 141 as learning data of the identification destination character “鉱”. From the above, the new identification source character is not added to the glyph element structure 141 as learning data of the identification destination character “鉱”.

FIG. 10 shows the character identification processing flow.

For example, the operator generates glyph elements for the characters stored in the external character file 2 (step S1). The external character glyph element storage file 3 corresponding to the external character file 2 is thereby obtained. Thereafter, the identification processing unit 11 checks whether identification processing has been completed for all characters (external characters) stored in the external character file 2 (step S2).

If identification processing has not been completed for all characters (step S2: No), the identification processing unit 11 selects and reads the character pattern of one character from the external character file 2 as the character to be processed, and reads the radical glyph element information and the part glyph element information for that character from the external character glyph element storage file 3 (step S3).

The identification processing unit 11 performs identification processing on the character to be processed (step S4) and then returns to step S2.

If identification processing has been completed for all characters in step S2 (step S2: Yes), the identification processing unit 11 generates the character code conversion definition list 4 based on the correspondence between the characters of the source external character file 2 and their identification destination characters (step S5).
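The loop of steps S2 through S5 can be sketched as follows. This is only an outline under assumptions: `build_conversion_list` and its `identify` callback are hypothetical names, the per-character identification of step S4 is passed in as a function, and the code values in the example are made-up private-use code points rather than values from the specification.

```python
from typing import Callable, Dict, Iterable


def build_conversion_list(
    gaiji_codes: Iterable[int],
    identify: Callable[[int], str],
) -> Dict[int, str]:
    """Map each external character code to its identification destination character."""
    conversion: Dict[int, str] = {}
    for code in gaiji_codes:                # S2/S3: take the next unprocessed character
        conversion[code] = identify(code)   # S4: identify that one character
    return conversion                       # S5: the conversion definition list


# Illustrative stand-in for step S4; the codes and results are made up.
fake_results = {0xE000: "鉱", 0xE001: "鉱", 0xE002: "辺"}
table = build_conversion_list(fake_results, fake_results.__getitem__)
```

Several external characters may map to the same destination character, which is the purpose of the identification work.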

FIGS. 11 and 12 show the identification processing flow for one character.

The OCR recognition unit 111 performs character recognition on the character to be processed using the character recognition dictionary 12 (step S11) and generates the OCR candidate character list 13 as the recognition result (step S12).

The candidate character list generation unit 112 checks whether the radical is absent (step S13). If a radical exists (step S13: No), the candidate character list generation unit 112 generates the radical candidate character list 15 and the partial candidate character list 16 using the glyph elements (step S14). If no radical exists (step S13: Yes), step S14 is skipped.

Thereafter, the display candidate character list generation unit 113 generates the display candidate character list (step S15). At this time, a character that appears in every candidate character list is treated as a high-accuracy candidate and given a higher priority.

Thereafter, the display candidate character list generation unit 113 checks whether there is a character that appears in every candidate character list (step S16). If no character is common to all candidate character lists (step S16: No), the display candidate character list generation unit 113 displays the candidate character list on the display unit 5 (step S17). Based on the selection entered by the operator who sees this, the character information learning unit 114 fixes the identification destination character (step S18) and causes the glyph element dictionary 14 entry of the identification destination character to learn the glyph element information of the identification source character (step S19).

If, in step S16, there is a character common to all candidate character lists (step S16: Yes), steps S17 to S19 are skipped and that common character is fixed as the identification destination character.
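The branch at step S16 can be sketched as an intersection test over the candidate lists. The sketch makes several assumptions that the specification does not state: the function `identify_one` and its callbacks are hypothetical, at least one list (the OCR list) is always supplied, and when more than one character is common to all lists the first one in OCR order is taken.

```python
from typing import Callable, List, Sequence


def identify_one(
    candidate_lists: Sequence[List[str]],
    ask_operator: Callable[[Sequence[List[str]]], str],
    learn: Callable[[str], None],
) -> str:
    """Decide the destination character for one source character (steps S16 to S19)."""
    first, *rest = candidate_lists  # assumes the OCR list is always present
    common = [ch for ch in first if all(ch in other for other in rest)]
    if common:
        # S16 Yes: fixed automatically, with no display and no learning.
        return common[0]  # assumption: take the first common character
    chosen = ask_operator(candidate_lists)  # S17, S18: display and operator selection
    learn(chosen)                           # S19: learn the source character's elements
    return chosen


learned: List[str] = []
# "鉱" is in all three lists, so the operator callback is never consulted.
auto = identify_one([["鉱", "砿"], ["鉱"], ["鉱"]], lambda lists: "?", learned.append)
# No common character, so the (simulated) operator picks "鉱" and learning runs.
manual = identify_one([["砿"], ["鉱"], ["拡"]], lambda lists: "鉱", learned.append)
```

Here learning is triggered only on the manual path, which mirrors the FIG. 9 case where a fully matched character is fixed without further learning.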

FIG. 13 shows the processing flow for candidate character list generation.

The candidate character list generation unit 112 checks whether all characters in the set of identification destination characters have been processed (step S21). If not (step S21: No), it consults the glyph element dictionary 14 for the next destination character: if the arrangement pattern and the radical character code match, it adds that character to the radical candidate character list 15 (step S22), and if the arrangement pattern and the partial character code match, it adds that character to the partial candidate character list 16 (step S23). It then returns to step S21.

If all characters in the set of identification destination characters have been processed in step S21 (step S21: Yes), the processing ends.
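Steps S21 through S23, together with the “candidate character” and “learning candidate” fields used in the earlier examples, can be sketched as below. The dictionary layout, the function names `match_field` and `generate_candidate_lists`, and the entry for “砂” are assumptions made for illustration; index 0 of each list stands for array [0] (the destination character's own elements) and later indices stand for learned elements.

```python
from typing import Dict, List, Optional, Tuple

Element = Tuple[str, str]  # (arrangement pattern, element character code)


def match_field(source: Element, stored: List[Element]) -> Optional[str]:
    """Return which field a match belongs to, or None when nothing matches."""
    if source not in stored:
        return None
    # A match at index 0 is an original element; any later index was learned.
    return "candidate" if stored.index(source) == 0 else "learning_candidate"


def generate_candidate_lists(pattern: str, radical: str, part: str,
                             dictionary: Dict[str, Dict[str, List[Element]]]):
    radical_list = {"candidate": [], "learning_candidate": []}
    part_list = {"candidate": [], "learning_candidate": []}
    for dest, entry in dictionary.items():                    # S21: next destination
        r = match_field((pattern, radical), entry["radicals"])
        if r:
            radical_list[r].append(dest)                      # S22
        p = match_field((pattern, part), entry["parts"])
        if p:
            part_list[p].append(dest)                         # S23
    return radical_list, part_list


# State of "鉱" after the second and third source characters were learned;
# the "砂" entry is illustrative only.
glyph_dictionary = {
    "鉱": {"radicals": [("1", "金"), ("1", "石")], "parts": [("1", "広"), ("1", "廣")]},
    "砂": {"radicals": [("1", "石")], "parts": [("1", "少")]},
}
# Fourth source character: radical "石", part "廣".
radical_list, part_list = generate_candidate_lists("1", "石", "廣", glyph_dictionary)
```

For the fourth source character, “鉱” lands in the “learning candidate” field of both lists, as in the FIG. 8 walk-through.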

FIG. 14 shows the processing flow for candidate character learning.

In the glyph element dictionary 14, the character information learning unit 114 adds the arrangement pattern and the radical character code of the identification source character to the radical learning list of the identification destination character, and increments the number of learned characters (step S31).

In the glyph element dictionary 14, the character information learning unit 114 also adds the arrangement pattern and the partial character code of the identification source character to the part learning list of the identification destination character, and increments the number of learned characters (step S32).

Thereafter, the character information learning unit 114 registers the character pattern data of the identification source character in the character recognition dictionary 12 and trains the dictionary so that the identification destination character becomes a candidate character (step S33).
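Steps S31 to S33 can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation in the specification: the `LearningEntry` record, the `learn_identification` function, the use of a separate learned-character counter for the radical and part learning lists, and the representation of the character recognition dictionary as a mapping from a destination character to a list of registered patterns are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class LearningEntry:
    """Hypothetical learning data held per identification destination character."""
    radical_learning: list = field(default_factory=list)
    part_learning: list = field(default_factory=list)
    radical_learned_count: int = 0
    part_learned_count: int = 0


def learn_identification(glyph_dictionary, recognition_dictionary,
                         destination_char, arrangement,
                         radical_code, part_code, source_pattern):
    """Record what was learned from one confirmed identification."""
    entry = glyph_dictionary.setdefault(destination_char, LearningEntry())

    # S31: add the arrangement pattern and radical character code of the
    # identification source character, then increment the learned count.
    entry.radical_learning.append((arrangement, radical_code))
    entry.radical_learned_count += 1

    # S32: add the arrangement pattern and partial character code of the
    # identification source character, then increment the learned count.
    entry.part_learning.append((arrangement, part_code))
    entry.part_learned_count += 1

    # S33: register the source character's pattern data under the destination
    # character so that later recognition can return it as a candidate.
    recognition_dictionary.setdefault(destination_char, []).append(source_pattern)
    return entry


# Hypothetical data: an external character identified as "松".
glyph_dictionary = {}
recognition_dictionary = {}
dot_pattern = b"\x18\x3c\x7e\xff"  # placeholder bytes, not a real glyph bitmap

entry = learn_identification(glyph_dictionary, recognition_dictionary,
                             "松", "left-right", "木", "公", dot_pattern)
print(entry.radical_learning)  # [('left-right', '木')]
print(entry.part_learning)     # [('left-right', '公')]
```

Keeping the learned entries in lists attached to the destination character means that a later candidate search can also match an external character against previously identified variants.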

DESCRIPTION OF SYMBOLS
1 Character identification device
2 External character file
3 External character glyph element storage file
4 Character code conversion definition list
5 Display unit
6 Keyboard
11 Identification processing unit
12 Character recognition dictionary
13 OCR candidate character list
14 Glyph element dictionary
15 Radical candidate character list
16 Partial candidate character list
17 Display candidate character list
18 Identification source / identification destination character correspondence list
111 OCR recognition unit
112 Candidate character list generation unit
113 Display candidate character list generation unit
114 Character information learning unit

Claims

A character identification device comprising:
a character recognition storage unit that stores dot patterns of characters;
a glyph element storage unit that stores, for the characters stored in the character recognition storage unit, glyph element information of a radical, including an arrangement pattern indicating the arrangement of the radical and a radical character code representing the radical, and glyph element information of a part, including a partial character code representing a part other than the radical;
an external character storage unit that stores dot patterns of external characters, which are characters not included in the standardized characters, each of which is represented by a character code representing a predetermined character;
an external character glyph element storage unit that stores, for the external characters stored in the external character storage unit, glyph element information of a radical, including an arrangement pattern indicating the arrangement of the radical and a radical character code representing the radical, and glyph element information of a part, including a partial character code representing a part other than the radical;
an OCR recognition unit that, for an external character to be processed selected from the external character storage unit, extracts a first candidate character for identifying the external character to be processed from the characters stored in the character recognition storage unit, based on the dot pattern of the external character stored in the external character storage unit and the dot patterns of the characters stored in the character recognition storage unit; and
a candidate character list generation unit that, for the external character to be processed, extracts a second candidate character for identifying the external character to be processed from the characters stored in the glyph element storage unit, based on the glyph element information of the radical for the characters stored in the glyph element storage unit and the glyph element information of the radical for the external character stored in the external character glyph element storage unit, and extracts a third candidate character for identifying the external character to be processed from the characters stored in the glyph element storage unit, based on the glyph element information of the part for the characters stored in the glyph element storage unit and the glyph element information of the part for the external character stored in the external character glyph element storage unit.

The character identification device according to claim 1, wherein the characters stored in the character recognition storage unit include the standardized characters, or the standardized characters and the external characters.

The character identification device according to claim 1, further comprising:
a display candidate character list generation unit that assigns priorities to candidate characters according to the degree to which they appear redundantly among the first to third candidate characters.

The character identification device, further comprising a character information learning unit that determines, from among the first to third candidate characters, an identification destination character, which is the character with which the external character to be processed is identified.

The character identification device according to claim 4, wherein, when there is a character common to the first to third candidate characters, the character information learning unit determines the common character as the identification destination character.

The character identification device according to claim 4, further comprising:
an output unit that outputs the first to third candidate characters,
wherein the character information learning unit determines the identification destination character based on a selection input designating a candidate character output to the output unit.

The character identification device according to claim 4, wherein the character information learning unit generates a character correspondence list that associates an identification source character with the identification destination character, the external character to be processed being the identification source character.

The character identification device according to claim 4, wherein the character information learning unit, taking the external character to be processed as an identification source character, adds the glyph element information of the radical or the glyph element information of the part for the identification source character, as learning element information for the identification destination character, to the glyph element information of the radical or the glyph element information of the part of the identification destination character.

The character identification device according to claim 8, wherein the candidate character list generation unit extracts the second candidate character and the third candidate character based on the added learning element information of the radical and the added learning element information of the part.

A character identification method comprising causing a computer to execute:
a process of storing, in a glyph element storage unit, for characters, glyph element information of a radical, including an arrangement pattern indicating the arrangement of the radical and a radical character code representing the radical, and glyph element information of a part, including a partial character code representing a part other than the radical;
a process of storing the dot patterns of the characters stored in the glyph element storage unit in a character recognition storage unit;
a process of storing, in an external character storage unit, dot patterns of external characters, which are characters not included in the standardized characters, each of which is represented by a character code representing a predetermined character;
a process of storing, in an external character glyph element storage unit, for the external characters stored in the external character storage unit, glyph element information of a radical, including an arrangement pattern indicating the arrangement of the radical and a radical character code representing the radical, and glyph element information of a part, including a partial character code representing a part other than the radical;
a process of extracting, for an external character to be processed selected from the external character storage unit, a first candidate character for identifying the external character to be processed from the characters stored in the glyph element storage unit, based on the dot pattern of the external character stored in the external character storage unit and the dot patterns of the characters stored in the character recognition storage unit; and
a process of extracting, for the external character to be processed, a second candidate character for identifying the external character to be processed from the characters stored in the glyph element storage unit, based on the glyph element information of the radical for the characters stored in the glyph element storage unit and the glyph element information of the radical for the external character stored in the external character glyph element storage unit, and extracting a third candidate character for identifying the external character to be processed from the characters stored in the glyph element storage unit, based on the glyph element information of the part for the characters stored in the glyph element storage unit and the glyph element information of the part for the external character stored in the external character glyph element storage unit.