JPH01183795A

JPH01183795A - Dictionary consulting system in post-processing for document reader

Info

Publication number: JPH01183795A
Application number: JP63007709A
Authority: JP
Inventors: Noriyasu Takao; 高尾　哲康; Fumito Nishino; 文人西野; Yuji Uchida; 裕士内田
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1988-01-18
Filing date: 1988-01-18
Publication date: 1989-07-21
Anticipated expiration: 2012-01-16
Also published as: JP2570784B2

Abstract

PURPOSE: To reduce the number of word dictionary lookups and improve processing speed by retrieving from the word dictionary, in a single lookup per character, all words that begin with each character of the input candidate character table. CONSTITUTION: The word dictionary 12 stores word information such as word notation, part-of-speech information, and information on connections with other words. The dictionary 12 is sorted in dictionary order, so words beginning with the same leading character are located in one place in the dictionary. A dictionary retrieving means 111 retrieves, for every character of the input candidate character table, all words beginning with that character in a single dictionary lookup. From the retrieved words, only those whose characters are all present in the candidate character table are selected. This reduces the amount of processing and the time required for word matching in post-processing.

Description

[Detailed Description of the Invention] [Summary] The invention relates to a post-processing device for a document reader that reads Japanese documents and converts them into a machine-readable format, and aims to improve processing speed by reducing the number of word dictionary searches. The post-processing device uses word information, grammar information, context information, related-word information, and the like to uniquely determine, from the candidate character table output by the character recognition device of the document reader, characters that the character recognition device could not uniquely determine. Its word matching unit, which matches words formed by combining characters in the candidate character table against a word dictionary storing word information such as word notation and part-of-speech information, is configured to include a dictionary search means that, for every character in the input candidate character table, retrieves from the word dictionary all words beginning with that character in a single dictionary lookup, and a word selection means that selects, from the retrieved words, only those whose characters all appear in the candidate character table.

[Industrial application field]

The present invention relates to a document reading and recognition device (document reader device), and in particular to a document reader post-processing device for Japanese text.

Document reader devices are in growing demand as devices that convert large volumes of printed matter, publications, forms, handwritten documents, and the like, written in human-readable characters, into a machine-readable format. Although the requirement depends on the volume of documents to be processed, processing that is as fast as possible is desired.

The document reader post-processing device has the function of uniquely determining characters that the character recognition device could not uniquely determine, using word information, grammar information, context information, related-word information, and the like on the candidate character strings output by the character recognition device of the document reader.

Of the various kinds of information used by the document reader post-processing device, the present invention concerns word information: it relates to a method for performing at high speed the matching against a word dictionary that is required when word information is used.

[Conventional technology]

Word matching in a conventional document reader post-processing device checked every word that could be formed by combining the candidate characters obtained from the character recognition device against the word dictionary to see whether it existed there.

[Problem to be solved by the invention]

In the document reader post-processing device, words formed by combining the candidate characters obtained from the character recognition device are matched against a word dictionary. The character recognition device usually yields several candidates for each character: a first-place candidate, a second-place candidate, ..., an n-th place candidate (n being an arbitrary positive integer). The number of words that can be formed by simply combining the candidate characters is as follows.

$n^1 + n^2 + \cdots + n^k$

Here, k is the number of characters in the word.

However, k cannot in practice be allowed to grow without bound. The maximum value of k is therefore generally fixed when word candidates are determined, based on the empirical observation that a boundary between character types (hiragana, katakana, symbols, mathematical expressions, kanji) often coincides with a word boundary.

For example, as shown in FIG. 6, when six kanji characters are delimited by the boundaries described above and four candidates are obtained for each character, the conventional method looked up every combination in the word dictionary. That is, it searched one by one for whether 価, 廊, 晒, 版 exist as one-character words; whether 価格, 価栢, 価捲, 価椅, 廊格, 廊栢, ... exist as two-character words; whether 価格対, 価格栢, 価格捲, 価格椅, 価栢対, ... exist as three-character words; whether 価格対性, 価格対性, ... exist as four-character words; whether 価格対性能, 価格対性能, ... exist as five-character words; and whether 価格対性熊比, 価格対性熊此, ... exist as six-character words. The number of lookups is $4^1 + 4^2 + 4^3 + 4^4 + 4^5 + 4^6 = 5460$.
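The lookup count for the conventional method can be checked with a short calculation. The sketch below is illustrative only: the function name `conventional_lookup_count` is hypothetical, and it assumes, as the text does, that every position has the same number of candidates n and that words of length 1 to k starting at the first position are all tried.

```python
def conventional_lookup_count(n: int, k: int) -> int:
    """Count candidate strings of length 1..k when each position has n candidates.

    This is n^1 + n^2 + ... + n^k, the number of dictionary lookups the
    conventional method performs under the assumptions stated above.
    """
    return sum(n ** length for length in range(1, k + 1))


# Six kanji positions with four candidates each, as in the FIG. 6 example.
lookups = conventional_lookup_count(4, 6)
print(lookups)
```

With n = 4 and k = 6 this reproduces the figure of 5460 given in the text.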

Thus, when the maximum number of characters in a word is large and the number of candidate characters is large, the number of words that can be formed by combination grows, and the throughput of post-processing deteriorates.

In addition, the empirical rule that character-type boundaries often coincide with word boundaries has exceptions, such as 「ろ過」 (filtration), 「お手伝い」 (help), and 「Ａ級」 (class A), so in such cases the word matching itself may fail.

The problem to be solved by the present invention is to provide a dictionary lookup method for document reader post-processing that eliminates these problems of the conventional approach.

[Means to solve the problem]

FIG. 1 is a block diagram showing the principle of the dictionary lookup method in document reader post-processing according to the present invention.

In the figure, 11 is a word matching unit, which matches words formed by combining characters in the input candidate character table against the word dictionary.

12 is a word dictionary, which stores word information such as word notation, part-of-speech information, and information on connections with other words.

111 is a dictionary search means which, for every character in the input candidate character table, retrieves from the word dictionary 12 all words beginning with that character in a single dictionary lookup.

112 is a word selection means, which selects from the retrieved words only those whose characters all appear in the candidate character table.

[Operation]

According to the configuration of the present invention, all words beginning with a candidate character are retrieved in a single dictionary lookup, which reduces accesses to the word dictionary. Because the word dictionary is sorted in dictionary order, words beginning with the same leading character are stored together in one place in the dictionary. Therefore, even when the dictionary being searched resides on secondary storage (a magnetic disk device or the like), the number of input/output operations can be greatly reduced compared with the conventional method.

This greatly reduces the amount of processing and the time required for word matching in post-processing.
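The benefit of a sorted dictionary can be illustrated with a small in-memory sketch. This is not the patent's on-disk block layout: it assumes the dictionary is a Python list sorted by Unicode code point (standing in for "dictionary order"), and the helper name `words_starting_with` and the miniature word list are hypothetical. The point it shows is that all words sharing a leading character form one contiguous range, so a single range lookup retrieves them all.

```python
from bisect import bisect_left


def words_starting_with(sorted_words, key_char):
    """Return the contiguous slice of words whose first character is key_char.

    Assumes sorted_words is sorted by Unicode code point (Python's default
    string ordering), a stand-in for the dictionary order in the text.
    """
    lo = bisect_left(sorted_words, key_char)
    # Every word starting with key_char sorts before the next code point.
    hi = bisect_left(sorted_words, chr(ord(key_char) + 1))
    return sorted_words[lo:hi]


# Hypothetical miniature dictionary built from words quoted in this document.
dictionary = sorted(["価", "価格", "価値", "価値づけ", "廊下", "晒", "晒し", "版", "版画"])
hits = words_starting_with(dictionary, "価")
print(hits)
```

On secondary storage the same idea would correspond to reading the block or blocks that hold that contiguous range, rather than issuing one access per candidate word.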

[Example]

The present invention is described more specifically below with reference to the embodiment shown in FIGS. 2 to 5.

FIG. 2 shows the configuration of a document reader device as the system configuration of an embodiment of the present invention.

In the figure, 6 is an image scanner that reads general printed documents and handwritten documents as images.

5 is a character block segmentation device, which cuts out the character blocks containing text from the image obtained from the image scanner 6.

4 is a line segmentation device, which cuts out line blocks from the character blocks.

3 is a character segmentation device, which cuts out characters one at a time from a line block.

2 is a character recognition device, which performs character recognition and, for each character, outputs a group of candidate characters as the first-place candidate, second-place candidate, ..., n-th place candidate (n being an arbitrary number), attaching to each candidate a distance value (a weighting quantity; the smaller the value, the closer the candidate is judged to be to the correct answer).

1 is a document reader post-processing device using the dictionary lookup method of the present invention, which determines the characters judged to be correct from the candidate character strings obtained from the character recognition device 2.

FIG. 3 shows the configuration of the document reader post-processing device according to an embodiment of the present invention.

In the figure, 16 is a candidate character string input unit, which receives the candidate character strings output by the character recognition device 2 and stores them in the work area of the post-processing main body 15. The stored data is called the candidate character table.

15 is the post-processing main body, which is the main control unit for the post-processing functions.

17 is a post-processing unit of the post-processing device, which performs further processing on words that could not be determined by the post-processing main body 15.

11 is a word matching unit employing the dictionary lookup method of the present invention; it matches words formed by combining characters in the candidate character table against the word dictionary 12. The word dictionary 12 stores word notation and part-of-speech information, together with the part-of-speech information of words that may be adjacent to each word (called adjacency information).

13 is a grammar matching unit which, based on the adjacency information obtained from the word matching unit 11, refers to the grammar dictionary 14 and checks whether words can be adjacent to one another. The grammar dictionary 14 stores, for each part of speech, the part-of-speech information of words that may be adjacent to it.

FIG. 4 is a flowchart showing the processing of the word matching unit of the document reader post-processing device according to an embodiment of the present invention.

(1) First, the word list is reset.

(2) Steps (3) and (4) are performed for every candidate character at the current determination position in the candidate character table (the position in the candidate character table of the word to be processed next).

(3) The word dictionary is searched using the key character, that is, the one candidate character about to be processed, and all words beginning with the key character are retrieved. In other words, one block (usually 512 to 4096 bytes) of the dictionary area containing the words that begin with that leading character is read.

(4) From all the words retrieved in step (3), only those whose characters all appear in the candidate character table are selected, added to the word list, and output.
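The four steps of the flowchart can be sketched as follows. This is a minimal illustration, not the patent's implementation. It assumes the candidate character table is a list of per-position candidate lists, that the dictionary is an in-memory list sorted by code point, and that "all characters appear in the candidate character table" means the i-th character of a word must be among the candidates at position + i (an interpretation, not stated explicitly in the text). The names `match_words` and `words_starting_with` are hypothetical.

```python
from bisect import bisect_left


def words_starting_with(sorted_words, key_char):
    """Single range lookup: all words whose first character is key_char."""
    lo = bisect_left(sorted_words, key_char)
    hi = bisect_left(sorted_words, chr(ord(key_char) + 1))
    return sorted_words[lo:hi]


def match_words(candidate_table, position, sorted_words):
    """Return dictionary words that start at `position` and fit the table."""
    word_list = []                                    # step (1): reset the list
    for key_char in candidate_table[position]:        # step (2): each candidate
        retrieved = words_starting_with(sorted_words, key_char)  # step (3)
        for word in retrieved:                        # step (4): selection
            fits_length = position + len(word) <= len(candidate_table)
            if fits_length and all(
                ch in candidate_table[position + i] for i, ch in enumerate(word)
            ):
                word_list.append(word)
    return word_list


# Hypothetical table with the first three positions of the example in the text.
table = [["価", "廊", "晒", "版"], ["格", "栢", "捲", "椅"], ["対", "封", "村", "材"]]
dictionary = sorted(["価", "価格", "価値", "価値づけ", "廊下", "晒", "晒し", "版", "版画"])
result = match_words(table, 0, dictionary)
print(result)
```

In this toy run, 価値, 価値づけ, 廊下, 晒し, and 版画 are retrieved but rejected because their second character is not among the candidates at the second position.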

The word list obtained by this word matching is then narrowed down further by the checks performed by the grammar matching unit 13.

FIG. 5 shows an example of the word matching process according to an embodiment of the present invention.

This example shows the word matching process when six kanji characters have been delimited as a word by character-type boundaries or the like, four candidates have been obtained for each character, and these have been input as the candidate character table.

First, all words whose leading character is 「価」, the first-place candidate for character position 1, are retrieved at once. Four words are obtained: 価, 価格, 価値, and 価値づけ. For words whose leading character is the second-place candidate 「廊」, only one word, 廊下, is retrieved. For words whose leading character is the third-place candidate 「晒」, two words are retrieved: 晒 and 晒し.

For words whose leading character is the fourth-place candidate 「版」, six words are retrieved: 版, 版下, 版画, 版権, 版元, and 版数. In total, 13 candidates are retrieved for words whose leading character is a candidate at character position 1.

Similarly, for words whose leading character is one of the candidate characters 「格, 栢, 捲, 椅」 at character position 2, a total of 15 candidates, such as 格 and 格安, are retrieved.

Similarly, for words whose leading character is one of the candidate characters 「対, 封, 村, 材」 at character position 3, 80 candidates, such as 対 and 対ソ, are retrieved.

Similarly, for words whose leading character is one of the candidate characters 「性, 住, 佐, 牲」 at character position 4, 30 candidates, such as 性 and 性格, are retrieved.

Similarly, 6 candidates are retrieved for words whose leading character is a candidate at character position 5, and 40 candidates for words whose leading character is a candidate at character position 6, for a total of 184 candidates.

The above retrieval results are for a general-purpose word dictionary of about 70,000 words.

This is a large reduction compared with the 5460 candidates of the conventional example shown in FIG. 6. Moreover, because all words beginning with a given character are fetched in a single access, only 16 accesses to secondary storage are needed, which is also a large reduction.
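The totals in this example can be checked with a few lines of arithmetic. The per-position counts are the ones stated in the text; the conventional count uses the sum of powers of four for word lengths 1 to 6. The variable names are illustrative only.

```python
# Words retrieved per character position (positions 1 to 6), as stated in the text.
retrieved_per_position = [13, 15, 80, 30, 6, 40]
total_retrieved = sum(retrieved_per_position)

# Conventional method: every combination of 4 candidates for lengths 1..6.
conventional = sum(4 ** length for length in range(1, 7))

# How many times fewer candidate words need to be examined.
reduction_ratio = conventional / total_retrieved
print(total_retrieved, conventional, round(reduction_ratio, 1))
```

Under these figures the proposed method examines roughly one thirtieth as many candidate words as the conventional method.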

In this example, the candidate characters input to the document reader post-processing were limited to the top four ranks; if this limit is relaxed, the difference is expected to widen further.

Next, from all the retrieved words, only those whose characters all appear in the candidate character table are selected. In the figure, rejected words are marked with X and retained words with ○. The ○ words are then connected in order from character position 1 to character position 2, character position 3, and so on. When a word retrieved at character position 1 includes a candidate character of character position 2, it is connected to a word retrieved at character position 3.

In this way, the words underlined in the figure, namely relatively long words that use relatively high-ranking character candidates and connect with a candidate character at the next character position, are finally output as the word list.

[Effect of the invention]

As described above, according to the present invention, the number of dictionary lookups in the word matching unit of the document reader post-processing device is greatly reduced, fast and efficient word matching becomes possible, and the contribution to improving the processing capacity of document reader post-processing is very large.

[Brief description of the drawings]

FIG. 1 is a block diagram of the principle of the present invention; FIG. 2 shows the system configuration of an embodiment of the present invention; FIG. 3 shows the configuration of the document reader post-processing device of an embodiment of the present invention; FIG. 4 is a flowchart showing the processing according to an embodiment of the present invention; FIG. 5 shows an example of the word matching process according to an embodiment of the present invention; and FIG. 6 shows an example of word dictionary lookup according to the conventional method.

In the drawings, 1 is the post-processing device, 2 the character recognition device, 3 the character segmentation device, 4 the line segmentation device, 5 the character block segmentation device, 6 the image scanner, 11 the word matching unit, 111 the dictionary search means, 112 the word selection means, 12 the word dictionary, 13 the grammar matching unit, 14 the grammar dictionary, 15 the post-processing main body, 16 the candidate character string input unit, and 17 the post-processing unit.

Claims

[Claims] In a post-processing device that uses word information, grammar information, context information, related-word information, and the like to uniquely determine, from the candidate character table output by the character recognition device of a document reader, characters that the character recognition device could not uniquely determine, a word matching unit (11), which matches words formed by combining characters in the candidate character table against a word dictionary (12) storing word information such as word notation and part-of-speech information, comprises: a dictionary search means (111) that, for every character in the input candidate character table, retrieves from the word dictionary (12) all words beginning with that character in a single dictionary lookup; and a word selection means (112) that selects, from the retrieved words, only those whose characters all appear in the candidate character table.