JP2011180687A

JP2011180687A - Multilingual document analysis device

Info

Publication number: JP2011180687A
Application number: JP2010042321A
Authority: JP
Inventors: Takashi Hirano; 敬平野; Takashi Mikami; 崇志三上; Takenori Kawamata; 武典川又
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2010-02-26
Filing date: 2010-02-26
Publication date: 2011-09-15

Abstract

PROBLEM TO BE SOLVED: To provide a multilingual document analysis device capable of quickly and accurately determining the language type of characters written in an image in a document.
SOLUTION: The language type of text extracted from an electronic document 101 is determined; the language types to be used for character recognition of the characters written in the image are selected from that determination result; character recognition of the image extracted from the electronic document 101 is performed with the selected language types; and the language type of the characters written in the image is determined from the character recognition result.
COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a multilingual document analysis apparatus that determines the language type of characters written in an image in a document.

There is a demand for full-text search across a group of document files in which multiple languages are mixed. Electronic text in a document can be searched with an existing N-gram search method even if its language is unknown. An image portion of a document can also be searched in full text, provided that its text can be extracted by character recognition processing.
However, recognizing characters in an image requires character recognition processing that matches their language, so the language of the characters in the image must be determined automatically. Conventional language determination for such images falls broadly into two approaches.
The first determines the language type from feature amounts extracted from the image by image processing, as in Patent Document 1 and Patent Document 2.
The second determines the language type from the result of character recognition processing, as in Patent Document 3.

FIG. 7 is a diagram for explaining conventional language type determination processing. The example of FIG. 7 shows a Japanese character string 701 and an English character string 702, in each of which every character is enclosed in a rectangular area 703.
In the invention described in Patent Document 1, the height of the rectangular area 703 enclosing each character is calculated for the character strings 701 and 702 shown in FIG. 7. With N denoting the number of characters whose ratio of rectangular-area height to character-string height exceeds a threshold, and M the number at or below the threshold, the character string is determined to be Japanese when the value of N/M is large and English when it is small.
Patent Document 2 statistically estimates the language type from simple information such as the aspect ratio of the rectangular areas 703 enclosing the characters and the pitch between the rectangular areas of adjacent characters.
In the invention of Patent Document 3, speech in an unknown language is processed by speech recognition engines for a plurality of languages, and the language with the highest resulting score (likelihood) is taken as the determination result. A similar mechanism can be applied to character recognition.
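As an illustration of the height-ratio heuristic attributed to Patent Document 1 above, the following minimal sketch counts tall and short character boxes and compares N/M against a decision value. The function name, both threshold values, and the handling of M = 0 are assumptions made for this example; the source does not give concrete values.

```python
def classify_by_height_ratio(char_heights, line_height,
                             ratio_threshold=0.8, decision_threshold=1.0):
    """Guess 'japanese' or 'english' from per-character bounding-box heights.

    ratio_threshold and decision_threshold are illustrative assumptions,
    not values taken from Patent Document 1.
    """
    if not char_heights:
        raise ValueError("char_heights must not be empty")
    if line_height <= 0:
        raise ValueError("line_height must be positive")
    # N: boxes whose height / string height exceeds the threshold.
    n = sum(1 for h in char_heights if h / line_height > ratio_threshold)
    # M: boxes at or below the threshold.
    m = len(char_heights) - n
    # Assumption: when M is zero, N/M is treated as infinitely large.
    ratio = float("inf") if m == 0 else n / m
    return "japanese" if ratio > decision_threshold else "english"
```

Japanese text tends to fill the line height with nearly every character, whereas English lowercase letters often occupy only part of it, which is why a large N/M suggests Japanese.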

Patent Document 1: Japanese Patent No. 3835652
Patent Document 2: Japanese Patent No. 4079333
Patent Document 3: JP 2004-347732 A

The conventional techniques represented by Patent Documents 1 and 2 determine the language type from feature amounts extracted from the image by simple image processing, so they require no character recognition processing and are fast.
However, because the determination relies on feature amounts obtained by simple image processing, it is difficult to obtain sufficient discrimination accuracy between languages whose feature amounts are similar.

The conventional technique represented by Patent Document 3 performs character recognition processing in a plurality of languages and takes the language with the best resulting score (likelihood) as the determination result. It can therefore achieve higher discrimination accuracy than determining the language type from feature amounts extracted by simple image processing.
However, when discriminating between languages that share many characters (Chinese characters), such as Chinese and Japanese, the scores differ little and the language type becomes difficult to determine.
Furthermore, the heavy character recognition processing must be executed once for each language, so the processing speed is low.

The present invention has been made to solve the above problems, and its object is to provide a multilingual document analysis apparatus that can determine the language type of characters written in an image in a document at high speed and with high accuracy.

A multilingual document analysis apparatus according to the present invention includes: a text extraction unit that extracts text from an electronic document; a text language determination unit that determines the language type of the text extracted by the text extraction unit; a character recognition language selection unit that selects, from the language type determination result for the text by the text language determination unit, the language type to be used for character recognition of characters written in an image; an image extraction unit that extracts an image from the electronic document; a multilingual character recognition processing unit that performs character recognition on the image extracted by the image extraction unit using the language type selected by the character recognition language selection unit; and an image language determination unit that determines the language type of the characters written in the image from the character recognition result of the multilingual character recognition processing unit.

According to the present invention, the language type of the text extracted from the electronic document is determined; the language type to be used for character recognition of the characters written in the image is selected from this determination result; the image extracted from the electronic document is recognized using the selected language type; and the language type of the characters written in the image is determined from the character recognition result. This configuration makes it possible to determine the language type of characters written in an image in a document at high speed and with high accuracy.

FIG. 1 is a block diagram showing the configuration of a multilingual document analysis apparatus according to Embodiment 1 of the present invention.
FIG. 2 is a diagram showing an example of an electronic document.
FIG. 3 is a diagram showing the contents of text extracted from the electronic document of FIG. 2(a).
FIG. 4 is an explanatory diagram for explaining the process of determining the language type of electronic text.
FIG. 5 is a diagram showing a list of candidate languages for character recognition processing.
FIG. 6 is an example of an electronic document in which a plurality of languages are mixed.
FIG. 7 is a diagram for explaining conventional language type determination processing.

Embodiment 1.
FIG. 1 is a block diagram showing the configuration of a multilingual document analysis apparatus according to Embodiment 1 of the present invention. In FIG. 1, the multilingual document analysis apparatus according to Embodiment 1 includes a text extraction unit 102, a text language determination unit 103, a character recognition language selection unit 104, an image extraction unit 105, a multilingual character recognition processing unit 106, an image language determination unit 107, a storage unit 108 for a text language determination dictionary, and a storage unit 109 for a multilingual character recognition dictionary.

The text extraction unit 102 is a component that accepts input of the electronic document 101 and extracts text-format data from it. The text language determination unit 103 is a component that determines the language type of the text from the contents of the text data extracted by the text extraction unit 102. The character recognition language selection unit 104 is a component that receives the language type determination result from the text language determination unit 103 and selects, from that result, the language type to be used when performing character recognition on characters written in an image.

The image extraction unit 105 is a component that accepts input of the electronic document 101 and extracts the images it contains. The multilingual character recognition processing unit 106 is a component that performs character recognition on the characters written in an image extracted by the image extraction unit 105, using the language type selected by the character recognition language selection unit 104. The image language determination unit 107 is a component that receives the character recognition result from the multilingual character recognition processing unit 106 and determines, from that result, the language type of the characters written in the image extracted by the image extraction unit 105.

The storage unit 108 stores a text language determination dictionary describing the characteristics of each language; the text language determination unit 103 refers to this dictionary when determining the language. The storage unit 109 stores the character recognition dictionaries referred to by the multilingual character recognition processing unit 106 during character recognition processing; a character recognition dictionary is stored for each language that is a candidate for language determination.

The text extraction unit 102, the text language determination unit 103, the character recognition language selection unit 104, the image extraction unit 105, the multilingual character recognition processing unit 106, and the image language determination unit 107 can be realized on a computer, as concrete means in which hardware and software cooperate, by causing the computer to execute a multilingual document analysis program in accordance with the gist of the present invention. The storage units 108 and 109 are constructed in a storage device of that computer, for example a hard disk device or an external storage medium. Alternatively, they may be constructed in a storage device of another computer that can communicate with the multilingual document analysis apparatus over a wired or wireless connection.

Next, the operation will be described.
First, the text extraction unit 102 extracts electronic text from the input electronic document 101. The text extraction process is described below using a specific example.
FIG. 2 is a diagram showing an example of an electronic document, and FIG. 3 is a diagram showing the contents of text extracted from the electronic document of FIG. 2(a). The electronic document 101a shown in FIG. 2(a) contains electronic texts 201 and 202 and an image 203. The image 203 contains the characters “操作パネル” (operation panel), “上” (up), and “下” (down). The electronic document 101b shown in FIG. 2(b) is an electronic document whose entire page consists only of an image. The image 204 in the electronic document 101b also contains characters to be recognized.

The text extraction unit 102 extracts the contents of the electronic texts 201 and 202 from the electronic document 101a shown in FIG. 2(a). As a method of extracting the contents of electronic text from the electronic documents 101a and 101b, the technique described in Reference 1 below can be used, for example. Reference 1 describes a method of obtaining electronic text page by page together with the position of the text within the page. Electronic text extracted in this way is managed together with its page number and text position, as shown in FIG. 3.
(Reference 1)
Hirano, Okano, Okada, Yoda, “Extraction of Structured Content Information from Various Documents Based on Analysis of Page Description Languages”, IEICE Transactions D, Vol. J91-D, No. 5, pp. 1406-1417 (2008)

Next, the text language determination unit 103 estimates the language type of the electronic text extracted by the text extraction unit 102 by comparing it with the profile data described in the text language determination dictionary read from the storage unit 108. The language type is determined separately for each page or for each text position shown in FIG. 3. As a method of estimating the language type from text, the technique described in Reference 2 can be used, for example.
(Reference 2)
William B. Cavnar, John M. Trenkle, “N-Gram-Based Text Categorization”, SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval.

FIG. 4 is an explanatory diagram for explaining the process of determining the language type of electronic text.
In Reference 2, profile data for each language is created from a large amount of text data collected in advance. The profile data stores the character string elements obtained by dividing the text into units of N characters, in descending order of frequency of occurrence.
In the example of FIG. 4, Japanese profile data 402, Chinese profile data 403, and English profile data 404 are created from a large amount of previously collected text data in each language (Japanese, Chinese, English) and stored in the storage unit 108 as the text language determination dictionary. The profile data 402, 403, and 404 store the character string elements obtained by dividing the text into units of two characters, in descending order of frequency of occurrence.

When text whose language type is to be determined is input, character string elements of N characters each are extracted from it in the same way. In FIG. 4, the text language determination unit 103 divides the contents of the electronic text 201, extracted by the text extraction unit 102 from the electronic document 101a shown in FIG. 2(a), into units of two characters to obtain character string elements 401.
The text language determination unit 103 then checks whether each of the extracted character string elements 401 is contained in the profile data 402, 403, and 404 of each language.
For example, the two-character element “Fi” among the character string elements 401 is contained in the English profile data 404. Similarly, the element “操作” (“operation”) among the character string elements 401 is contained in the Japanese profile data 402.
By matching each of the character string elements 401 obtained from the text against the profile data as described above, the text language determination unit 103 calculates, for each language, the proportion of elements contained in that language's profile data. Based on the calculated proportions, the text language determination unit 103 then calculates a score value indicating the reliability of the text language determination for each language, and takes the language with a high score as the determination result.
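The profile matching described above can be sketched as follows. This is a minimal illustration, not the implementation of Reference 2 or of the apparatus: it assumes overlapping two-character elements, skips elements containing whitespace, and uses the raw proportion of matched elements as the score. The function names, the `top_k` cutoff, and the toy corpora in the usage note are all hypothetical.

```python
from collections import Counter


def char_ngrams(text, n=2):
    """Overlapping n-character elements, skipping any that contain whitespace."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    return [g for g in grams if not any(c.isspace() for c in g)]


def build_profile(corpus, n=2, top_k=300):
    """Profile data: the corpus's n-character elements, most frequent first."""
    counts = Counter(char_ngrams(corpus, n))
    return [gram for gram, _ in counts.most_common(top_k)]


def language_scores(text, profiles, n=2):
    """For each language, the proportion of the text's elements found in its profile."""
    grams = char_ngrams(text, n)
    if not grams:
        # No text at all (e.g. an image-only page): every score is zero.
        return {lang: 0.0 for lang in profiles}
    scores = {}
    for lang, profile in profiles.items():
        members = set(profile)
        scores[lang] = sum(g in members for g in grams) / len(grams)
    return scores
```

With toy profiles such as `{"ja": build_profile("操作パネルの操作方法"), "en": build_profile("the figure shows the file filter")}`, the element “操作” is found only in the Japanese profile. Real profile data would be built from a large corpus per language, as the description states.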

Next, the character recognition language selection unit 104 receives the score values from the text language determination unit 103 and, based on them, selects from all available languages the languages to be used in the subsequent character recognition processing. Specifically, each language whose score value is higher than a predetermined threshold becomes a candidate language for character recognition processing. Once the text language determination unit 103 has narrowed down the languages in this way, the character recognition processing no longer needs to be repeated for every available language, and the processing time is shortened.
If no language has a score value exceeding the predetermined threshold, all available languages become candidate languages. For example, an electronic document that contains only images, such as the electronic document 101b shown in FIG. 2(b), has no text, so the scores are low and character recognition processing is consequently performed in all languages.
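The selection rule above, including the fallback to all languages, can be sketched in a few lines. The function name and the threshold value of 0.5 are assumptions for illustration; the source only speaks of "a predetermined threshold".

```python
def select_ocr_languages(scores, available, threshold=0.5):
    """Candidate languages for character recognition.

    Returns the available languages whose text score exceeds the threshold,
    or every available language when none does (e.g. an image-only page).
    The default threshold is an illustrative assumption.
    """
    selected = [lang for lang in available if scores.get(lang, 0.0) > threshold]
    return selected if selected else list(available)
```

For example, scores of `{"ja": 0.7, "en": 0.6, "zh": 0.2}` narrow three available languages down to Japanese and English, while an empty score table leaves all three as candidates.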

FIG. 5 is a diagram showing a list of candidate languages for character recognition processing, namely the candidate languages selected by the character recognition language selection unit 104 from the text language determination result shown in FIG. 4. As shown in FIG. 5, candidate languages for character recognition processing can be selected for each page or each text position in the document, so a change of language partway through the document can be handled.

Next, the image extraction unit 105 extracts images from the electronic document 101. This image extraction can also be realized by the method described in Reference 1 above. According to Reference 1, when an image is extracted, the number of the page containing the image and its position within the page can be obtained together with it.

Subsequently, the multilingual character recognition processing unit 106 performs character recognition processing on the image extracted by the image extraction unit 105, using the candidate languages obtained by the character recognition language selection unit 104. The character recognition dictionary for each language is stored in advance in the storage unit 109 as the multilingual character recognition dictionary. From the candidate language data for character recognition processing shown in FIG. 5, the multilingual character recognition processing unit 106 looks up the candidate languages (such as Japanese and English in FIG. 5) that are related to the characters written in the image about to be recognized, and thereby obtains the languages to use for character recognition processing.
Specifically, the image is recognized using the candidate languages of the page that contains the image, or the candidate languages of text located near the image.
For example, as shown in FIG. 5, the candidate languages for character recognition processing on page 1 have been determined to be Japanese and English. The multilingual character recognition processing unit 106 therefore performs character recognition processing twice on the image 203 contained in the first page of the electronic document 101a shown in FIG. 2(a), once with the Japanese character recognition dictionary and once with the English one.
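The per-page lookup and repeated recognition described above might be organized as in the sketch below. It assumes a table mapping page numbers to candidate languages (in the spirit of FIG. 5) and a caller-supplied `recognize(image, lang)` callable; that callable is a hypothetical stand-in for a real character recognition engine, not an API from the source.

```python
def recognize_image(image, page, candidates_by_page, all_languages, recognize):
    """Run character recognition once per candidate language of the image's page.

    candidates_by_page maps a page number to its candidate languages.
    recognize(image, lang) is a hypothetical callable standing in for a
    per-language character recognition engine.
    Pages without candidates fall back to all available languages.
    """
    languages = candidates_by_page.get(page) or list(all_languages)
    return {lang: recognize(image, lang) for lang in languages}
```

With `candidates_by_page = {1: ["ja", "en"]}`, an image on page 1 is recognized twice (Japanese and English) rather than once per available language.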

Finally, the image language determination unit 107 determines the language of the characters written in the image using the character recognition results obtained by the multilingual character recognition processing unit 106. Here, the language l with the highest evaluation value D_l, obtained from the character recognition results using equation (1), is taken as the determination result.
In equation (1), S̄_j (S_j with an overbar) is the average character recognition similarity of the characters belonging to character category j, γ is a weighting coefficient, M_l is the number of character categories of language l, and Z_l is a bias value that equalizes the average similarity across languages. C_l denotes the character categories of language l.
For example, when the determination is made among the three languages Japanese, Chinese, and English, the character categories used are the “Unicode CJK ideograph area”, the “hiragana and katakana area”, and the “alphanumeric and symbol area”.

The method described in Patent Document 3 is equivalent to using only the first and third terms of equation (1). In the present invention, by contrast, the second term of equation (1) captures, within the evaluation value D_l, the heuristic property that the variance of the per-category average similarity increases when characters are recognized in a language different from that of the image. In other words, the image language determination unit 107 aggregates, for each character category, the scores (likelihood, similarity, distance value, and so on) that quantitatively express the character recognition results obtained in a plurality of languages, and uses the variance of the average scores calculated for each character category as a criterion for determining the language type of the image. This makes it possible to determine the language with high accuracy even between Chinese and Japanese, which share the same character codes.
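Equation (1) itself is not reproduced in this text, so the following is only a sketch of one form consistent with the description: the mean of the per-category average similarities, minus γ times their variance, plus the bias Z_l. The exact form, the sign convention, the use of population variance, and all names and default values here are assumptions.

```python
from statistics import mean, pvariance


def evaluation_value(similarities_by_category, gamma=1.0, bias=0.0):
    """Assumed form of D_l: mean of category averages - gamma * variance + bias.

    similarities_by_category maps a character category (e.g. "kanji", "kana")
    to the recognition similarities of the characters assigned to it.
    This is NOT the patent's equation (1), which is missing from the text.
    """
    category_means = [mean(v) for v in similarities_by_category.values() if v]
    if not category_means:
        raise ValueError("no recognized characters")
    return mean(category_means) - gamma * pvariance(category_means) + bias


def pick_language(results, gamma=1.0, biases=None):
    """Language with the highest assumed evaluation value."""
    biases = biases or {}
    return max(results, key=lambda lang: evaluation_value(
        results[lang], gamma, biases.get(lang, 0.0)))
```

Under this assumed form, two languages with the same overall similarity are separated by how evenly that similarity is spread over character categories, which is the effect the description attributes to the second term.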

By performing the above processing on all pages of the document, text can be obtained even from images in the document whose language type is unknown, so the characters in the images also become available for full-text search.

Because each image is recognized and its language type determined using the language determination result of the text related to that image, a change of language partway through the document can be handled.
FIG. 6 is an example of an electronic document in which a plurality of languages are mixed. The electronic document shown in FIG. 6 consists of a first-page document 601 written in Japanese, a second-page document 602 written in Chinese, and a third-page document 603 written in English.
Even for a single document in which the Japanese page document 601, the Chinese page document 602, and the English page document 603 are mixed in this way, the language of each image can be determined correctly from the surrounding text on its page, and correct character recognition results can be extracted.

In the above description, the candidate languages for character recognition of an image are narrowed down using text that is "on the same page" as the image as the related text. However, text within a range of several pages before and after the image may also be treated as related text. Likewise, text located near the image, or text in the same paragraph as the image, may be treated as related text.
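The page-based notions of relatedness above can be expressed with a single window parameter, as in this minimal sketch. The `TextBlock` type, its fields, and the `page_window` parameter are hypothetical names introduced for illustration; proximity within a page or paragraph membership would need position information that is omitted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextBlock:
    """Extracted text together with the page it came from (hypothetical type)."""
    page: int
    text: str


def related_text(blocks, image_page, page_window=0):
    """Text blocks within page_window pages of the image's page.

    page_window=0 means "same page only"; a larger value also accepts text
    from that many pages before and after the image.
    """
    return [b for b in blocks if abs(b.page - image_page) <= page_window]
```

A window of 0 reproduces the same-page rule used in the embodiment, while a window of 1 would also draw on the pages immediately before and after the image.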

As described above, in Embodiment 1 the language type of the text extracted from the electronic document 101 is determined; the language type to be used for character recognition of the characters written in an image is selected from this determination result; the image extracted from the electronic document 101 is recognized using the selected language type; and the language type of the characters written in the image is determined from the character recognition result.
Because the languages used for character recognition of an image are narrowed down based on the score values of the language determination for the text around the image, character recognition processing need not be performed in all languages, and the language type of the image can be determined at high speed.
In addition, the language can be identified correctly even when it changes partway through the document.
Furthermore, the language type of the characters written in the image is not determined from the character recognition scores alone. Instead, the language is determined using an evaluation value that takes into account the heuristic property that the variance of the per-category average similarity increases when characters are recognized in a language different from that of the image. This makes it possible to determine the language type with high accuracy even between Japanese and Chinese, which share the same character codes.

101, 101a, 101b, 601, 602, 603: electronic document; 102: text extraction unit; 103: text language determination unit; 104: character recognition language selection unit; 105: image extraction unit; 106: multilingual character recognition processing unit; 107: image language determination unit; 108, 109: storage unit; 201, 202: text; 203, 204: image; 401: character string element; 402, 403, 404: profile data; 601: Japanese page document; 602: Chinese page document; 603: English page document.

Claims

1. A multilingual document analysis apparatus comprising:
a text extraction unit that extracts text from an electronic document;
a text language determination unit that determines the language type of the text extracted by the text extraction unit;
a character recognition language selection unit that selects, from the language type determination result for the text by the text language determination unit, a language type to be used for character recognition of characters written in an image;
an image extraction unit that extracts an image from the electronic document;
a multilingual character recognition processing unit that performs character recognition on the image extracted by the image extraction unit using the language type selected by the character recognition language selection unit; and
an image language determination unit that determines the language type of the characters written in the image from the character recognition result of the multilingual character recognition processing unit.

2. The multilingual document analysis apparatus according to claim 1, wherein
the text extraction unit extracts the text from the electronic document together with its position information in the document,
the image extraction unit extracts the image from the electronic document together with its position information in the document, and
the character recognition language selection unit identifies, based on the position information, an image related to the position of the text, and selects the language type to be used for character recognition of that image from the language type determination result for the text by the text language determination unit.

3. The multilingual document analysis apparatus as described, wherein the character recognition language selection unit selects the language type to be used for character recognition of an image from the language type determination result for text on the same page of the electronic document as the image.

4. The multilingual document analysis apparatus according to claim 1, wherein the image language determination unit aggregates, for each character category, scores that quantitatively express the character recognition results obtained by the multilingual character recognition processing unit in a plurality of languages, and uses the variance of the average scores calculated for each character category as a criterion for determining the language type of the image.