JP2016122404A

JP2016122404A - Information processing apparatus, information processing method, program, and recording medium

Info

Publication number: JP2016122404A
Application number: JP2014263172A
Authority: JP
Inventors: 暁艶戴; Xiao Yan Dai
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2014-12-25
Filing date: 2014-12-25
Publication date: 2016-07-07
Anticipated expiration: 2034-12-25
Also published as: JP6529254B2

Abstract

PROBLEM TO BE SOLVED: To automatically extract information easily and quickly from a paper document converted into an image.
SOLUTION: An information processing apparatus comprises: first extraction means that extracts a plurality of areas from document data converted into an image; second extraction means that extracts, from the plurality of areas, an area including a first character or word; and third extraction means that extracts information different from the first character or word from the area extracted by the second extraction means.
SELECTED DRAWING: Figure 5

Description

The disclosed technology relates to an information processing apparatus, an information processing method, a program, and a storage medium.

IT adoption is progressing rapidly in the medical field, centered on electronic medical records. Meanwhile, a wide variety of paper-based medical information still exists in hospitals. Paper-based medical information includes, for example, clinical documents such as patient referral documents, explanation and consent forms, documents required for admission and discharge, and diagnostic documents, as well as administrative documents such as order slips, reservation slips, and application forms.

In an environment where paper-based medical information (paper documents) and electronic information such as electronic medical records coexist, it is desirable that paper documents, not just electronic information, can be searched and used quickly.

Therefore, to ensure the readability of paper documents, a commonly practiced workflow is to digitize the paper documents with a scanner, manually register basic information such as type information indicating the type of paper document, clinical department information, and patient number, and link the documents to the electronic system. However, the number of types of paper documents used in a hospital can exceed several thousand, and each hospital has its own formats, so registering the above basic information from paper documents takes an enormous amount of time and effort.

As a way to reduce the labor of registering the basic information contained in a paper document, Patent Document 1 discloses a method in which a barcode is added to the paper document and read by a barcode reader, thereby extracting and registering the basic information contained in the paper document.

Patent Document 2 discloses storing a character string to be extracted from a form (the name of the form's issuer) and recognizing the form by collating this character string against the recognition result of the form.

Patent Document 1: Japanese Patent No. 5356905
Patent Document 2: Japanese Unexamined Patent Application Publication No. 2001-312694

However, the method of Patent Document 1 requires a barcode to be attached to each paper document in advance when digitizing large numbers of medical records and questionnaire sheets, so the manual work is cumbersome and the burden is heavy. Furthermore, the method of Patent Document 2 collates the entire character string to be extracted against the recognition result of the form. If the collation fails, another entire character string, different from the one that failed to match, must be collated against the recognition result in the same way, so recognizing the form takes time.

The disclosed technology has been made in view of these circumstances, and one of its objects is to automatically extract information from paper documents more easily and quickly.

The objects are not limited to the above. Achieving effects that are derived from the configurations shown in the embodiments described later, and that cannot be obtained with conventional techniques, can also be positioned as another object of the present disclosure.

An information processing apparatus according to the disclosed technology includes: first extraction means for extracting a plurality of regions from imaged document data; second extraction means for extracting, from the plurality of regions, a region including a first character or word; and third extraction means for extracting information different from the first character or word from the region extracted by the second extraction means.

According to the disclosed technology, information can be automatically extracted from an imaged paper document easily and quickly.

A diagram showing an example of the configuration of the information processing system according to the first embodiment.
A block diagram showing an example of the functional configuration of the information processing apparatus according to the first embodiment.
A flowchart showing an example of the operation of the information processing apparatus according to the first embodiment.
A flowchart showing an example of the procedure of the candidate region setting process in step S120 of FIG. 3 according to the first embodiment.
A flowchart showing an example of the procedure of the extraction target identification process in step S140 of FIG. 3 according to the first embodiment.
A schematic diagram showing an example of the candidate region setting process in step S120 of FIG. 4 and the extraction target identification process in step S140 of FIG. 6 according to the first embodiment.
A schematic diagram showing an example of the knowledge structure according to the first embodiment.
A flowchart showing an example of the processing procedure of the information processing method performed by the information processing system according to the second embodiment.
A flowchart showing an example of the procedure of the candidate region correction process in step S'230 of FIG. 8 according to the second embodiment.
A schematic diagram showing an example of the candidate region correction process in step S'230 of FIG. 8 according to the second embodiment.
A flowchart showing an example of the processing procedure of the information processing method performed by the information processing system according to the third embodiment.
A flowchart showing an example of the procedure of the candidate region narrowing process in step S240 of FIG. 11 according to the third embodiment.
A schematic diagram showing an example of the candidate region narrowing process in step S240 of FIG. 11 according to the third embodiment.
A flowchart showing an example of the processing procedure of the information processing method performed by the information processing system according to the fourth embodiment.
A flowchart showing an example of the procedure of the clinical department extraction process in step S140 of FIG. 3 according to the fifth embodiment.
A flowchart showing an example of the processing procedure of the information processing method performed by the information processing system according to the sixth embodiment.
A schematic diagram showing an example of the presence or absence of structural characteristics of extraction targets, and of knowledge management for extraction targets, relating to FIG. 16 according to the sixth embodiment.
A flowchart showing an example of the processing procedure of the information processing method performed by the information processing system according to the seventh embodiment.
A schematic diagram showing an example of the information processing of FIG. 17 according to the seventh embodiment.
A flowchart showing an example of a procedure, according to the eighth embodiment, for identifying an extraction target based on knowledge in step S140 of FIG. 3 and supporting the work of confirming whether it is subject to trade restrictions.
A schematic diagram showing an example of the information processing of FIG. 19 according to the eighth embodiment.

Hereinafter, the information processing apparatus according to the present embodiment will be described in detail with reference to the drawings. However, the components described in the embodiments are merely examples. The technical scope of the present invention is determined by the claims and is not limited by the following individual embodiments.

(First embodiment)
First, the first embodiment will be described.

FIG. 1 shows an example of the configuration of the information processing system according to the first embodiment.

As shown in FIG. 1, the information processing system includes a registration unit 1 (information processing apparatus) and a storage unit 2. The registration unit 1 and the storage unit 2 are communicably connected to each other via a wired or wireless network 6. The registration unit 1 and the storage unit 2 are also communicably connected via the network 6 to various systems in the hospital (electronic medical record system 3, ordering system 4, and other systems 5). There may be more than one registration unit 1 and more than one storage unit 2.

The registration unit 1 will now be described in detail. The registration unit 1 is an information processing apparatus such as a PC. It includes a UI device 101, a CPU 102, a RAM 103, a communication IF 104, a UI display unit 105, a program storage area 106, and a data storage area 107.

The UI device 101 includes at least one of a mouse, a digitizer, a keyboard, and the like, and is used by the user to check, correct, and send registration information.

The CPU 102 interprets and executes programs loaded from the program storage area 106 into the RAM 103, and can thereby perform various control and computation within the apparatus and display the UI. For example, by executing a program, the CPU 102 functions as a document image analysis unit 110, a candidate region setting unit 120, a candidate region recognition unit 130, an extraction information identification unit 140, and a registration unit 150, as shown in FIG. 2. The registration unit 1 may include one or more CPUs 102 and one or more RAMs 103. That is, at least one processing device (CPU) and at least one storage device (RAM) are connected, and the registration unit 1 functions as each of the above units when the at least one processing device executes a program stored in the at least one storage device.

The document image analysis unit 110 acquires a document image obtained by digitizing a paper document with a scanner (not shown), and analyzes it. Here, digitization by a scanner can be rephrased as imaging. That is, the document image corresponds to an example of imaged document data. An imaged medical document is referred to as medical document data. The document image analysis unit 110 may acquire the digitized document image directly from the scanner or, if the document image obtained by the scanner is stored in the storage unit 2, may acquire it from the storage unit 2.

The document image analysis unit 110 analyzes the layout of the digitized document image of the paper document and divides it into a plurality of regions, such as character regions and photo regions (region segmentation), to extract the regions. That is, the document image analysis unit 110 corresponds to an example of first extraction means for extracting a plurality of regions from imaged document data.

Through region segmentation, the document image analysis unit 110 acquires, for each segmented region, its coordinates and attribute information indicating whether it is a character region or a photo region. This attribute information can be obtained by various known methods. The means for digitizing the paper document is not limited to a scanner, and other means may be used.
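As a minimal sketch of the per-region information described above (coordinates plus a character/photo attribute), a region could be represented as a small record. The field names, the bounding-box representation, and the sample values are assumptions for illustration; the source does not specify a data layout.

```python
from dataclasses import dataclass
from enum import Enum


class RegionKind(Enum):
    TEXT = "text"    # character region
    PHOTO = "photo"  # photo region


@dataclass(frozen=True)
class Region:
    # Bounding box in image pixel coordinates (assumed representation).
    x: int
    y: int
    width: int
    height: int
    # Attribute information: character region or photo region.
    kind: RegionKind


# Hypothetical segmentation result for one scanned page.
regions = [
    Region(x=40, y=30, width=400, height=60, kind=RegionKind.TEXT),
    Region(x=40, y=120, width=200, height=200, kind=RegionKind.PHOTO),
]
```

Keeping the attribute alongside the coordinates lets later stages select regions by kind without re-analyzing the image.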

The candidate region setting unit 120 sets, from among the regions segmented by the document image analysis unit 110, the candidate regions from which information is to be extracted. Specifically, the candidate region setting unit 120 sets character regions as candidate regions. In other words, it does not treat photo regions among the segmented regions as candidate regions. The processing of the candidate region setting unit 120 may be omitted, in which case the extraction information identification unit 140 identifies the document type and the like using the dictionaries described later without any candidate regions being set. Although the processing of the candidate region setting unit 120 shortens the time required to identify the document type and the like, the effects described above can be obtained even if that processing is omitted.

The candidate region recognition unit 130 acquires character recognition information by recognizing the content of the candidate regions set by the candidate region setting unit 120. The character recognition information is the result of recognizing the content of a candidate region.

The extraction information identification unit 140 identifies an extraction target region from among the candidate regions based on the recognition result of the candidate region recognition unit 130, and identifies basic information from the content of the identified region. Specifically, the extraction information identification unit 140 uses knowledge such as dictionaries created in advance to identify the extraction target region from among the candidate regions. It then uses that knowledge to identify, for example, the document type from the identified region. Details of the knowledge such as dictionaries are described later. The knowledge may be stored in the RAM 103 or in the data storage area 107. It may also be stored in a ROM (not shown) included in the registration unit 1.

The registration unit 150 registers (records) the document image in a predetermined storage means using the information identified by the extraction information identification unit 140. For example, the registration unit 150 associates the paper document type identified by the extraction information identification unit 140 with the document image and registers them as registration information 10 in the data storage area 107 or the like. The registration unit 150 may instead store the registration information 10 in the storage unit 2.

In the above example, the CPU 102 functions as each unit shown in FIG. 2, but this is not a limitation, and an FPGA may provide at least some of these functions. The functions may also be distributed across multiple CPUs. Furthermore, the program storage area 106 may be provided inside or outside the registration unit 1. The program storage area 106 may consist of a single storage device such as a memory, or of multiple storage devices.

The communication IF 104 is connected to the network 6 and serves as the communication interface between the registration unit 1 and both the storage unit 2 and the various servers 3 to 5 in the hospital.

The UI display unit 105 is an LED, a liquid crystal panel, or the like that displays the state of the apparatus, image information, and registered content.

Specifically, the program storage area 106 and the data storage area 107 are a hard disk or flash memory, although they are not limited to any particular storage medium. In the registration unit 1, the registration information 10 is stored in the data storage area 107. The registration information 10 may instead be stored in the storage unit 2. The registration information 10 of the registration unit 1 may also be stored directly in association with a system in the hospital (for example, the electronic medical record system 3).

Assuming the case where the registration information is placed in the storage unit 2, the storage unit 2 will now be described in detail. The storage unit 2 is at least one storage medium such as an HDD or SSD, and stores a binder pool 20. The binder pool 20 includes binders 201 and 202, and each binder contains medical documents. That is, the storage unit 2 manages medical documents in units called binders. The binder pool 20 may be stored in association with a system in the hospital (for example, the electronic medical record system 3). Within the binder pool 20, registered materials are stored binder by binder according to a predetermined rule so that the information is easy to use. For example, materials of all types may be stored per patient, or materials may be stored per type. For example, based on the paper document type identified by the extraction information identification unit 140, the registration unit 150 can store the registration information, including the document image, in a binder for each type.
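The two grouping rules mentioned above (one binder per patient, or one binder per document type) can be sketched as a single grouping function. The record fields `patient_id`, `doc_type`, and `image`, the function name, and the sample values are hypothetical; the source does not define a record format for the binder pool.

```python
from collections import defaultdict


def group_into_binders(records, key):
    """Group registered document images into binders by the given field."""
    binders = defaultdict(list)
    for record in records:
        binders[record[key]].append(record["image"])
    return dict(binders)


# Hypothetical registration records.
records = [
    {"patient_id": "P001", "doc_type": "questionnaire", "image": "scan-001.png"},
    {"patient_id": "P001", "doc_type": "consent", "image": "scan-002.png"},
    {"patient_id": "P002", "doc_type": "questionnaire", "image": "scan-003.png"},
]

by_patient = group_into_binders(records, "patient_id")  # one binder per patient
by_type = group_into_binders(records, "doc_type")       # one binder per document type
```

Passing the grouping field as a parameter keeps the "predetermined rule" configurable without changing how records are stored.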

With the above configuration, the registration information can be referenced throughout the information processing system.

The network 6 may be an intranet operated within a hospital or organization, or may be the Internet.

The electronic medical record system and the ordering system are widely used, well-known apparatuses, so descriptions of example hardware configurations and operation flows are omitted.

Next, an example of the processing procedure of the information processing method performed by the information processing system according to the present embodiment will be described.

FIG. 3 is a flowchart showing an example of the processing procedure of the information processing method performed by the information processing apparatus according to the first embodiment.

First, in step S110, the document image analysis unit 110 acquires a document image obtained by digitizing a paper document with a scanner (not shown). The document image analysis unit 110 then analyzes the layout of the digitized document image and divides it into character regions and photo regions (region segmentation). As the region segmentation method, a known method such as the one disclosed in Japanese Patent Laid-Open No. 2002-314806 can be used.

Next, in step S120, the candidate region setting unit 120 sets, from the analysis result of the document image, the regions that are candidates for extraction. Details of this processing are described later.

Next, in step S130, the candidate region recognition unit 130 recognizes the character strings in the candidate regions and records recognition information. The recognition information includes the recognition result of the character string and the number of characters and, for a paragraph, the number of lines. Known character recognition techniques can be used for the recognition processing.

Next, in step S140, the extraction information identification unit 140 of the information processing apparatus identifies the extraction target region based on the recognition results of the candidate regions and on knowledge information, and identifies basic information from the extraction target region. The registration unit 150 of the information processing apparatus then registers the document image using the identified information. Details of this processing are described later.
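The overall flow of steps S110 to S140 can be sketched as a thin orchestration function. This is only an illustration under stated assumptions: the function name and every callable passed in are hypothetical stand-ins for the units 110 to 150, and the sample values carry no meaning beyond exercising the call order.

```python
def process_document(image, analyze_layout, set_candidates, recognize, identify, register):
    """Run steps S110 to S140 in order; every callable is a hypothetical stand-in."""
    regions = analyze_layout(image)                          # S110: layout analysis and segmentation
    candidates = set_candidates(regions)                     # S120: set candidate regions
    recognized = [recognize(image, c) for c in candidates]   # S130: character recognition
    basic_info = identify(recognized)                        # S140: identify basic information
    register(image, basic_info)                              # S140: register the image with that information
    return basic_info


# Usage with trivial stand-ins (illustrative values only).
registry = []
result = process_document(
    image="scan-001",
    analyze_layout=lambda img: [("text", "header"), ("photo", None)],
    set_candidates=lambda regions: [r for r in regions if r[0] == "text"],
    recognize=lambda img, region: region[1],
    identify=lambda texts: {"type": texts[0]} if texts else None,
    register=lambda img, info: registry.append((img, info)),
)
```

Injecting each stage as a callable mirrors the text's note that known techniques can be substituted for segmentation and character recognition.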

Next, the candidate region setting process in step S120 will be described.

FIG. 4 is a flowchart showing an example of the procedure of the candidate region setting process in step S120 of FIG. 3 according to the first embodiment.

First, in step S1201, the document image analysis unit 110 inputs to the candidate region setting unit 120 the region information acquired by its document image analysis, namely coordinate information indicating the position of each region and attribute information indicating whether each region is a character region or a photo region.

Next, in step S1202, the candidate region setting unit 120 determines, based on the attribute information, whether a region acquired by the document image analysis unit 110 is a character region. If it is, in step S1203 the candidate region setting unit 120 sets that character region as a candidate region.

Next, in step S1204, the candidate region setting unit 120 determines whether any unprocessed region remains. If so, the process returns to step S1202 and the processing from step S1202 to step S1204 is repeated. If not, the candidate region setting process ends.
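The loop of steps S1201 to S1204 can be sketched as follows. The dictionary-based region format, the key names, and the sequential numbering of candidates are assumptions made for illustration, not a format given in the source.

```python
def set_candidate_regions(regions):
    """Keep only character regions as candidates (steps S1201 to S1204)."""
    candidates = []
    for region in regions:                 # S1204: repeat while unprocessed regions remain
        if region["kind"] == "text":       # S1202: is this a character region?
            # S1203: set it as a candidate, numbered in order of appearance.
            candidates.append({"number": len(candidates) + 1, "bbox": region["bbox"]})
    return candidates


# Hypothetical region information from layout analysis (S1201 input).
regions = [
    {"kind": "text", "bbox": (40, 30, 400, 60)},
    {"kind": "photo", "bbox": (40, 120, 200, 200)},
    {"kind": "text", "bbox": (40, 340, 400, 80)},
]
candidates = set_candidate_regions(regions)
```

Photo regions are simply skipped, so character recognition in the next step only runs on regions that can contain text.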

Next, the extraction target identification process in step S140 will be described.

FIG. 5 is a flowchart showing an example of the procedure of the extraction target identification process in step S140 of FIG. 3 according to the first embodiment.

First, in step S1401, the candidate region setting unit 120 and the candidate region recognition unit 130 input candidate region information to the extraction information identification unit 140. The candidate region information includes the coordinate information of the candidate regions obtained by the candidate region setting unit 120 and the character recognition information obtained by the candidate region recognition unit 130.

Next, in steps S1402 to S1407, the extraction information identification unit 140 identifies the extraction target region based on the character recognition information of the candidate regions and the knowledge information, and identifies the content of the extraction target region. This part is described in detail below.

First, in step S1402, the extraction information identification unit 140 determines whether the candidate region being processed contains a word ending that is in the ending dictionary (see reference numeral 604 in FIG. 6).

If the candidate region contains such an ending, in step S1403 the extraction information identification unit 140 identifies that candidate region as the extraction region. That is, the extraction information identification unit 140 corresponds to an example of second extraction means for extracting, from the plurality of regions, a region including the first character. An ending in the ending dictionary corresponds to an example of the first character or word. More specifically, the first character is the ending of a word made up of multiple characters. Although each ending in the ending dictionary is a single character here, this is not a limitation, and an ending may consist of multiple characters.

Then, in step S1404, the extraction information identification unit 140 extracts from the extraction region a term that is in the term dictionary (see reference numeral 605 in FIG. 6). A term in the term dictionary corresponds to an example of information different from the first character. That is, the extraction information identification unit 140 corresponds to an example of third extraction means for extracting, from the region extracted by the second extraction means, information different from the first character.

Then, in step S1405, the document type is identified from the extracted term based on the relationship between the term dictionary and the classification dictionary (see reference numeral 606 in FIG. 6), and the extraction target identification process ends. That is, the extraction information identification unit 140 corresponds to an example of classification means for classifying document data using the information extracted by the third extraction means.

If the candidate region contains no ending from the ending dictionary, in step S1406 the extraction information identification unit 140 determines whether any unprocessed candidate region remains. If so, the processing from step S1402 to step S1405 is repeated. If not, the extraction information identification unit 140 judges that no candidate region corresponds to a type and determines that the document has no type.
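The two-stage lookup of steps S1402 to S1406 (first a cheap check for a dictionary ending, then a term lookup only inside the matching region) can be sketched as below. Several points are assumptions for illustration: the function name, the plain substring test for "contains an ending", the ending 「票」, the classification label `"questionnaire"`, and the choice to move on to the next region when an ending is found but no dictionary term is present (the source does not describe that case).

```python
from typing import Dict, Iterable, Optional


def identify_document_type(
    candidate_texts: Iterable[str],
    endings: Iterable[str],
    term_to_type: Dict[str, str],
) -> Optional[str]:
    """Return the document type, or None when no candidate region yields one."""
    endings = list(endings)
    for text in candidate_texts:                        # S1406: next unprocessed candidate
        if not any(e in text for e in endings):         # S1402: does it contain an ending?
            continue
        # S1403: this candidate is treated as the extraction region.
        for term, doc_type in term_to_type.items():     # S1404: look for a dictionary term
            if term in text:
                return doc_type                         # S1405: classify via the term
        # Assumption: no term found here, so keep scanning the remaining candidates.
    return None                                         # no type


# Hypothetical dictionaries and recognized text for three candidate regions.
endings = ["票"]
term_to_type = {"問診": "questionnaire", "質問": "questionnaire"}
texts = ["患者番号 12345", "内科", "問診票"]

doc_type = identify_document_type(texts, endings, term_to_type)
no_type = identify_document_type(["診断書"], endings, term_to_type)
```

Because most regions fail the short ending test, the term lookup runs on few regions, which is the speed advantage the text contrasts with collating whole character strings.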

次に、本実施形態における抽出対象の同定処理の一例について辞書の内容を示しながらより詳細に説明する。 Next, an example of extraction target identification processing in the present embodiment will be described in more detail while showing the contents of the dictionary.

図６は、第１の実施形態に係るステップＳ１２０における候補領域の設定処理と、図５のステップＳ１４０における抽出対象の同定処理の一例を示す模式図である。 FIG. 6 is a schematic diagram illustrating an example of candidate area setting processing in step S120 and extraction target identification processing in step S140 of FIG. 5 according to the first embodiment.

６０１は、ある文書画像に対する文書画像解析部１１０による解析の結果例である。文書画像は、枠に囲まれる領域毎に分割され、また、領域毎に文字領域か写真領域、或いは、その他の属性が付与される。 Reference numeral 601 denotes an example of a result of analysis by the document image analysis unit 110 for a certain document image. The document image is divided into regions surrounded by a frame, and a character region, a photographic region, or other attributes are given to each region.

６０２は、文書画像の解析結果から候補領域設定部１２０によって得られた候補領域の設定結果例である。各候補領域は順番に領域番号、そして、座標情報が記録される。 Reference numeral 602 denotes an example of a candidate area setting result obtained by the candidate area setting unit 120 from the analysis result of the document image. Each candidate area is sequentially recorded with an area number and coordinate information.

６０３は、候補領域から抽出対象の同定処理の結果である。 Reference numeral 603 denotes a result of identification processing of an extraction target from a candidate area.

本実施形態においては抽出対象の同定処理に用いる語尾辞書６０４、用語辞書６０５および分類辞書６０６が不図示のＲＯＭに記憶されている。語尾辞書６０４は、種別に含まれる共通の語尾を記録する。用語辞書６０５は種別に含まれる用語を記録する。例えば、用語辞書６０５は「問診」および「質問」という用語を含む。すなわち、用語辞書６０５は互いに異なる第１の参照用の文字と第２の参照用の文字とを含んでおり、用語辞書６０５を保持する不図示のＲＯＭは保持手段の一例に相当する。分類辞書６０６は種別に関わる分類を記録する。なお、上記の辞書はＲＯＭ以外の記憶手段（プログラム記憶領域１０６、データ記憶領域１０７、格納部２など）に記憶されることとしてもよい。この場合、記憶手段が保持手段の一例に相当する。 In this embodiment, the ending dictionary 604, the term dictionary 605, and the classification dictionary 606 used for the extraction target identification process are stored in a ROM (not shown). The ending dictionary 604 records endings that are common to document type names. The term dictionary 605 records terms that appear in document type names. For example, the term dictionary 605 includes the terms “問診” (medical interview) and “質問” (question). That is, the term dictionary 605 includes a first reference character and a second reference character that are different from each other, and the ROM (not shown) that holds the term dictionary 605 corresponds to an example of a holding unit. The classification dictionary 606 records the classifications associated with the types. The dictionaries may instead be stored in storage means other than the ROM (the program storage area 106, the data storage area 107, the storage unit 2, etc.). In this case, that storage means corresponds to an example of the holding unit.

候補領域の順番で処理する。候補領域認識部１３０により得られた候補領域０１の文字認識情報には語尾辞書６０４にある「書」という語尾が含まれるため、抽出情報同定部１４０は当該候補領域を抽出対象領域として同定する。 The candidate areas are processed in order. Since the character recognition information of candidate area 01 obtained by the candidate area recognition unit 130 includes the ending “書” (document), which is in the ending dictionary 604, the extraction information identification unit 140 identifies this candidate area as the extraction target area.

また、抽出情報同定部１４０は当該抽出対象領域には用語辞書６０５にある「説明」という用語が含まれると判断する。具体的には、抽出情報同定部１４０は用語辞書６０５に含まれる用語と抽出対象領域に含まれる文字とを比較し、比較結果が一致する場合には用語辞書６０５に含まれる用語が抽出対象領域から抽出されたと判断する。本実施例では抽出情報同定部１４０は「問診」という用語を抽出対象領域に含まれる文字と比較し、一致しない場合には用語辞書６０５の次の用語と抽出対象領域に含まれる文字との比較を行う。すなわち、第３抽出手段の一例である抽出情報同定部１４０は、第２抽出手段によって抽出された領域に含まれる文字と第１の参照用の文字とを比較し、比較結果が一致する場合には第１の参照用の文字に一致する文字を情報として抽出し、比較結果が一致しない場合には第２抽出手段によって抽出された領域に含まれる文字と第２の参照用の文字とを比較する。 Further, the extraction information identification unit 140 determines that the extraction target area contains the term “説明” (explanation), which is in the term dictionary 605. Specifically, the extraction information identification unit 140 compares each term in the term dictionary 605 with the characters in the extraction target area and, when they match, determines that the term has been extracted from the extraction target area. In this example, the extraction information identification unit 140 first compares the term “問診” (medical interview) with the characters in the extraction target area and, if they do not match, compares the next term in the term dictionary 605 with those characters. That is, the extraction information identification unit 140, which is an example of the third extracting unit, compares the characters included in the region extracted by the second extracting unit with the first reference character; if they match, it extracts the matching characters as the information, and if they do not match, it compares the characters included in that region with the second reference character.

抽出情報同定部１４０は用語辞書６０５から、「説明」という用語は「０２」という「種別番号」と対応付けられると判断する。したがって、抽出情報同定部１４０は、分類辞書６０６に「０２」と対応する「説明・同意書」という種別が抽出対象（紙文書）の文書種別であると決定する。そして、登録部１５０は「説明・同意書」という種別を文書画像と対応付けてデータ記憶領域１０７または格納部２に記録する。 The extraction information identification unit 140 determines from the term dictionary 605 that the term “説明” (explanation) is associated with the type number “02”. It therefore determines that the type “説明・同意書” (explanation/consent form), which corresponds to “02” in the classification dictionary 606, is the document type of the extraction target (paper document). The registration unit 150 then records the type “説明・同意書” in the data storage area 107 or the storage unit 2 in association with the document image.
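The lookup performed in steps S1402 to S1406 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: only the entries “書”, “問診”, “質問”, “説明”, the type number “02”, and the type “説明・同意書” come from the text; the remaining dictionary entries, the function name `identify_type`, and the choice to move on to the next candidate area when no term matches are assumptions made for the example.

```python
from typing import List, Optional

# Ending dictionary 604: endings shared by document type names.
ENDING_DICT = ["書"]

# Term dictionary 605: term -> type number. Only "説明" -> "02" is stated
# in the text; the other type numbers are assumed for illustration.
TERM_DICT = {"問診": "01", "質問": "01", "説明": "02"}

# Classification dictionary 606: type number -> document type.
# The "01" entry is an assumed example.
CLASS_DICT = {"01": "問診票", "02": "説明・同意書"}


def identify_type(candidate_texts: List[str]) -> Optional[str]:
    """Return the document type, or None when no candidate area matches."""
    for text in candidate_texts:                 # candidate areas, in order
        # S1402: does the recognized text contain a registered ending?
        if not any(ending in text for ending in ENDING_DICT):
            continue                             # S1406: try the next area
        # S1404: compare the dictionary terms one after another.
        for term, type_no in TERM_DICT.items():
            if term in text:
                return CLASS_DICT[type_no]       # S1405: term -> type
    return None                                  # corresponds to "no type"
```

For instance, `identify_type(["患者氏名", "手術説明書"])` skips the first area because it has no registered ending, then finds “説明” in the second area and returns the type registered for “02”.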

上述の如く本実施形態は、文書画像における各領域の属性情報に基づき抽出対象の候補領域を設定し、候補領域の文字認識情報及び知識情報に基づき候補領域から抽出対象領域を同定し、紙文書の種別を取得するものである。しかしながら、本発明は上記の実施形態に限定されるものではなく、例えば医用文書（紙文書）から診療科情報や、患者情報（患者ＩＤ等の患者識別情報）などを抽出する場合は、抽出対象に応じて知識情報を置き換えればよい。患者ＩＤは例えば数字である。 As described above, the present embodiment sets candidate areas for extraction based on the attribute information of each area in the document image, identifies the extraction target area among the candidate areas based on their character recognition information and on knowledge information, and thereby obtains the type of the paper document. However, the present invention is not limited to the above embodiment. For example, to extract clinical department information or patient information (patient identification information such as a patient ID) from a medical document (paper document), the knowledge information may be replaced according to the extraction target. The patient ID is, for example, a number.

例えば、診療科情報抽出の場合、種別抽出用の語尾辞書を「科」などを含む診療科辞書にすればよい。さらに、用語辞書は「小児」、「皮膚」などの文言を含む辞書に変更すればよい。分類辞書は必須の構成ではないが、使用する場合には分類辞書についても同様に診療科で分類を行うよう種別を「小児科」、「皮膚科」などに変更すればよい。また、本実施形態では、知識を辞書という言葉で記述したが、辞書以外の呼び方をされるものであってもよい。なお、患者情報（患者ＩＤ等）などを抽出する場合には、種別抽出用の語尾辞書を「ＩＤ」、「番号」などを含む辞書にすればよい。この場合、「ＩＤ」等の文字は領域内の末尾ではなく先頭に存在する場合が多いが、本実施形態においては説明を簡単にするために語尾辞書という文言を用いている。なお、患者情報（患者ＩＤ等）などを抽出する場合には分類を行う必要がないため用語辞書等は不要である。なお診療科情報および患者情報（患者ＩＤ等）の抽出方法の詳細については後述の第５の実施形態で述べる。 For example, to extract clinical department information, the ending dictionary for type extraction may be replaced with a clinical department dictionary that includes endings such as “科” (department). The term dictionary may likewise be changed to one containing words such as “小児” (pediatric) and “皮膚” (skin). The classification dictionary is not essential, but when it is used, its types may be changed to “小児科” (pediatrics), “皮膚科” (dermatology), and so on, so that classification is performed by clinical department. In the present embodiment, the knowledge is described as dictionaries, but it may be called by other names. To extract patient information (patient ID, etc.), the ending dictionary for type extraction may be replaced with a dictionary that includes “ID”, “番号” (number), and the like. In this case, characters such as “ID” often appear at the beginning of an area rather than at its end, but the present embodiment keeps the term “ending dictionary” for simplicity of explanation. Since no classification is needed when extracting patient information (patient ID, etc.), the term dictionary and the like are unnecessary. Details of the method for extracting clinical department information and patient information (patient ID, etc.) are given in the fifth embodiment described later.

また、本実施形態では、医用文書の種別抽出に、文書画像を管理しやすいために種別を図６に示す分類に分けたが、これに限定されるものではなくより細かく分類することとしてもよいし、より粗く分類することとしてもよい。なお各辞書に含まれる言葉や言葉の数も図６記載の内容に限定されるものではなく任意に変更可能である。 In the present embodiment, document types are divided into the classifications shown in FIG. 6 so that document images are easy to manage when extracting the type of a medical document. However, the classification is not limited to this and may be made finer or coarser. The words in each dictionary, and their number, are likewise not limited to the contents shown in FIG. 6 and may be changed freely.

また、本実施形態では、種別抽出用の語尾辞書、用語辞書、分類辞書を例にしたが、辞書の名称は図６記載の名称以外であってもよいし、辞書の構成を図６とは異なる構成にしてもよい。例えば、図７に示すように、用語辞書に用語及び用語と種別の関連付けのみならず、語尾との関連付けも持つようにしてもよい。この場合、語尾が見つかれば、それと組み合わせ可能な用語が含まれるかどうかのみをチェックし用語を抽出すればよい。例えば、ステップＳ１４０２では、「書」という「１０１」番号の語尾が見つかった場合、ステップＳ１４０４では、当該領域から用語辞書に含まれる用語すべてを抽出する代わりに、「１０１」番号の語尾「書」と組み合わせることが可能な用語のみを抽出する。即ち、「問診」、「説明」等だけを抽出すれば良く（「質問」を抽出しようとする必要はない）、処理の高速化を図ることが可能となる。また、図６の例に示す６０１、６０２、６０３をまとめて辞書として持っていてもよい。すなわち、辞書の形態は上記の例に限定されるものではなく他の形態とすることとしてもよい。 In the present embodiment, the ending dictionary, term dictionary, and classification dictionary for type extraction are used as examples, but the dictionaries may have names other than those in FIG. 6, and their structure may differ from that of FIG. 6. For example, as shown in FIG. 7, the term dictionary may hold not only the terms and their association with types but also their association with endings. In this case, once an ending is found, it suffices to check only for terms that can be combined with that ending. For example, when the ending “書”, numbered “101”, is found in step S1402, step S1404 extracts from the area only the terms that can be combined with ending “101”, instead of trying every term in the term dictionary. That is, only “問診” (medical interview), “説明” (explanation), and the like need to be looked for (there is no need to look for “質問” (question)), which speeds up the processing. Items 601, 602, and 603 shown in the example of FIG. 6 may also be held together as a dictionary. In short, the form of the dictionary is not limited to the above examples and may take other forms.
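The FIG. 7 variant, in which each term also records the endings it can combine with, might be sketched as follows. FIG. 7 itself is not reproduced here, so apart from the ending “書” with number “101” and the statement that “問診” and “説明” (but not “質問”) combine with it, every entry below is an assumption for illustration, as is the function name `terms_for_ending`.

```python
# Ending -> ending number. "書" -> "101" follows the example in the text;
# the "票" -> "102" entry is assumed.
ENDING_IDS = {"書": "101", "票": "102"}

# Term -> (type number, ending numbers the term may combine with).
# Type numbers and the "102" pairings are assumed for illustration.
TERM_DICT = {
    "問診": ("01", {"101", "102"}),
    "説明": ("02", {"101"}),
    "質問": ("01", {"102"}),
}


def terms_for_ending(text):
    """Return the terms in `text` that can combine with an ending found in it."""
    # Endings present in the recognized text (step S1402).
    found = {eid for ending, eid in ENDING_IDS.items() if ending in text}
    # Only terms linked to a found ending are compared at all (step S1404),
    # which is where the speed-up comes from.
    return [
        term
        for term, (_type_no, ending_ids) in TERM_DICT.items()
        if ending_ids & found and term in text
    ]
```

With this data, a text ending in “書” is never compared against “質問”, which mirrors the shortcut described above.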

また、本実施形態では、辞書を登録部１の内部に持たせることを例にしたが、登録部１の外部に辞書を持たせることとしてもよい。外部で定義して参照するようにしてもよい。また、本実施形態では、種別に該当する情報を見つからない文書画像において種別なしと出力するが、それ以外の出力、例えば、種別不明としてもよい。 In the present embodiment, the dictionaries are held inside the registration unit 1, but they may be held outside the registration unit 1, that is, defined externally and referenced. Also, in the present embodiment, “no type” is output for a document image in which no information corresponding to a type is found, but another output, for example “type unknown”, may be used.

以上、述べたように第１の実施形態によれば、紙文書から簡単に情報を自動抽出することができる。上記実施形態においてはバーコード等追加の情報を紙文書に付加する必要がないため、従来に比べて手間をかけずに文書種別等の情報を抽出することが可能となる。また、バーコード等の追加の情報を紙文書に付加する必要がないため未知のフォーマットの文書からも簡単に文書種別等の情報を抽出することが可能となる。すなわち、医用文書に人手を介する情報の付与作業が行われなくても、また、医用文書のフォーマットが予め分からなくても、文書種別等の情報を自動的に抽出できる。 As described above, according to the first embodiment, information can be easily automatically extracted from a paper document. In the above embodiment, since it is not necessary to add additional information such as a barcode to a paper document, it is possible to extract information such as the document type without taking time and effort as compared with the conventional case. Further, since it is not necessary to add additional information such as a barcode to a paper document, it is possible to easily extract information such as a document type from a document in an unknown format. That is, information such as the document type can be automatically extracted even if information is not manually assigned to the medical document and the format of the medical document is not known in advance.

また、上記実施形態においては語尾辞書を用いて抽出領域を同定しているため、全ての領域に対して用語辞書と照らし合わせる必要がなく文書種別等の情報を高速で抽出することが可能となる。また、「問診票」など種別そのものを示す言葉を文書画像から抽出する場合には、種別を示す言葉の多さから抽出に多くの時間がかかる虞がある。しかし、本実施形態によれば語尾と用語との組み合わせを用いているため「問診票」などの種別を示す用語を抽出する時間を短縮することが可能である。ここで、医療分野においては診療科および文書の種別は病院毎に様々な呼び名があるため、本実施形態を医療分野に用いることで顕著な効果を得ることができる。 In the above embodiment, since the extraction area is identified using the ending dictionary, it is not necessary to check every area against the term dictionary, and information such as the document type can be extracted at high speed. If words that directly indicate the type, such as “問診票” (medical questionnaire), were extracted from the document image as they are, extraction could take a long time because there are so many such words. According to the present embodiment, however, the combination of an ending and a term is used, which shortens the time needed to extract a term indicating a type such as “問診票”. In the medical field, the names of clinical departments and document types vary from hospital to hospital, so applying the present embodiment to the medical field is particularly effective.

なお、上記の例ではステップＳ１４０５において文書画像の種別を同定しているが、このステップは必須のものではなく、ステップＳ１４０４で処理を終了することとしてもよい。この場合、ステップＳ１４０４で抽出された用語を操作者が参照して分類を行うことができる。 In the above example, the document image type is identified in step S1405. However, this step is not essential, and the process may be terminated in step S1404. In this case, the operator can perform classification by referring to the terms extracted in step S1404.

（第２の実施形態） (Second Embodiment)
次に、本発明の第２の実施形態について説明する。 Next, a second embodiment of the present invention will be described.

上述した第１の実施形態では、文書画像の解析結果から文字領域を抽出対象の候補領域として設定した。第２の実施形態では、文書画像の解析処理によって正しい塊の領域抽出ができていない場合に領域に併合するものである。 In the first embodiment described above, character areas are set as candidate areas for extraction based on the analysis result of the document image. In the second embodiment, when the analysis of the document image has failed to extract an area as one correct block, the fragmented areas are merged.

ここで、第２の実施形態に係る情報処理システムのハードウェア構成および情報処理装置の機能構成は、図１、２と同様であるため、その説明は省略する。 Here, the hardware configuration of the information processing system and the functional configuration of the information processing apparatus according to the second embodiment are the same as those in FIGS. 1 and 2, and their description is therefore omitted.

次に、本実施形態に係る情報処理方法の処理手順の一例について説明する。 Next, an example of a processing procedure of the information processing method according to the present embodiment will be described.

図８は、第２の実施形態に係る情報処理システムによる情報処理方法の処理手順の一例を示すフローチャートである。 FIG. 8 is a flowchart illustrating an example of a processing procedure of an information processing method by the information processing system according to the second embodiment.

まず、ステップＳ’２１０において、文書画像解析部１１０は、図示しないスキャナにより得られた紙文書が電子化された文書画像を取得する。そして、文書画像解析部１１０は、紙文書の電子化された文書画像を解析し、文字領域や写真領域に分割する。本ステップはステップＳ１１０と同様である。 First, in step S′210, the document image analysis unit 110 acquires a document image obtained by digitizing a paper document obtained by a scanner (not shown). Then, the document image analysis unit 110 analyzes the digitized document image of the paper document and divides it into a character area and a photograph area. This step is the same as step S110.

続いて、ステップＳ’２２０において、候補領域設定部１２０は、上記文書画像の解析結果から文字領域を抽出対象の候補となる領域を設定する。具体的な処理はステップＳ１２０と同様である。 Subsequently, in step S′220, the candidate area setting unit 120 sets the character areas found in the analysis result of the document image as candidate areas for extraction. The specific processing is the same as that in step S120.

続いて、ステップＳ’２３０において、候補領域設定部１２０は、上記候補領域を補正する。この処理についての詳細は後述する。 Subsequently, in step S′230, the candidate area setting unit 120 corrects the candidate area. Details of this processing will be described later.

続いて、ステップＳ’２４０において、候補領域認識部１３０上記補正後の候補領域にある文字列を認識し、認識情報を記録する。本ステップはステップＳ１３０と同様である。 Subsequently, in step S′240, the candidate area recognition unit 130 recognizes the character string in the corrected candidate area and records the recognition information. This step is the same as step S130.

続いて、ステップＳ’２５０において、抽出情報同定部１４０は上記補正後の候補領域の認識結果及び、知識情報に基づき抽出対象領域を同定し、抽出対象中身を同定する。本ステップはステップＳ１４０と同様である。 Subsequently, in step S′250, the extraction information identification unit 140 identifies the extraction target region based on the recognition result of the corrected candidate region and the knowledge information, and identifies the extraction target contents. This step is the same as step S140.

次に、ステップＳ’２３０における候補領域の補正処理について説明する。 Next, the candidate area correction processing in step S′230 will be described.

図９は第２の実施形態に係る図８のステップＳ’２３０における候補領域の補正処理の手順の一例を示すフローチャートである。 FIG. 9 is a flowchart showing an example of the procedure of the candidate area correction process in step S′230 of FIG. 8 according to the second embodiment.

先ず、ステップＳ’２３０１において、ステップＳ’２２０で設定された候補領域を入力する。 First, in step S′2301, the candidate area set in step S′220 is input.

続いて、ステップＳ’２３０２からステップＳ’２３０６では、上記候補領域から併合すべき領域を選択し、併合する。 Subsequently, in steps S′2302 to S′2306, areas to be merged are selected from the candidate areas and merged.

ステップＳ’２３０２では、候補領域設定部１２０が処理対象となる二つの候補領域間の間隔は所定の閾値Ｔ１以下であるかどうかを判断する。すなわち、候補領域設定部１２０は隣り合う二つの候補領域間の間隔を閾値Ｔ１と比較する。ここで、閾値Ｔ１は第１の閾値の一例に相当する。 In step S′2302, the candidate area setting unit 120 determines whether the interval between the two candidate areas to be processed is equal to or less than a predetermined threshold T1. That is, the candidate area setting unit 120 compares the interval between two adjacent candidate areas with the threshold T1. Here, the threshold value T1 corresponds to an example of a first threshold value.

候補領域間の間隔は所定の閾値Ｔ１以下あれば、ステップＳ’２３０３では、候補領域設定部１２０が更に処理対象となる二つの候補領域にある文字サイズの差は所定の閾値Ｔ２以下であるかどうかを判断する。ここで、閾値Ｔ２は第２の閾値の一例に相当する。 If the interval between the candidate areas is equal to or less than the predetermined threshold T1, then in step S′2303 the candidate area setting unit 120 further determines whether the difference between the character sizes in the two candidate areas being processed is equal to or less than a predetermined threshold T2. Here, the threshold T2 corresponds to an example of a second threshold.

候補領域にある文字サイズの差は所定の閾値Ｔ２以下であれば、ステップＳ’２３０４へ進む。ステップＳ’２３０４では、候補領域設定部１２０が更に処理対象となる一の候補領域に複数の文字が含まれている場合には、それらの文字間隔の差が所定の閾値Ｔ３以下であるかどうかを判断する。すなわち、一の候補領域に複数の文字が含まれていない場合にはステップＳ’２３０４は実行されないこととしてもよい。ここで、閾値Ｔ３は第３の閾値の一例に相当する。 If the difference between the character sizes in the candidate areas is equal to or less than the predetermined threshold T2, the process proceeds to step S′2304. In step S′2304, when a candidate area being processed contains a plurality of characters, the candidate area setting unit 120 further determines whether the difference in character spacing is equal to or less than a predetermined threshold T3. If a candidate area does not contain a plurality of characters, step S′2304 need not be executed. Here, the threshold T3 corresponds to an example of a third threshold.

候補領域にある文字の間隔の差は所定の閾値Ｔ３以下であれば、ステップＳ’２３０５では、当該二つの候補領域は併合すべき領域と判断し、ステップＳ’２３０６では、候補領域設定部１２０が当該二つの候補領域同士を併合し、候補領域の情報を更新する。すなわち、候補領域設定部１２０は、第１抽出手段により抽出された領域に関する情報に基づいて第１抽出手段により抽出された領域を併合する領域併合手段の一例に相当する。また、本実施例では第２抽出手段の一例に相当する抽出情報同定部１４０は、併合された領域から第１の文字または単語を含む領域を抽出することとなる。 If the difference in character spacing in the candidate areas is equal to or less than the predetermined threshold T3, then in step S′2305 the two candidate areas are judged to be areas that should be merged, and in step S′2306 the candidate area setting unit 120 merges the two candidate areas and updates the candidate area information. That is, the candidate area setting unit 120 corresponds to an example of an area merging unit that merges the areas extracted by the first extracting unit based on information about those areas. In this example, the extraction information identification unit 140, which corresponds to an example of the second extracting unit, extracts an area including the first character or word from the merged areas.

続いて、ステップＳ’２３０７では、候補領域設定部１２０が未比較の領域があるかどうかを判断します。まだ未比較の領域があれば、ステップＳ’２３０２に入り、ステップＳ’２３０２からステップＳ’２３０６までの処理を繰り返して実行するが、未比較の領域がなければ、候補領域の補正処理を終了する。 Subsequently, in step S′2307, the candidate area setting unit 120 determines whether any areas remain uncompared. If so, the process returns to step S′2302 and repeats steps S′2302 to S′2306; if not, the candidate area correction process ends.
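The merge loop of steps S′2302 to S′2307 can be sketched in Python as follows. This is only an illustration under assumptions: the `Region` fields, the way the gap between two boxes is measured, the reading of step S′2304 as a comparison of character spacing between the two areas, and the spacing assigned to a merged area are all choices made for the example and are not specified in the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Region:
    x0: float
    y0: float
    x1: float
    y1: float
    char_size: float
    char_gap: Optional[float] = None  # None when the area holds one character


def gap(a: Region, b: Region) -> float:
    """Separation of two boxes (0 when they touch or overlap); assumed metric."""
    dx = max(a.x0 - b.x1, b.x0 - a.x1, 0.0)
    dy = max(a.y0 - b.y1, b.y0 - a.y1, 0.0)
    return max(dx, dy)


def should_merge(a: Region, b: Region, t1: float, t2: float, t3: float) -> bool:
    if gap(a, b) > t1:                            # S'2302: interval vs. T1
        return False
    if abs(a.char_size - b.char_size) > t2:       # S'2303: size diff vs. T2
        return False
    # S'2304 is skipped when an area does not hold several characters.
    if a.char_gap is not None and b.char_gap is not None:
        if abs(a.char_gap - b.char_gap) > t3:     # spacing diff vs. T3
            return False
    return True                                   # S'2305: should be merged


def merge_pair(a: Region, b: Region) -> Region:
    """S'2306: bounding box of both areas; spacing handling is an assumption."""
    if a.char_gap is not None:
        new_gap = a.char_gap
    elif b.char_gap is not None:
        new_gap = b.char_gap
    else:
        new_gap = gap(a, b)
    return Region(min(a.x0, b.x0), min(a.y0, b.y0),
                  max(a.x1, b.x1), max(a.y1, b.y1),
                  max(a.char_size, b.char_size), new_gap)


def merge_regions(regions: List[Region], t1: float, t2: float, t3: float) -> List[Region]:
    """Repeat the pairwise comparison until no pair can be merged (S'2307)."""
    result = list(regions)
    changed = True
    while changed:
        changed = False
        for i in range(len(result)):
            for j in range(i + 1, len(result)):
                if should_merge(result[i], result[j], t1, t2, t3):
                    result[i] = merge_pair(result[i], result[j])
                    del result[j]
                    changed = True
                    break
            if changed:
                break
    return result
```

Three single-character areas of equal size placed 20 units apart, for example, collapse into one area when `t1` is 25, while an area much further down the page stays separate.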

次に、ステップＳ’２３０における候補領域の補正処理の一例について説明する。 Next, an example of a candidate area correction process in step S′230 will be described.

図１０は、本発明の第２の実施形態を示し、図８のステップＳ’２３０における候補領域の補正処理の一例を示す模式図である。 FIG. 10 is a schematic diagram showing an example of candidate area correction processing in step S′230 in FIG. 8 according to the second embodiment of this invention.

１０００１は、ある文書画像における候補領域の設定結果例である。「同」「意」「書」は離れているため、それぞれ独立な領域として抽出されている。 Reference numeral 10001 denotes an example of a candidate area setting result for a certain document image. Because the characters “同”, “意”, and “書” (which together form “同意書”, consent form) are spaced apart, each has been extracted as an independent area.

１０００２は、種別抽出の場合、１０００１から候補領域の補正処理の結果例である。１０００１の候補領域から領域の間隔が一定範囲Ｔ１以内、しかも、其々の領域にある文字サイズの差が一定範囲Ｔ２以内、其々の領域に複数の文字がある場合の文字列の間隔の差が一定範囲Ｔ３以内の候補領域を分断された領域として併合する。 Reference numeral 10002 denotes an example of the result of correcting the candidate areas of 10001 in the case of type extraction. Among the candidate areas of 10001, those for which the interval between areas is within the fixed range T1, the difference in character size between the areas is within the fixed range T2, and, when the areas contain a plurality of characters, the difference in character spacing is within the fixed range T3 are regarded as fragments of one area and merged.

本実施形態では、抽出対象の特性に基づき候補領域を補正し、意味のある領域にするものである。本実施形態では、候補領域の併合条件として候補領域間の間隔、候補領域にある文字サイズの差、候補領域にある文字列の間隔の差を用いたが、それ以外の条件を設定してもよい。また、候補領域が過統合場合の分割処理を例にしてもよい。なお、上記の実施例では候補領域の併合条件として候補領域間の間隔（すなわち候補領域の位置）、候補領域にある文字サイズの差、候補領域にある文字列の間隔の差の全てを用いたが、少なくとも１つを用いることとしてもよい。すなわち、領域を併合するために用いられる領域に関する情報は、第１抽出手段により抽出された領域の位置、第１抽出手段により抽出された領域に含まれる文字の少なくとも１つを示す情報である。 In the present embodiment, the candidate areas are corrected based on the characteristics of the extraction target so that they become meaningful areas. The merging conditions used here are the interval between candidate areas, the difference in character size between candidate areas, and the difference in character spacing between candidate areas, but other conditions may be set. Division of candidate areas that have been over-merged may also be performed. Although the above example uses all three merging conditions, namely the interval between candidate areas (that is, the positions of the candidate areas), the difference in character size, and the difference in character spacing, at least one of them may be used. That is, the information about the areas used for merging is information indicating at least one of the positions of the areas extracted by the first extracting unit and the characters included in those areas.

第２の実施形態によれば、意味のある領域の抽出ができ、情報抽出処理の精度を向上することが可能になる。 According to the second embodiment, a meaningful area can be extracted, and the accuracy of information extraction processing can be improved.

（第３の実施形態） (Third Embodiment)
次に、本発明の第３の実施形態について説明する。 Next, a third embodiment of the present invention will be described.

上述した第２の実施形態では、文書画像の解析結果により意味のある領域に補正する領域にするものであった。第３の実施形態では、抽出対象の特性に基づき、候補領域を絞るものである。 In the second embodiment described above, the area is corrected to a meaningful area based on the analysis result of the document image. In the third embodiment, the candidate area is narrowed down based on the characteristics of the extraction target.

ここで、第３の実施形態に係る情報処理システムのハードウェア構成は、図１に示す第１の実施形態に係る情報処理システムのハードウェア構成と同様であるため、その説明を省略する。また、第３の実施形態に係る情報処理システムの機能構成は、図２に示す第１の実施形態に係る情報処理システムの機能構成と同様であるため、その説明は省略する。 Here, the hardware configuration of the information processing system according to the third embodiment is the same as that of the first embodiment shown in FIG. 1, and its description is omitted. Likewise, the functional configuration of the information processing system according to the third embodiment is the same as that of the first embodiment shown in FIG. 2, and its description is omitted.

次に、本実施形態に係る情報処理方法の処理手順について説明する。 Next, a processing procedure of the information processing method according to the present embodiment will be described.

図１１は、本発明の第３の実施形態に係る情報処理システムによる情報処理方法の処理手順の一例を示すフローチャートである。 FIG. 11 is a flowchart showing an example of a processing procedure of an information processing method by the information processing system according to the third embodiment of the present invention.

まず、ステップＳ２１０において、文書画像解析部１１０は紙文書の電子化された文書画像を解析し、文字領域や写真領域に分割する。具体的な処理はステップＳ１１０と同様である。 First, in step S210, the document image analysis unit 110 analyzes a digitized document image of a paper document and divides it into a character area and a photograph area. Specific processing is the same as that in step S110.

続いて、ステップＳ２２０において、候補領域設定部１２０は上記文書画像の解析結果から文字領域を抽出対象の候補となる領域を設定する。具体的な処理はステップＳ１２０と同様である。 Subsequently, in step S220, the candidate area setting unit 120 sets the character areas found in the analysis result of the document image as candidate areas for extraction. The specific processing is the same as that in step S120.

続いて、ステップＳ２３０において、候補領域認識部１３０は上記候補領域にある文字列を認識し、認識情報を記録する。具体的な処理はステップＳ１３０と同様である。 Subsequently, in step S230, the candidate area recognition unit 130 recognizes the character string in the candidate area and records the recognition information. The specific process is the same as that in step S130.

続いて、ステップＳ２４０において、抽出情報同定部１４０は抽出対象の特性に基づき、上記候補領域を絞る。この処理の詳細については後述する。 Subsequently, in step S240, the extraction information identification unit 140 narrows down the candidate areas based on the characteristics of the extraction target. Details of this processing will be described later.

続いて、ステップＳ２５０において、抽出情報同定部１４０は上記候補領域の認識結果及び、知識情報に基づき抽出対象領域を同定し、抽出対象中身を同定する。具体的な処理はステップＳ１４０と同様である。 Subsequently, in step S250, the extraction information identification unit 140 identifies the extraction target region based on the recognition result of the candidate region and the knowledge information, and identifies the extraction target contents. The specific process is the same as in step S140.

次に、ステップＳ２４０における候補領域の絞込み処理について説明する。候補領域の絞込み処理は、以下、候補領域のフィルタリング処理とも呼ぶ。 Next, the candidate area narrowing-down process in step S240 will be described. The candidate area narrowing process is also referred to as candidate area filtering process hereinafter.

図１２は、本発明の第３の実施形態を示し、図１１のステップＳ２４０における候補領域の絞込み処理の手順の一例を示すフローチャートである。 FIG. 12 is a flowchart illustrating an example of a procedure for narrowing down candidate areas in step S240 in FIG. 11 according to the third embodiment of this invention.

先ず、ステップＳ２４０１において、候補領域設定部１２０はステップＳ２２０で設定された候補領域を抽出情報同定部１４０に入力する。 First, in step S2401, the candidate area setting unit 120 inputs the candidate area set in step S220 to the extraction information identification unit 140.

続いて、ステップＳ２４０２からステップＳ２４０４では、抽出情報同定部１４０は上記候補領域を絞る。種別抽出の場合、種別領域は文書画像の上から一定範囲以内にある可能性が高いこと及び種別領域は複数段落の文書内に存在する可能性は低いという特性を利用して候補領域の絞込み条件として設定する。ここで、複数段落は２以上の段落でもよいし３以上の段落であってもよい。また、一定範囲内とは例えば文書画像全体の上部１／３の範囲内である。なお、一定範囲は文書画像全体の上部１／２の範囲内であってもよいし他の範囲あってもよい。また、診療科抽出または患者情報抽出の場合には絞り込みの範囲を種別抽出の場合と異なる範囲にしてもよい。すなわち、抽出対象に応じて候補領域の絞りこみ条件を変更することとしてもよい。なお、候補領域を絞るためには上記の２つの条件を使用することとしてもよいし、どちらか一方の条件を使用することとしてもよい。また、上記２つの条件に文書画像の横方向における位置等の他の条件を加えることとしてもよい。 Subsequently, in steps S2402 to S2404, the extraction information identification unit 140 narrows down the candidate areas. For type extraction, the narrowing-down conditions exploit two characteristics: the type area is likely to lie within a certain range from the top of the document image, and it is unlikely to lie within a multi-paragraph body of text. Here, “multi-paragraph” may mean two or more paragraphs or three or more paragraphs. The certain range is, for example, the upper 1/3 of the entire document image; it may instead be the upper 1/2 of the document image or some other range. For clinical department extraction or patient information extraction, the range may differ from that used for type extraction. That is, the narrowing-down conditions may be changed according to the extraction target. Both of the above conditions may be used to narrow down the candidate areas, or only one of them. Other conditions, such as the horizontal position in the document image, may also be added to these two.

ステップＳ２４０２では、抽出情報同定部１４０は処理対象となる候補領域は所定の範囲以内にあるかどうかを判断する。所定の範囲以内にあれば、ステップ２４０３では、抽出情報同定部１４０は更に候補領域の行数は所定の閾値Ｔ以下であるかどうかを判断する。所定の閾値Ｔ以下であれば、ステップ２４０４では、当該候補領域を候補領域として残す。ここで、閾値Ｔは第４の閾値の一例に相当する。 In step S2402, the extraction information identification unit 140 determines whether the candidate area being processed lies within a predetermined range. If it does, then in step S2403 the extraction information identification unit 140 further determines whether the number of lines in the candidate area is equal to or less than a predetermined threshold T. If it is, the candidate area is kept as a candidate area in step S2404. Here, the threshold T corresponds to an example of a fourth threshold.

ステップ２４０５では、所定の範囲以外にある候補領域あるいは候補領域内の文字の行数が所定の閾値Ｔ以上の候補領域を当該領域を候補領域から外す。これは文書画像の種別を示す情報は通常複数行の文書中に存在する可能性が低いことを利用したものである。上述のように、抽出情報同定部１４０は、第２抽出手段の処理対象とする領域を選択する領域選択手段の一例に相当する。 In step S2405, a candidate area that lies outside the predetermined range, or in which the number of text lines is equal to or greater than the predetermined threshold T, is removed from the candidate areas. This exploits the fact that information indicating the type of a document image is unlikely to appear within multi-line text. As described above, the extraction information identification unit 140 corresponds to an example of an area selection unit that selects the areas to be processed by the second extracting unit.

続いて、ステップＳ２４０６では、抽出情報同定部１４０は未処理の領域があるかどうかを判断します。まだ未処理の領域があれば、ステップＳ２４０２に入り、ステップＳ２４０２からステップＳ２４０５までの処理を繰り返して実行するが、未処理の領域がなければ、候補領域のフィルタリング処理を終了する。 Subsequently, in step S2406, the extraction information identification unit 140 determines whether any unprocessed area remains. If so, the process returns to step S2402 and repeats steps S2402 to S2405; if not, the candidate area filtering process ends.
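The filtering of steps S2402 to S2406 can be sketched in Python as follows. This is a minimal sketch under assumptions: the `Candidate` fields, the requirement that the whole area lie inside the upper range, and the default threshold of one line are illustrative choices; the text gives only the upper 1/3 (or 1/2) of the page as example ranges and leaves the value of the threshold T open.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    top: float        # y coordinate of the upper edge (0 = top of the page)
    bottom: float     # y coordinate of the lower edge
    line_count: int   # number of text lines recognized inside the area


def filter_candidates(
    candidates: List[Candidate],
    page_height: float,
    top_fraction: float = 1 / 3,   # example range from the text
    max_lines: int = 1,            # assumed value for the threshold T
) -> List[Candidate]:
    """Keep areas near the top of the page that hold at most `max_lines` lines."""
    limit = page_height * top_fraction
    kept = []
    for cand in candidates:                         # S2406: loop over all areas
        in_range = cand.bottom <= limit             # S2402: position check
        few_lines = cand.line_count <= max_lines    # S2403: line-count check
        if in_range and few_lines:
            kept.append(cand)                       # S2404: keep the area
        # otherwise the area is dropped             # S2405
    return kept
```

Passing a different `top_fraction` or `max_lines` corresponds to changing the narrowing-down conditions according to the extraction target, as the description allows.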

次に、ステップＳ２４０における候補領域の絞込み処理の一例について説明する。 Next, an example of candidate area narrowing processing in step S240 will be described.

図１３は、本発明の第３の実施形態を示し、図１１のステップＳ２４０における候補領域の絞込み処理の一例を示す模式図である。 FIG. 13 is a schematic diagram illustrating an example of candidate area narrowing processing in step S240 of FIG. 11 according to the third embodiment of this invention.

１００１は、ある文書画像における候補領域の設定結果例である。枠に囲まれる領域は、候補領域として設定されるものである。 Reference numeral 1001 denotes an example of a candidate area setting result in a certain document image. The area surrounded by the frame is set as a candidate area.

１００２は、種別抽出の場合、１００１から候補領域のフィルタリングの結果例である。１００１の候補領域から位置が一定範囲以内にある、しかも、複数行ではない枠に囲まれる領域のみが残る。これらの候補領域は同定処理の対象領域になる。 Reference numeral 1002 denotes an example of the result of filtering the candidate areas of 1001 in the case of type extraction. Of the candidate areas in 1001, only the framed areas that lie within the certain range and do not contain multiple lines remain. These candidate areas become the target areas of the identification process.

In the present embodiment, the candidate areas are narrowed down based on the characteristics of the extraction target, and the extraction target is identified from the remaining candidate areas. Taking type extraction as an example, the filtering conditions were set based on the characteristics of type information, but other conditions may be set. When other information is extracted, the filtering conditions may be set according to the characteristics of that information. The present embodiment uses both the position of the candidate area (step S2402) and the number of character lines in the candidate area (step S2403) to narrow down the candidate areas, but it suffices to use at least one of them. According to the third embodiment, in addition to the effects of the first embodiment, the efficiency of the information extraction process can be improved.

(Fourth embodiment)
Next, a fourth embodiment of the present invention will be described.

In the third embodiment described above, candidate areas are set from the analysis result of the document image, the candidate areas are filtered according to the characteristics of the extraction target, and the extraction target is identified from the remaining candidate areas. In the fourth embodiment, the remaining candidate areas are additionally ranked by their likelihood of being the extraction target, and the extraction target is identified by processing them in that order.

Here, the hardware configuration of the information processing system according to the fourth embodiment is the same as that of the information processing system according to the first embodiment shown in FIG. 1, and its description is omitted. Likewise, the functional configuration of the information processing system according to the fourth embodiment is the same as that of the information processing system according to the first embodiment shown in FIG. 1, and its description is omitted.

FIG. 14 is a flowchart showing an example of the processing procedure of the information processing method performed by the information processing system according to the fourth embodiment of the present invention.

First, in step S310, the document image analysis unit 110 analyzes the document image obtained by digitizing a paper document and divides it into areas such as character areas and photograph areas. The specific processing is the same as in step S110.

Subsequently, in step S320, the candidate area setting unit 120 sets the character areas obtained from the analysis result of the document image as candidate areas for the extraction target. The specific processing is the same as in step S120.

Subsequently, in step S330, the candidate area recognition unit 130 recognizes the character strings in the candidate areas and records the recognition information. The specific processing is the same as in step S130.

Subsequently, in step S340, the extraction information identification unit 140 narrows down the candidate areas based on the characteristics of the extraction target. The specific processing is the same as in step S240.

Subsequently, in step S350, the extraction information identification unit 140 calculates, for each candidate area to be processed, its likelihood of being the extraction target and ranks the candidate areas by that likelihood. In other words, a processing priority is assigned to each candidate area, and the extraction information identification unit 140 corresponds to an example of assigning means that assigns a priority to the areas extracted by the first extraction means. The ranking process is described in detail later.

Subsequently, in step S360, based on the recognition results of the candidate areas and the knowledge information, the extraction information identification unit 140 identifies the extraction target area, and then the content of the extraction target, processing the candidate areas in the order of likelihood determined in step S350. The specific processing is the same as in step S140.

Next, the method of calculating the extraction-target likelihood of a candidate area in step S350 will be described.

The type area of a document basically corresponds to the area that looks like the title of the document image. A title is typically located near the top of the document, has a large character size, and is placed close to the center line. However, because medical documents come in a wide variety of formats, the type area does not always have these characteristics. These characteristics are therefore combined, and the type-likeness of a candidate area is obtained comprehensively by the following expression (Equation 1).
Type-likeness = w1 * {character size} + w2 * {reciprocal of the proximity to the center line} + w3 * {reciprocal of the number of areas in the upper part}
Here, w1, w2, and w3 are the weights of the respective elements; an element regarded as important is given a larger weight. The "upper part" means, for example, the upper third of the entire document image, but is not limited to this. The type-likeness value of Equation 1 may also be computed using at least one of the three terms. The positions of the candidate areas are used to obtain the number of areas in the upper part. That is, the extraction information identification unit 140, which is an example of the assigning means, assigns a priority based on at least one of the position of an area extracted by the first extraction means and the size of the characters contained in that area.

Although the type-likeness value of Equation 1 is obtained from three terms, it may also be calculated using four or more terms. The extraction information identification unit 140 then processes the areas, for example, in descending order of the type-likeness value.
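As an illustration of Equation 1 and the resulting processing order, the following sketch scores each candidate area and sorts the areas in descending order of type-likeness. Several points are assumptions made only for this example: the "proximity to the center line" is treated as a horizontal distance in pixels, both reciprocals are smoothed as 1 / (1 + x) so that a value of zero does not cause a division by zero, and the field names and default weights are hypothetical.

```python
def type_likeness(char_size, center_distance, upper_area_count,
                  w1=1.0, w2=1.0, w3=1.0):
    """Weighted sum of the three terms of Equation 1.

    center_distance: distance of the area from the page center line
        (assumed interpretation of "proximity to the center line").
    upper_area_count: the "number of areas in the upper part" for this
        candidate; how it is counted is left to the caller.
    Reciprocals use 1 / (1 + x) (assumed smoothing, not in the source).
    """
    return (w1 * char_size
            + w2 * (1.0 / (1.0 + center_distance))
            + w3 * (1.0 / (1.0 + upper_area_count)))


def rank_candidates(candidates, **weights):
    """Return the candidates sorted by descending type-likeness."""
    return sorted(
        candidates,
        key=lambda c: type_likeness(c["char_size"], c["center_distance"],
                                    c["upper_area_count"], **weights),
        reverse=True,
    )


candidates = [
    {"name": "body", "char_size": 10, "center_distance": 50, "upper_area_count": 3},
    {"name": "title", "char_size": 24, "center_distance": 0, "upper_area_count": 0},
]
ranked = rank_candidates(candidates)
```

Raising one weight relative to the others makes the corresponding characteristic dominate the ordering, which matches the remark that important elements receive larger weights.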

In the present embodiment, the extraction-target likelihood (priority) of each candidate area is calculated, and the extraction target is identified in order of that likelihood. Taking type extraction as an example, the character size, the position of the area, and the number of areas were used as the elements contributing to the likelihood, but other characteristics may be used.

According to the fourth embodiment, in addition to the effects of the first and third embodiments, processing can start from the candidate areas most likely to be the extraction target area, which further improves the efficiency of the extraction process.

(Fifth embodiment)
Next, a fifth embodiment will be described.

The first, second, third, and fourth embodiments described above mainly dealt with extracting type information from a medical document. The fifth embodiment extracts clinical department information or patient information from a medical document.

Here, the hardware configuration of the information processing system according to the fifth embodiment is the same as that of the information processing system according to the first embodiment shown in FIG. 1, and its description is omitted. Likewise, the functional configuration of the information processing system according to the fifth embodiment is the same as that of the information processing system according to the first embodiment shown in FIG. 1, and its description is omitted. Furthermore, the processing procedure of the information processing method according to the fifth embodiment is the same as that of the first embodiment shown in FIG. 3 except for step S140, so the description of steps S110 to S130 is omitted.

Type extraction basically requires classifying the type from the content of the type area after that area has been identified, so it is performed in three steps: identifying the type area from ending information, extracting the type term in the type area, and identifying the type. Clinical department extraction basically only needs to extract a department name, so it is performed in two steps: identifying the clinical department area and extracting the department name in that area. Patient information is extracted in the same way as the clinical department.

Here, the identification process for clinical department extraction in step S140 of the present embodiment will be described.

FIG. 15 is a flowchart showing an example of the procedure of the clinical department extraction process in step S140 of FIG. 3 according to the fifth embodiment of the present invention.

First, in step S4401, the candidate area setting unit 120 inputs the candidate area information to the extraction information identification unit 140.

Subsequently, in step S4402, the extraction information identification unit 140 determines whether the candidate area being processed contains an ending listed in the clinical department ending dictionary.

If such an ending is present, in step S4403 the extraction information identification unit 140 identifies the candidate area as the clinical department area. Then, in step S4404, the extraction information identification unit 140 extracts, as the clinical department name, a term in that area that is listed in the clinical department term dictionary.

If no such ending is present, it is determined in step S4405 whether any unprocessed candidate area remains. If one does, the processing of steps S4402 to S4404 is repeated. If none remains, it is concluded that no candidate area corresponds to a clinical department and that the document contains no clinical department information.
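The flow of steps S4401 to S4405 can be sketched roughly as below. The dictionary contents, the use of a simple substring test for "contains an ending", and the behavior when an ending matches but no dictionary term is found (the sketch returns no name for that area) are assumptions made for illustration.

```python
def extract_department(candidate_texts, ending_dict, term_dict):
    """Return (area_text, department_name), or None if no area matches.

    candidate_texts: recognized strings of the candidate areas (S4401).
    ending_dict / term_dict: hypothetical stand-ins for the clinical
    department ending dictionary and term dictionary.
    """
    for text in candidate_texts:                      # loop closed by S4405
        if not any(ending in text for ending in ending_dict):  # S4402
            continue                                  # try the next area
        # S4403: this area is identified as the clinical department area.
        for term in term_dict:                        # S4404
            if term in text:
                return text, term
        return text, None  # assumed handling: area found, name not in dictionary
    return None            # no clinical department information


ENDINGS = ["Department"]                      # hypothetical dictionary entries
TERMS = ["Internal Medicine", "Surgery"]      # hypothetical dictionary entries

texts = ["Patient ID: 0001", "Internal Medicine Department"]
result = extract_department(texts, ENDINGS, TERMS)
```

Swapping in a different pair of dictionaries reuses the same loop for another extraction target, which is the point made about replacing the knowledge information.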

In the present embodiment, clinical department information or patient information, in addition to the type, is extracted from the document image. It suffices to replace the knowledge information according to the extraction target.

According to the fifth embodiment, in addition to the effects of the first, second, and fourth embodiments, information other than type information can also be extracted.

(Sixth embodiment)
Next, a sixth embodiment will be described.

The first, second, third, fourth, and fifth embodiments described above mainly dealt with extracting only one kind of information among the type, the clinical department, and the patient information. The sixth embodiment describes the case where a plurality of pieces of information are extracted from a document image.

Here, the hardware configuration of the information processing system according to the sixth embodiment is the same as that of the information processing system according to the first embodiment shown in FIG. 1, and its description is omitted. Likewise, the functional configuration of the information processing system according to the sixth embodiment is the same as that of the information processing system according to the first embodiment shown in FIG. 2, and its description is omitted.

Next, the processing procedure of the information processing method performed by the information processing system according to the present embodiment will be described.

FIG. 16 is a flowchart showing an example of the processing procedure of the information processing method performed by the information processing system according to the sixth embodiment of the present invention.

First, in step S510, the document image analysis unit 110 divides the document image obtained by digitizing a paper document into areas. The specific processing is the same as in step S110.

Subsequently, in step S520, the candidate area setting unit 120 sets the candidate areas for the extraction target from the result of the area division. The specific processing is the same as in step S120.

Subsequently, in step S530, the candidate area recognition unit 130 recognizes the character strings in the candidate areas and records the recognition information. The specific processing is the same as in step S130.

Subsequently, in step S540, the extraction information identification unit 140 determines whether the extraction target has structural characteristics by referring to the information shown in FIG. 17.

If it is determined that the target has such characteristics, in step S550 the extraction information identification unit 140 narrows down the candidate areas based on the characteristics of the extraction target. For example, when extracting type information, which has structural characteristics, the extraction information identification unit 140 narrows the candidate areas down to those located in the upper part of the document image. The specific processing is the same as in step S340. Here, the "upper part" means, for example, the upper third of the entire document image, but is not limited to this.

Subsequently, in step S560, the extraction information identification unit 140 switches the knowledge information according to the extraction target, based on the information shown in FIG. 18.

Subsequently, in step S570, the extraction information identification unit 140 identifies the extraction target based on the recognition results of the candidate areas and the knowledge information. The specific processing is the same as in step S140. The registration unit 1 may learn the extraction target from information indicating the extraction target that the operator inputs to the registration unit 1, or the registration unit 1 may itself switch the extraction target automatically in a predetermined order.

Next, an example of managing whether an extraction target has structural characteristics, and of managing the knowledge for each extraction target, will be described.

FIG. 17 is a schematic diagram according to the sixth embodiment of the present invention, showing an example of the management of the presence or absence of structural characteristics of extraction targets and of the knowledge for extraction targets used in FIG. 16.

Reference numeral 1401 denotes a management table that records, for each extraction target, whether it has structural characteristics. Type information is basically located in the upper part of the document image, so it is treated as having structural characteristics. Clinical department information and patient information may be written anywhere in the document image, so they are treated as having no structural characteristics.

Reference numeral 1402 denotes a knowledge management table that manages the knowledge needed to extract each extraction target. Type extraction uses ending dictionary 1 and term dictionary 1 for type extraction, together with classification dictionary 1, which is needed for classification. Clinical department extraction uses ending dictionary 2 and term dictionary 2 for clinical department extraction. Patient information extraction uses ending dictionary 3 for patient information extraction.
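The two management tables can be pictured as simple lookup structures, as in the sketch below. The key names ("type", "department", "patient") and the representation of each dictionary by its label string are assumptions for illustration; the description only states which dictionaries belong to which extraction target.

```python
# Table 1401: whether each extraction target has structural characteristics.
STRUCTURAL_CHARACTERISTIC = {
    "type": True,         # basically in the upper part of the document image
    "department": False,  # may appear anywhere
    "patient": False,     # may appear anywhere
}

# Table 1402: knowledge (dictionaries) used for each extraction target.
KNOWLEDGE = {
    "type": {
        "ending": "ending dictionary 1",
        "term": "term dictionary 1",
        "classification": "classification dictionary 1",
    },
    "department": {
        "ending": "ending dictionary 2",
        "term": "term dictionary 2",
    },
    "patient": {
        "ending": "ending dictionary 3",
    },
}


def plan_extraction(target):
    """Decide whether to narrow the candidate areas (S540/S550) and
    which knowledge to use (S560) for the given extraction target."""
    return {
        "narrow_candidates": STRUCTURAL_CHARACTERISTIC[target],
        "knowledge": KNOWLEDGE[target],
    }


plans = [plan_extraction(t) for t in ("type", "department", "patient")]
```

Keeping the two tables separate mirrors the description; merging them into one structure would correspond to the variation in which the knowledge is managed collectively.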

In the present embodiment, when a plurality of pieces of information are extracted, the candidate area setting process based on structural information and the knowledge information used in the identification process are switched according to the information to be extracted. When the extraction target has structural characteristics, the present embodiment narrows down the candidate areas based on those characteristics; in addition, the extraction-target likelihood may be calculated from those characteristics and the candidate areas ranked accordingly. Furthermore, although the knowledge for the different kinds of extracted information is managed separately in the present embodiment, it may be managed collectively.

According to the sixth embodiment, in addition to the effects of the first, second, third, and fifth embodiments, information extraction that takes the characteristics of each kind of information into account can be made more efficient when a plurality of pieces of information are extracted.

In the first, second, third, fourth, fifth, and sixth embodiments described above, the character areas obtained from the analysis result of the document image are set as the candidate areas for the extraction target. However, not only character areas but also areas with other attributes, within a predetermined range, may be set broadly as candidate areas. In addition, although in the first, second, third, fourth, and sixth embodiments the extraction target area and the extracted information are identified based on the character recognition of the candidate areas and on knowledge, the character recognition results of the candidate areas may be corrected and the extraction target identified based on the corrected information and the knowledge.

(Seventh embodiment)
Next, a seventh embodiment will be described.

In the first, second, third, fourth, fifth, and sixth embodiments described above, the target information is extracted by analyzing the document image. In the seventh embodiment, information is extracted by analyzing both the document image and the medical information stored in an in-hospital system (for example, an electronic medical record system).

Here, the hardware configuration of the information processing system according to the seventh embodiment is the same as that of the information processing system according to the first embodiment shown in FIG. 1, and its description is omitted. Likewise, the functional configuration of the information processing system according to the seventh embodiment is the same as that of the information processing system according to the first embodiment shown in FIG. 2, and its description is omitted.

FIG. 18 is a flowchart showing an example of the processing procedure of the information processing method performed by the information processing system according to the seventh embodiment of the present invention.

First, in step S610, the extraction information identification unit 140 extracts the patient number from the document image. The method of the fifth embodiment can be used for this extraction.

Subsequently, in step S620, the extraction information identification unit 140 retrieves information related to that patient from the electronic medical record system. The related information is information relevant to type classification; its details are described later.

Subsequently, in step S630, the extraction information identification unit 140 checks whether there is related information relevant to type classification. If there is, the type classifications are narrowed down using the related information in step S640. If there is not, the process proceeds to step S650.

Subsequently, in step S650, the type is identified from among the type classifications. Any of the first, second, and fourth embodiments can be used for the type extraction process.

Next, an example of information processing by the information processing system according to the present embodiment will be described.

FIG. 19 is a schematic diagram showing an example of the information processing of FIG. 18 according to the seventh embodiment of the present invention.

Reference numeral 1601 denotes an example of how the structure of medical information is described in the electronic medical record system. The basic information includes the patient information, the consultation date, and whether the visit is a first visit or a return visit. The medical information includes S (chief complaint), O (findings), A (examination), and P (plan).

Reference numeral 1602 denotes examples of related information, contained in the medical information of the electronic medical record, that is relevant to type classification. In the basic information, examples are terms such as "first visit" or "return visit". In the medical information, examples are terms such as "surgery scheduled" or "inpatient treatment".

Reference numeral 1603 denotes the classification dictionary that is ordinarily used in the type extraction process.

In an example of extracting a term relevant to type classification from the basic information and narrowing down the type classification candidates, the related information "first visit" is first extracted from 1601. For a first visit, the document image cannot be of a type such as a consent form or a record/report, so those types are excluded from the type candidates. The type is then determined and classified from among the type numbers "01" and "10", which can be associated with "first visit".

Likewise, when a term relevant to type classification is extracted from the medical information, the type of the document image is identified from the range of type classifications corresponding to the extracted related term, in the same manner as above.
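The narrowing of type classification candidates in steps S630 and S640 might be sketched as follows. Only the association of "first visit" with type numbers "01" and "10" comes from the description; the other type numbers, the placeholder labels, and the choice to intersect the ranges when several related terms are found are assumptions for illustration.

```python
# Hypothetical stand-in for classification dictionary 1603.
ALL_TYPES = {
    "01": "type 01",        # label not given in the description
    "10": "type 10",        # label not given in the description
    "20": "consent form",   # type number is an assumption
    "30": "record/report",  # type number is an assumption
}

# Related term -> type numbers it can be associated with.
ASSOCIATION = {
    "first visit": {"01", "10"},  # from the description
}


def narrow_types(all_types, related_terms, association):
    """Return the type classifications that remain as candidates.

    If no related term is known (S630: no related information), all
    classifications are kept and identification proceeds as usual.
    """
    allowed = None
    for term in related_terms:
        if term in association:
            numbers = association[term]
            # Assumed policy: intersect the ranges of several terms.
            allowed = numbers if allowed is None else allowed & numbers
    if allowed is None:
        return dict(all_types)
    return {no: label for no, label in all_types.items() if no in allowed}


remaining = narrow_types(ALL_TYPES, ["first visit"], ASSOCIATION)
```

The type is then identified only among the remaining classifications, so fewer dictionary entries need to be compared against the recognized text.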

In the present embodiment, content related to the information to be extracted is retrieved from the electronic medical record system and used to narrow down the candidates for the extracted information. The electronic medical record system was used as an example, but the system may cooperate with other related systems. Information related to type extraction was given as an example, but other related information may be set. Type extraction was described as an example, but clinical department extraction or the extraction of other information may be performed instead. Furthermore, in the present embodiment, the type classification candidates are narrowed down by the related information and the type is identified from the possible type classifications. However, as in the first, second, third, fourth, and fifth embodiments, the type classification may be identified first, and the extraction result then verified against the type classifications narrowed down from the related information.

According to the seventh embodiment, in addition to the effects of the first, second, third, fourth, and sixth embodiments, an information extraction mechanism that cooperates with related systems can be realized.

(Eighth embodiment)
Next, an eighth embodiment will be described.

In the first, second, third, fourth, fifth, sixth, and seventh embodiments described above, information such as type information is automatically extracted from non-fixed-format medical documents. The eighth embodiment concerns information extraction from non-fixed-format documents in general fields.

For example, at banks, capturing documents and data related to operations such as account opening, loan processing, and mortgages is at present basically done by hand. For foreign remittances denominated in US dollars, for instance, US OFAC regulations make it necessary to confirm whether the locations of the parties to a transaction include a prohibited country, or whether the parties include a problematic corporation or individual. This work is very laborious, so support for making it more efficient is needed.

Here, an eighth embodiment is presented in which, to improve work efficiency, the necessary information is automatically extracted from documents with various formats and the documents are classified. The hardware configuration of the information processing system according to the eighth embodiment is the same as that of the information processing system according to the first embodiment shown in FIG. 1, and its description is omitted. The functional configuration of the information processing system according to the eighth embodiment is the same as that of the information processing system according to the first embodiment shown in FIG. 2, and its description is omitted. The hardware configuration of the information processing system according to the third embodiment is the same as that of the information processing system according to the first embodiment shown in FIG. 2, and its description is also omitted. The processing procedure of the information processing method according to the eighth embodiment is the same as that of the first embodiment shown in FIG. 3 except for step S140, so the description of steps S110 to S130 is omitted.

Next, the knowledge-based identification of the extraction target in step S140 will be described.

FIG. 20 is a flowchart according to the eighth embodiment of the present invention, showing an example of the procedure, in step S140 of FIG. 3, for identifying the extraction target based on knowledge and supporting the work of confirming whether a transaction is subject to regulation.

First, in step S7401, the candidate area setting unit 120 inputs candidate area information to the extraction information identification unit 140.

Next, in steps S7402 to S7406, the extraction information identification unit 140 checks whether the candidate information corresponds to the content of any basic extraction item and judges whether the document is subject to transaction regulation. Details are described below.

In step S7402, the extraction information identification unit 140 takes out a basic extraction item n. In step S7403, it takes out a content m corresponding to the basic extraction item n.

In step S7404, it is checked whether the candidate information contains anything corresponding to the content m of the basic extraction item n. If it does, the document is judged to require closer examination, and the process proceeds to step S7407. If it does not, the process proceeds to step S7405, which checks whether all contents of the basic extraction item n have been checked. If unchecked contents remain, the process returns to step S7403 and repeats steps S7403 to S7404. When all contents of the basic extraction item n have been checked, step S7406 checks whether all basic extraction items have been checked. If unchecked basic extraction items remain, the process returns to step S7402 and repeats steps S7402 to S7406. If nothing in the candidate information corresponds to any content of any basic extraction item, the process proceeds to step S7412, and the document image is judged not to be subject to regulation.

Steps S7407 to S7413 are the closer examination performed when step S7404 finds something corresponding to the content of a basic extraction item. Details are described below.

In step S7407, an extraction item n' is taken out. In step S7408, a content m' corresponding to the extraction item n' is taken out.

In step S7409, it is checked whether the candidate information contains anything corresponding to the content m' of the extraction item n'. If it does, the process proceeds to step S7413, and the document is judged to be subject to regulation. If it does not, the process proceeds to step S7410, which checks whether all contents of the extraction item n' have been checked. If unchecked contents remain, the process returns to step S7408 and repeats steps S7408 to S7409. When all contents of the extraction item n' have been checked, step S7411 checks whether all extraction items have been checked. If unchecked extraction items remain, the process returns to step S7407 and repeats steps S7407 to S7411. If nothing in the candidate information corresponds to any content of any extraction item, the process proceeds to step S7412, and the document image is judged not to be subject to regulation.
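The two-stage flow of FIG. 20 can be illustrated with a minimal Python sketch. This is not the implementation of the embodiment: the function names, the dictionary layout of the knowledge, and the exact-match comparison are assumptions made for illustration, and the step numbers in the comments only indicate roughly which part of the flowchart each line mirrors.

```python
from typing import Dict, Iterable, List, Set


def matches_any(candidates: Set[str], items: Dict[str, List[str]]) -> bool:
    """Return True as soon as any content of any item appears in the candidates."""
    for contents in items.values():        # S7402 / S7407: take the next item
        for content in contents:           # S7403 / S7408: take the next content
            if content in candidates:      # S7404 / S7409: compare with candidates
                return True
    return False                           # every item and content was checked


def screen_document(candidates: Iterable[str],
                    basic_items: Dict[str, List[str]],
                    extraction_items: Dict[str, List[str]]) -> str:
    """Hypothetical helper: classify one document from its candidate strings."""
    candidate_set = set(candidates)        # S7401: candidate information is input
    if not matches_any(candidate_set, basic_items):
        return "not regulated"             # S7412: no basic extraction item matched
    if matches_any(candidate_set, extraction_items):
        return "regulated"                 # S7413: closer examination found a match
    return "not regulated"                 # S7412: closer examination found nothing


# Placeholder knowledge; the country and trader names are invented for the example.
basic = {"remittance currency": ["USD"]}
detail = {"country name": ["Country A"], "trader": ["Company X"]}
result = screen_document(["USD", "Country A"], basic, detail)
```

The early return in `matches_any` corresponds to leaving the nested loops at the first match, which is why the second stage runs only for documents that pass the first.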

FIG. 21 shows the eighth embodiment of the present invention and is a schematic diagram showing an example of the information processing of FIG. 20.

Reference numeral 1801 denotes an example of a form used for overseas remittance. The items to be checked for transaction regulation are the remittance currency, the country name, and the trader, each enclosed in an ellipse.

Reference numeral 1802 denotes an example of the knowledge used to check for transaction regulation targets. The knowledge information consists of basic extraction items 18030, extraction items 18040, a content list 18031 for each item in the basic extraction items, and content lists 18041 and 18042 for each item in the extraction items. For example, for item 01 "remittance currency" of the basic extraction items 18030, the content number is set to "0101" and the content to "USD". Item 11 "country name" of the extraction items 18040, for example, has multiple contents, which are listed in order. Descriptions in languages other than Japanese are also recorded in association with each content.

In the information processing described above, a document containing anything corresponding to "remittance currency" "USD", which is set as a basic extraction item, becomes a target for closer examination. It is then checked whether the document corresponds to the list of names of countries with which transactions are prohibited, or to the list of problematic corporations and individuals, both of which are set as extraction items.
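One possible way to hold the knowledge of FIG. 21 is sketched below in Python. Only item 01 "remittance currency", content number "0101", content "USD", and item 11 "country name" come from the description; the class names, the item number "12", the other content numbers, and the country and company names are hypothetical placeholders. Normalizing with NFKC is also an assumption, added so that full-width text is compared in the same form as half-width text.

```python
import unicodedata
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Content:
    number: str                                        # content number, e.g. "0101"
    text: str                                          # primary description
    aliases: List[str] = field(default_factory=list)   # descriptions in other languages


@dataclass
class Item:
    number: str                                        # item number, e.g. "01"
    name: str
    contents: List[Content]


def normalize(text: str) -> str:
    """Fold full-width forms and letter case so that variants compare equal."""
    return unicodedata.normalize("NFKC", text).strip().casefold()


def find_content(items: List[Item], text: str) -> Optional[str]:
    """Return the content number matching `text`, or None if nothing matches."""
    needle = normalize(text)
    for item in items:
        for content in item.contents:
            names = [content.text, *content.aliases]
            if needle in {normalize(name) for name in names}:
                return content.number
    return None


BASIC_ITEMS = [Item("01", "remittance currency", [Content("0101", "USD")])]
EXTRACTION_ITEMS = [
    # Placeholder entries; real lists would come from the applicable regulations.
    Item("11", "country name", [Content("1101", "Country A", ["Ａ国"])]),
    Item("12", "trader", [Content("1201", "Company X")]),
]
```

Keeping the aliases next to each content lets a single lookup cover every recorded language, at the cost of normalizing each entry on every call; a precomputed index would avoid that for large lists.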

This embodiment uses knowledge of financial operations to extract information automatically from financial forms. Overseas remittance is used here as an example of a financial operation, but the embodiment may be applied to other operations involving document images. In the overseas remittance example, the items to be checked are managed separately as basic extraction items and extraction items, but they may be managed together or organized in some other structure.

According to the eighth embodiment, replacing the knowledge required for information extraction with that of the target field and applying the proposed architecture makes the approach applicable to operations other than medical care.

The first through eighth embodiments described above extract information from scanned document images, but images captured by a camera may also be subjected to the information extraction processing. In that case, image correction processing for camera input images may be added.

(Other embodiments)
Needless to say, the object of the present invention can also be achieved by supplying a system or apparatus with a computer-readable storage medium on which the program code of software that realizes the functions of the above-described embodiments is recorded. It can likewise be achieved when the computer (or CPU or MPU) of the system or apparatus reads and executes the program code stored in the storage medium.

In this case, the program code read from the storage medium itself realizes the functions of the above-described embodiments, and the storage medium storing the program code constitutes the present invention.

As a storage medium for supplying the program code, for example, a flexible disk, a hard disk, an optical disk, a magneto-optical disk, a CD-ROM, a CD-R, a nonvolatile memory card, or a ROM can be used.

The functions of the above-described embodiments are realized when the computer executes the program code it has read. Needless to say, this also includes the case where an OS (operating system) or the like running on the computer performs part or all of the actual processing based on the instructions of the program code, and the above-described embodiments are realized by that processing.

The embodiments described above may also be combined.

110 Document image analysis unit
120 Candidate area setting unit
130 Candidate area recognition unit
140 Extraction information identification unit
150 Registration unit

Claims

An information processing apparatus comprising:
first extraction means for extracting a plurality of regions from imaged document data;
second extraction means for extracting a region including a first character or word from the plurality of regions; and
third extraction means for extracting information different from the first character or word from the region extracted by the second extraction means.

The information processing apparatus according to claim 1, further comprising a classifying unit that classifies the document data using information extracted by the third extracting unit.

The information processing apparatus according to claim 1, wherein the first character is a character at the end of a word composed of a plurality of characters.

The information processing apparatus according to any one of claims 1 to 3, further comprising holding means for holding a first reference character and a second reference character different from each other,
wherein the third extraction means compares a character included in the region extracted by the second extraction means with the first reference character, extracts the character matching the first reference character as the information if they match, and, if they do not match, compares the character included in the region extracted by the second extraction means with the second reference character.

The information processing apparatus according to claim 1, further comprising region merging means for merging the regions extracted by the first extraction means based on information about those regions,
wherein the second extraction means extracts a region including the first character or word from the merged regions.

The information processing apparatus as described, wherein the information about the region is information indicating at least one of the position of the region extracted by the first extraction means and the characters included in the region extracted by the first extraction means.

The information processing apparatus according to claim 5 or 6, wherein the region merging means merges, among the plurality of regions extracted by the first extraction means, regions whose spacing is equal to or smaller than a first threshold and whose difference in character size is equal to or smaller than a second threshold.

The information processing apparatus according to claim 7, wherein the region merging means merges, among the plurality of regions extracted by the first extraction means, regions for which the difference in character spacing between the regions is equal to or smaller than a third threshold.

A region selection unit that selects a region to be processed by the second extraction unit based on information about the region extracted by the first extraction unit;
The information processing apparatus according to any one of claims 1 to 8, wherein the second extraction unit extracts an area including the first character or word from the selected area.

The information processing apparatus according to claim 9, wherein the information about the region is information indicating at least one of the characters included in the region extracted by the first extraction means and the position of the region extracted by the first extraction means, and
the region selection means selects, from the regions extracted by the first extraction means, a region that lies within a predetermined range of the document data and contains a number of character lines equal to or less than a fourth threshold.

The information processing apparatus according to claim 1, further comprising assigning means for assigning a priority to each region extracted by the first extraction means,
wherein the second extraction means extracts a region including the first character or word in an order based on the priority.

The information processing apparatus according to claim 11, wherein the assigning means assigns the priority based on at least one of the position of the region extracted by the first extraction means and the size of the characters included in the region.

The information processing apparatus according to claim 1, wherein the document data is medical document data.

The information processing apparatus according to claim 13, wherein the third extraction unit extracts at least one of type information, medical department information, and patient identification information of the medical document data as the information.

The information processing apparatus according to claim 13 or 14 as dependent on claim 9, wherein, when the type information is extracted, the second extraction means extracts the region including the first character or word from the region selected by the region selection means, and, when the medical department information or the patient identification information is extracted, the second extraction means extracts the region including the first character or word from the plurality of regions extracted by the first extraction means without using the region selection means.

The information processing apparatus according to claim 13 or 14 as dependent on claim 2, wherein the classification means classifies the medical document data using information obtained from an electronic medical record and the information extracted by the third extraction means.

The information processing apparatus according to claim 16, wherein the information obtained from the electronic medical record is information indicating whether or not a first visit is made.

An information processing method comprising:
a first extraction step of extracting a plurality of regions from imaged document data;
a second extraction step of extracting a region including a first character or word from the plurality of regions; and
a third extraction step of extracting information different from the first character or word from the region extracted in the second extraction step.

A program causing a computer to execute each step according to claim 18.

A storage medium storing the program according to claim 19.