JP6529254B2

JP6529254B2 - INFORMATION PROCESSING APPARATUS, INFORMATION PROCESSING METHOD, PROGRAM, AND STORAGE MEDIUM

Info

Publication number: JP6529254B2
Application number: JP2014263172A
Authority: JP
Inventors: 暁艶戴
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2014-12-25
Filing date: 2014-12-25
Publication date: 2019-06-12
Anticipated expiration: 2034-12-25
Also published as: JP2016122404A

Description

The disclosed technology relates to an information processing apparatus, an information processing method, a program, and a storage medium.

Information technology is rapidly being adopted in the medical field, centered on electronic medical records. At the same time, many kinds of paper-based medical information remain in use within hospitals. Paper-based medical information includes, for example, care-related documents such as referral letters (medical information provision forms), explanation and consent forms, documents required at admission and discharge, and diagnostic documents, as well as administrative documents such as order slips, appointment slips, and application forms.

In an environment where paper-based medical information (paper documents) coexists with electronic information such as electronic medical records, it is desirable that paper documents, and not only electronic information, can be searched and used quickly.

Therefore, to keep paper documents readable, a commonly used workflow digitizes each paper document with a scanner, manually registers basic information such as type information indicating the kind of paper document, clinical department information, and the patient number, and links the result to an electronic system. However, hospitals may use several thousand or more kinds of paper documents, and each hospital has its own formats, so registering this basic information from paper documents takes an enormous amount of time and effort.

As a way to reduce the labor of registering the basic information contained in a paper document, Patent Document 1 discloses a method in which a barcode is added to the paper document and read by a barcode reader to extract and register that basic information.

Patent Document 2 discloses storing a character string to be extracted from a form (the name of the form's issuer) and recognizing the form by matching this character string against the recognition result of the form.

Patent Document 1: Japanese Patent No. 5356905
Patent Document 2: Japanese Patent Application Laid-Open No. 2001-312694 (JP 2001-312694 A)

However, the method of Patent Document 1 requires a barcode to be attached to each paper document in advance when digitizing large volumes of medical records and medical questionnaires, so the manual work is cumbersome and burdensome. In the method of Patent Document 2, the entire character string to be extracted is matched against the recognition result of the form. If the match fails, an entire new character string, different from the one that failed to match, must be matched against the recognition result in the same way, so recognizing the form takes time.

The disclosed technology has been made in view of these circumstances, and one of its objects is to automatically extract information from paper documents more simply and quickly.

The object is not limited to the above. Achieving operational effects that are derived from the configurations shown in the embodiments described later, and that cannot be obtained with conventional technology, can also be regarded as another object of the present disclosure.

An information processing apparatus according to the disclosed technology comprises: first extraction means for extracting a plurality of areas from imaged medical document data; second extraction means for extracting, from the plurality of areas, an area that includes a first character; and third extraction means for extracting clinical department information of the medical document data from the area extracted by the second extraction means, wherein the first character includes 「科」 (the character meaning "department", as in 診療科, "clinical department").

According to the disclosed technology, information can be automatically extracted from an imaged paper document simply and quickly.

A diagram showing an example of the configuration of the information processing system according to the first embodiment.
A block diagram showing an example of the functional configuration of the information processing apparatus according to the first embodiment.
A flowchart showing an example of the operation of the information processing apparatus according to the first embodiment.
A flowchart showing an example of the procedure of the candidate area setting process in step S120 of FIG. 3 according to the first embodiment.
A flowchart showing an example of the procedure of the extraction target identification process in step S140 of FIG. 3 according to the first embodiment.
A schematic diagram showing an example of the candidate area setting process in step S120 of FIG. 4 and the extraction target identification process in step S140 of FIG. 6 according to the first embodiment.
A schematic diagram showing an example of the knowledge structure according to the first embodiment.
A flowchart showing an example of the processing procedure of the information processing method performed by the information processing system according to the second embodiment.
A flowchart showing an example of the procedure of the candidate area correction process in step S'230 of FIG. 8 according to the second embodiment.
A schematic diagram showing an example of the candidate area correction process in step S'230 of FIG. 8 according to the second embodiment.
A flowchart showing an example of the processing procedure of the information processing method performed by the information processing system according to the third embodiment.
A flowchart showing an example of the procedure of the candidate area narrowing-down process in step S240 of FIG. 11 according to the third embodiment.
A schematic diagram showing an example of the candidate area narrowing-down process in step S240 of FIG. 11 according to the third embodiment.
A flowchart showing an example of the processing procedure of the information processing method performed by the information processing system according to the fourth embodiment.
A flowchart showing an example of the procedure of the clinical department extraction process in step S140 of FIG. 3 according to the fifth embodiment.
A flowchart showing an example of the processing procedure of the information processing method performed by the information processing system according to the sixth embodiment.
A schematic diagram, relating to FIG. 16 according to the sixth embodiment, showing an example of whether the extraction target has structural characteristics and of knowledge management for the extraction target.
A flowchart showing an example of the processing procedure of the information processing method performed by the information processing system according to the seventh embodiment.
A schematic diagram showing an example of the information processing of FIG. 17 according to the seventh embodiment.
A flowchart showing an example of a procedure, according to the eighth embodiment, for identifying the extraction target based on knowledge in step S140 of FIG. 3 and supporting the work of confirming whether it is subject to transaction restrictions.
A schematic diagram showing an example of the information processing of FIG. 19 according to the eighth embodiment.

Hereinafter, the information processing apparatus according to the present embodiment will be described in detail with reference to the drawings. The components described in the embodiments are merely examples; the technical scope of the present invention is defined by the claims and is not limited by the individual embodiments below.

(First Embodiment)
First, the first embodiment will be described.

FIG. 1 shows an example of the configuration of the information processing system according to the first embodiment.

As shown in FIG. 1, the information processing system includes a registration unit 1 (information processing apparatus) and a storage unit 2. The registration unit 1 and the storage unit 2 are communicably connected to each other via a wired or wireless network 6. They are also communicably connected, via the network 6, to various systems in the hospital (electronic medical record system 3, ordering system 4, and other systems 5). There may be a plurality of registration units 1 and storage units 2.

The registration unit 1 will now be described in detail. The registration unit 1 is an information processing apparatus such as a PC. It includes a UI device 101, a CPU 102, a RAM 103, a communication IF 104, a UI display unit 105, a program storage area 106, and a data storage area 107.

The UI device 101 includes at least one of a mouse, a digitizer, a keyboard, and the like, and is used by the user to check, correct, and send registration information.

The CPU 102 interprets and executes programs loaded from the program storage area 106 into the RAM 103, and can thereby perform various kinds of control and computation within the apparatus and display the UI. For example, by executing a program, the CPU 102 functions as a document image analysis unit 110, a candidate area setting unit 120, a candidate area recognition unit 130, an extraction information identification unit 140, and a registration unit 150, as shown in FIG. 2. The registration unit 1 may have one or more CPUs 102 and one or more RAMs 103. That is, at least one processing device (CPU) and at least one storage device (RAM) are connected, and the registration unit 1 functions as the above units when the at least one processing device executes a program stored in the at least one storage device.

The document image analysis unit 110 acquires and analyzes a document image, that is, a paper document digitized by a scanner (not shown). Digitization by a scanner can also be described as imaging; in other words, the document image is an example of imaged document data. An imaged medical document is referred to as medical document data. The document image analysis unit 110 may acquire the document image directly from the scanner or, if the document image obtained by the scanner has been saved in the storage unit 2, may acquire it from the storage unit 2.

The document image analysis unit 110 analyzes the layout of the document image and divides it into a plurality of areas, such as character areas and photo areas (area division), thereby extracting the areas. That is, the document image analysis unit 110 corresponds to an example of first extraction means for extracting a plurality of areas from imaged document data.

Through area division, the document image analysis unit 110 acquires, for each area, its coordinates and attribute information indicating whether it is a character area or a photo area. This attribute information can be obtained by various known methods. The means for digitizing the paper document is not limited to a scanner; other means may be used.
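
The per-area output of area division can be pictured as a small record holding coordinates and an attribute. The following Python sketch is only an illustration under stated assumptions: the names `Area`, `AreaAttribute`, and `describe`, the bounding-box convention (left, top, right, bottom in pixels), and the sample values are hypothetical and are not specified by the disclosure.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class AreaAttribute(Enum):
    """Attribute information attached to each divided area."""
    CHARACTER = "character"
    PHOTO = "photo"


@dataclass(frozen=True)
class Area:
    # Assumed convention: pixel coordinates of the bounding box.
    left: int
    top: int
    right: int
    bottom: int
    attribute: AreaAttribute


def describe(areas: List[Area]) -> List[str]:
    """Return one human-readable line per area, numbered from 1."""
    return [
        f"area {i}: {a.attribute.value} at ({a.left}, {a.top})-({a.right}, {a.bottom})"
        for i, a in enumerate(areas, start=1)
    ]


# Hypothetical result of area division for one scanned page.
areas = [
    Area(40, 30, 560, 80, AreaAttribute.CHARACTER),
    Area(40, 100, 300, 400, AreaAttribute.PHOTO),
    Area(320, 100, 560, 400, AreaAttribute.CHARACTER),
]

for line in describe(areas):
    print(line)
```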

The candidate area setting unit 120 sets, from among the areas divided by the document image analysis unit 110, candidate areas from which information is to be extracted. Specifically, the candidate area setting unit 120 sets character areas as candidate areas. In other words, among the areas divided by the document image analysis unit 110, photo areas are not treated as candidate areas. The processing of the candidate area setting unit 120 may be omitted, in which case the extraction information identification unit 140 identifies the document type and the like using the dictionaries described later without candidate areas being set. The processing of the candidate area setting unit 120 shortens the time needed to identify the document type and the like, but the effects described above can still be obtained if it is omitted.

The candidate area recognition unit 130 acquires character recognition information by recognizing the contents of the candidate areas set by the candidate area setting unit 120. The character recognition information is the result of recognizing the contents of a candidate area.

The extraction information identification unit 140 identifies an extraction target area from among the candidate areas based on the recognition result of the candidate area recognition unit 130, and identifies basic information from what is written in the identified area. Specifically, the extraction information identification unit 140 identifies the extraction target area from the candidate areas using knowledge prepared in advance, such as dictionaries. It then identifies, for example, the document type from the identified area, again using such knowledge. The dictionaries and other knowledge are described in detail later. This knowledge may be stored in the RAM 103, in the data storage area 107, or in a ROM (not shown) of the registration unit 1.

The registration unit 150 registers (records) the document image in a predetermined storage means using the information identified by the extraction information identification unit 140. For example, the registration unit 150 associates the paper document type identified by the extraction information identification unit 140 with the document image and registers them as registration information 10 in the data storage area 107 or the like. The registration unit 150 may instead store the registration information 10 in the storage unit 2.

In the above example the CPU 102 functions as each unit shown in FIG. 2, but the configuration is not limited to this; an FPGA may provide at least some of these functions, and the functions may be distributed among a plurality of CPUs. The program storage area 106 may be provided inside or outside the registration unit 1, and may consist of a single storage device such as a memory or of a plurality of storage devices.

The communication IF 104 is connected to the network 6 and serves as the communication interface between the registration unit 1 and both the storage unit 2 and the various in-hospital servers 3 to 5.

The UI display unit 105 is an LED, a liquid crystal panel, or the like that displays the state of the apparatus, image information, and registered contents.

The program storage area 106 and the data storage area 107 are, specifically, a hard disk or flash memory, although they are not limited to any particular storage medium. In the registration unit 1, the registration information 10 is stored in the data storage area 107. The registration information 10 may instead be stored in the storage unit 2, or may be stored directly in association with a system in the hospital (for example, the electronic medical record system 3).

The storage unit 2 will now be described in detail, assuming that the registration information is placed in the storage unit 2. The storage unit 2 consists of at least one storage medium such as an HDD or SSD, and stores a binder pool 20. The binder pool 20 contains binders 201 and 202, and each binder contains medical documents. That is, the storage unit 2 manages medical documents in units called binders. The binder pool 20 may be stored in association with a system in the hospital (for example, the electronic medical record system 3). Within the binder pool 20, registered materials are stored binder by binder according to a predetermined rule so that the information is easy to use. For example, materials of all types may be kept together for each patient, or materials may be kept together for each type. For instance, based on the paper document type identified by the extraction information identification unit 140, the registration unit 150 can store registration information including the document image in a binder for each type.
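
The two ways of organizing binders mentioned above (per patient or per type) amount to grouping registration records by a chosen key. The Python sketch below illustrates this under stated assumptions: `RegistrationInfo`, its fields, `build_binders`, the file names, and the type labels are hypothetical and are not the format used by the disclosed storage unit.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class RegistrationInfo(NamedTuple):
    # Hypothetical fields: a scanned image plus identified basic information.
    image_file: str
    patient_number: str
    document_type: str


def build_binders(records: List[RegistrationInfo], key: str) -> Dict[str, List[RegistrationInfo]]:
    """Group records into binders by 'patient_number' or 'document_type'."""
    if key not in ("patient_number", "document_type"):
        raise ValueError(f"unsupported binder key: {key}")
    binders: Dict[str, List[RegistrationInfo]] = defaultdict(list)
    for record in records:
        binders[getattr(record, key)].append(record)
    return dict(binders)


records = [
    RegistrationInfo("scan_001.png", "P001", "questionnaire"),
    RegistrationInfo("scan_002.png", "P001", "consent form"),
    RegistrationInfo("scan_003.png", "P002", "questionnaire"),
]

by_type = build_binders(records, "document_type")
by_patient = build_binders(records, "patient_number")
print(sorted(by_type))     # binder names when grouping by type
print(sorted(by_patient))  # binder names when grouping by patient
```

Either grouping can be rebuilt from the same records, so the choice of rule affects how documents are found rather than what is stored.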

With this configuration, the registration information can be referenced throughout the information processing system.

The network 6 may be an intranet operated within the hospital or organization, or it may be the Internet.

Since the electronic medical record system and the ordering system are widely used and well-known apparatuses, descriptions of their hardware configurations and operation flows are omitted.

Next, an example of the processing procedure of the information processing method performed by the information processing system according to the present embodiment will be described.

FIG. 3 is a flowchart showing an example of the processing procedure of the information processing method performed by the information processing apparatus according to the first embodiment.

First, in step S110, the document image analysis unit 110 acquires a document image, that is, a paper document digitized by a scanner (not shown). The document image analysis unit 110 then analyzes the layout of the document image and divides it into character areas and photo areas (area division). As the area division method, a known method such as the one disclosed in Japanese Patent Application Laid-Open No. 2002-314806 can be used.

Next, in step S120, the candidate area setting unit 120 sets, from the analysis result of the document image, the areas that are candidates for extraction. This process is described in detail later.

Next, in step S130, the candidate area recognition unit 130 recognizes the character strings in the candidate areas and records recognition information. The recognition information includes, for example, the recognition result and the number of characters of each character string and, for a paragraph, the number of lines. A known character recognition technique can be used for the recognition process.
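
As a rough illustration of the recognition information recorded in step S130, the Python sketch below derives the character count and line count from an already recognized string. It is a minimal sketch under stated assumptions: the character recognition itself is not shown, the names `RecognitionInfo` and `summarize_recognition` are hypothetical, and counting only non-whitespace characters and non-blank lines is an assumed convention.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecognitionInfo:
    text: str        # recognition result of the character string
    char_count: int  # number of characters (whitespace excluded, by assumption)
    line_count: int  # number of non-blank lines, relevant for paragraphs


def summarize_recognition(recognized_text: str) -> RecognitionInfo:
    """Build the recognition information for one candidate area."""
    lines = [line for line in recognized_text.splitlines() if line.strip()]
    char_count = sum(1 for ch in recognized_text if not ch.isspace())
    return RecognitionInfo(recognized_text, char_count, len(lines))


info = summarize_recognition("問診票")
print(info.char_count, info.line_count)
```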

Next, in step S140, the extraction information identification unit 140 of the information processing apparatus identifies the extraction target area based on the recognition results of the candidate areas and on knowledge information, and identifies basic information from the extraction target area. The registration unit 150 of the information processing apparatus then registers the document image using the identified information. This process is described in detail later.

Next, the candidate area setting process in step S120 will be described.

FIG. 4 is a flowchart showing an example of the procedure of the candidate area setting process in step S120 of FIG. 3 according to the first embodiment.

First, in step S1201, the document image analysis unit 110 inputs to the candidate area setting unit 120 the area information obtained by the document image analysis, namely coordinate information indicating the position of each area and attribute information indicating whether each area is a character area or a photo area.

Next, in step S1202, the candidate area setting unit 120 determines, based on the attribute information, whether an area acquired by the document image analysis unit 110 is a character area. If it is a character area, the candidate area setting unit 120 sets it as a candidate area in step S1203.

Next, in step S1204, the candidate area setting unit 120 determines whether any unprocessed areas remain. If so, the process returns to step S1202 and repeats steps S1202 to S1204; if not, the candidate area setting process ends.
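
The loop of steps S1201 to S1204 can be sketched in Python as follows. This is an illustrative sketch, not the disclosed implementation: `Area`, `CandidateArea`, `set_candidate_areas`, the string attribute values, and the bounding-box tuple layout are assumptions introduced for the example.

```python
from typing import List, NamedTuple, Tuple

BBox = Tuple[int, int, int, int]  # assumed layout: (left, top, right, bottom)


class Area(NamedTuple):
    bbox: BBox
    attribute: str  # "character" or "photo" (assumed encoding)


class CandidateArea(NamedTuple):
    number: int  # area number assigned in order of acceptance
    bbox: BBox


def set_candidate_areas(areas: List[Area]) -> List[CandidateArea]:
    """S1201: receive area information; return the character areas as candidates."""
    candidates: List[CandidateArea] = []
    for area in areas:                       # S1204: continue while unprocessed areas remain
        if area.attribute == "character":    # S1202: is this a character area?
            candidates.append(               # S1203: set it as a candidate area
                CandidateArea(len(candidates) + 1, area.bbox)
            )
    return candidates


page = [
    Area((0, 0, 100, 20), "character"),
    Area((0, 30, 100, 90), "photo"),
    Area((0, 100, 100, 120), "character"),
]
for candidate in set_candidate_areas(page):
    print(candidate.number, candidate.bbox)
```

Filtering out photo areas at this stage means the later recognition and dictionary lookups run on fewer areas, which is the time saving the description attributes to this step.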

Next, the extraction target identification process in step S140 will be described.

FIG. 5 is a flowchart showing an example of the procedure of the extraction target identification process in step S140 of FIG. 3 according to the first embodiment.

First, in step S1401, the candidate area setting unit 120 and the candidate area recognition unit 130 input candidate area information to the extraction information identification unit 140. The candidate area information includes the coordinate information of the candidate areas obtained by the candidate area setting unit 120 and the character recognition information obtained by the candidate area recognition unit 130.

Next, in steps S1402 to S1407, the extraction information identification unit 140 identifies the extraction target area based on the character recognition information of the candidate areas and the knowledge information, and identifies the contents of the extraction target area. This part is described in detail below.

First, in step S1402, the extraction information identification unit 140 determines whether the candidate area being processed contains a word ending listed in the word-ending dictionary (reference numeral 604 in FIG. 6).

If the candidate area contains such a word ending, the extraction information identification unit 140 identifies that candidate area as the extraction area in step S1403. That is, the extraction information identification unit 140 corresponds to an example of second extraction means for extracting, from a plurality of areas, an area that includes the first character. A word ending in the word-ending dictionary corresponds to an example of the first character or word; more specifically, the first character is the ending of a word made up of a plurality of characters. Although each word ending in the word-ending dictionary is a single character here, it is not limited to this and may consist of a plurality of characters.

Then, in step S1404, the extraction information identification unit 140 extracts from the extraction area a term listed in the term dictionary (reference numeral 605 in FIG. 6). A term in the term dictionary corresponds to an example of information different from the first character. That is, the extraction information identification unit 140 corresponds to an example of third extraction means for extracting, from the area extracted by the second extraction means, information different from the first character.

Then, in step S1405, the document type is identified from the extracted term based on the relationship between the term dictionary and the classification dictionary (reference numeral 606 in FIG. 6), and the extraction target identification process ends. That is, the extraction information identification unit 140 corresponds to an example of classification means for classifying document data using the information extracted by the third extraction means.

If the candidate area does not contain a word ending listed in the word-ending dictionary, the extraction information identification unit 140 determines in step S1406 whether any unprocessed candidate areas remain. If so, steps S1402 to S1405 are repeated. If not, the extraction information identification unit 140 concludes that none of the candidate areas corresponds to a type and determines that the document has no type.
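
The flow of steps S1402 to S1406 can be sketched in Python as a two-stage lookup: first a cheap test for a word ending, then a term search only in the area that passed. The sketch rests on several assumptions. The function name `identify_document_type`, the ending entries 「票」 and 「書」, and the type label "questionnaire" are hypothetical sample values. The description does not say what happens when an area contains a listed ending but no listed term, so the sketch simply moves on to the next candidate in that case.

```python
from typing import Dict, Iterable, Optional, Sequence, Tuple


def identify_document_type(
    candidate_texts: Iterable[str],
    endings: Sequence[str],
    terms: Sequence[str],
    term_to_type: Dict[str, str],
) -> Optional[Tuple[str, str, str]]:
    """Return (extraction area text, matched term, document type), or None for 'no type'."""
    for text in candidate_texts:
        # S1402: does this candidate area contain a listed word ending?
        if not any(ending in text for ending in endings):
            continue  # S1406: try the next unprocessed candidate area
        # S1403: this candidate area is treated as the extraction area.
        for term in terms:
            # S1404: look for a listed term inside the extraction area.
            if term in text and term in term_to_type:
                # S1405: map the term to a document type.
                return text, term, term_to_type[term]
        # Assumption: no listed term found, so keep scanning other candidates.
    return None  # no candidate matched: the document has no type


ENDINGS = ("票", "書")                      # hypothetical sample endings
TERMS = ("問診", "質問")                    # example terms named in the description
TERM_TO_TYPE = {"問診": "questionnaire",    # hypothetical classification labels
                "質問": "questionnaire"}

print(identify_document_type(["内科外来", "問診票", "氏名"], ENDINGS, TERMS, TERM_TO_TYPE))
```

Only the short ending is matched against every candidate area, and full terms are checked only in an area that contains one, which avoids matching whole strings against every recognition result.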

Next, an example of the extraction target identification process in the present embodiment will be described in more detail, together with the contents of the dictionaries.

図６は、第１の実施形態に係るステップＳ１２０における候補領域の設定処理と、図５のステップＳ１４０における抽出対象の同定処理の一例を示す模式図である。 FIG. 6 is a schematic view showing an example of setting processing of candidate areas in step S120 according to the first embodiment, and identification processing of extraction targets in step S140 of FIG.

６０１は、ある文書画像に対する文書画像解析部１１０による解析の結果例である。文書画像は、枠に囲まれる領域毎に分割され、また、領域毎に文字領域か写真領域、或いは、その他の属性が付与される。 Reference numeral 601 denotes an example of the result of analysis of a document image by the document image analysis unit 110. The document image is divided into areas surrounded by a frame, and a text area, a photo area, or other attributes are assigned to each area.

６０２は、文書画像の解析結果から候補領域設定部１２０によって得られた候補領域の設定結果例である。各候補領域は順番に領域番号、そして、座標情報が記録される。 Reference numeral 602 denotes an example of the setting result of the candidate area obtained by the candidate area setting unit 120 from the analysis result of the document image. Each candidate area is recorded with area numbers and coordinate information in order.

６０３は、候補領域から抽出対象の同定処理の結果である。 603 is the result of the identification process of the extraction object from the candidate area.

本実施形態においては抽出対象の同定処理に用いる語尾辞書６０４、用語辞書６０５および分類辞書６０６が不図示のＲＯＭに記憶されている。語尾辞書６０４は、種別に含まれる共通の語尾を記録する。用語辞書６０５は種別に含まれる用語を記録する。例えば、用語辞書６０５は「問診」および「質問」という用語を含む。すなわち、用語辞書６０５は互いに異なる第１の参照用の文字と第２の参照用の文字とを含んでおり、用語辞書６０５を保持する不図示のＲＯＭは保持手段の一例に相当する。分類辞書６０６は種別に関わる分類を記録する。なお、上記の辞書はＲＯＭ以外の記憶手段（プログラム記憶領域１０６、データ記憶領域１０７、格納部２など）に記憶されることとしてもよい。この場合、記憶手段が保持手段の一例に相当する。 In the present embodiment, the word-ending dictionary 604, the term dictionary 605, and the classification dictionary 606 used in the identification process of the extraction target are stored in a ROM (not shown). The word-ending dictionary 604 records endings that are common to the type names. The term dictionary 605 records terms that appear in the type names. For example, the term dictionary 605 includes the terms "問診" (medical interview) and "質問" (question). That is, the term dictionary 605 includes a first reference character and a second reference character that differ from each other, and the ROM (not shown) holding the term dictionary 605 corresponds to an example of the holding means. The classification dictionary 606 records the classifications associated with the types. The above dictionaries may instead be stored in storage means other than the ROM (the program storage area 106, the data storage area 107, the storage unit 2, or the like). In this case, that storage means corresponds to an example of the holding means.

候補領域の順番で処理する。候補領域認識部１３０により得られた候補領域０１の文字認識情報には語尾辞書６０４にある「書」という語尾が含まれるため、抽出情報同定部１４０は当該候補領域を抽出対象領域として同定する。 The candidate areas are processed in order. Since the character recognition information of candidate area 01 obtained by the candidate area recognition unit 130 includes the ending "書" registered in the word-ending dictionary 604, the extraction information identification unit 140 identifies that candidate area as the extraction target area.

また、抽出情報同定部１４０は当該抽出対象領域には用語辞書６０５にある「説明」という用語が含まれると判断する。具体的には、抽出情報同定部１４０は用語辞書６０５に含まれる用語と抽出対象領域に含まれる文字とを比較し、比較結果が一致する場合には用語辞書６０５に含まれる用語が抽出対象領域から抽出されたと判断する。本実施例では抽出情報同定部１４０は「問診」という用語を抽出対象領域に含まれる文字と比較し、一致しない場合には用語辞書６０５の次の用語と抽出対象領域に含まれる文字との比較を行う。すなわち、第３抽出手段の一例である抽出情報同定部１４０は、第２抽出手段によって抽出された領域に含まれる文字と第１の参照用の文字とを比較し、比較結果が一致する場合には第１の参照用の文字に一致する文字を情報として抽出し、比較結果が一致しない場合には第２抽出手段によって抽出された領域に含まれる文字と第２の参照用の文字とを比較する。 Further, the extraction information identification unit 140 determines that the extraction target area includes the term "説明" (explanation) registered in the term dictionary 605. Specifically, the extraction information identification unit 140 compares each term in the term dictionary 605 with the characters included in the extraction target area and, when they match, determines that the term has been extracted from the extraction target area. In this example, the extraction information identification unit 140 compares the term "問診" (medical interview) with the characters included in the extraction target area and, when they do not match, compares the next term in the term dictionary 605 with those characters. That is, the extraction information identification unit 140, which is an example of the third extraction unit, compares the characters included in the region extracted by the second extraction unit with the first reference character. When they match, it extracts the characters matching the first reference character as the information; when they do not match, it compares the characters included in that region with the second reference character.

抽出情報同定部１４０は用語辞書６０５から、「説明」という用語は「０２」という「種別番号」と対応付けられると判断する。したがって、抽出情報同定部１４０は、分類辞書６０６に「０２」と対応する「説明・同意書」という種別が抽出対象（紙文書）の文書種別であると決定する。そして、登録部１５０は「説明・同意書」という種別を文書画像と対応付けてデータ記憶領域１０７または格納部２に記録する。 The extraction information identification unit 140 determines from the term dictionary 605 that the term "説明" (explanation) is associated with the type number "02". The extraction information identification unit 140 therefore determines that the type "説明・同意書" (explanation/consent form), which corresponds to "02" in the classification dictionary 606, is the document type of the extraction target (paper document). The registration unit 150 then records the type "説明・同意書" in the data storage area 107 or the storage unit 2 in association with the document image.
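
The identification flow of steps S1402 to S1406 can be illustrated with the following minimal Python sketch. It is not the implementation of the embodiment. The function name identify_type and the dictionary layout are hypothetical, and the entries other than "書", "問診", "質問", "説明", type number "02", and "説明・同意書" (that is, the ending "票", type number "01", and the type "問診票") are assumed only for illustration. The sketch also assumes that, when an ending is found but no term matches, processing simply moves on to the next candidate area.

```python
from typing import Optional

# Hypothetical dictionary contents (see the assumptions stated above).
ENDING_DICTIONARY = ["書", "票"]                              # cf. 604
TERM_DICTIONARY = {"問診": "01", "質問": "01", "説明": "02"}   # cf. 605: term -> type number
CLASSIFICATION_DICTIONARY = {"01": "問診票", "02": "説明・同意書"}  # cf. 606


def identify_type(candidate_texts: list[str]) -> Optional[str]:
    """Return the document type, or None when no candidate area yields one."""
    for text in candidate_texts:  # S1406: loop over unprocessed candidate areas
        # S1402: does the recognized text contain an ending from the dictionary?
        if not any(ending in text for ending in ENDING_DICTIONARY):
            continue
        # S1404: compare the terms one by one with the text of the extraction area
        for term, type_number in TERM_DICTIONARY.items():
            if term in text:
                # S1405: map the type number to a type via the classification dictionary
                return CLASSIFICATION_DICTIONARY[type_number]
    return None  # corresponds to "no type"


print(identify_type(["患者ID 12345", "手術説明・同意書"]))
```

Because the ending check is a cheap substring test, most candidate areas are rejected before the term dictionary is consulted at all.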

上述の如く本実施形態は、文書画像における各領域の属性情報に基づき抽出対象の候補領域を設定し、候補領域の文字認識情報及び知識情報に基づき候補領域から抽出対象領域を同定し、紙文書の種別を取得するものである。しかしながら、本発明は上記の実施形態に限定されるものではなく、例えば医用文書（紙文書）から診療科情報や、患者情報（患者ＩＤ等の患者識別情報）などを抽出する場合は、抽出対象に応じて知識情報を置き換えればよい。患者ＩＤは例えば数字である。 As described above, the present embodiment sets candidate areas for extraction based on the attribute information of each area in the document image, identifies the extraction target area from among the candidate areas based on their character recognition information and on knowledge information, and thereby obtains the type of the paper document. However, the present invention is not limited to the above embodiment. For example, when medical department information or patient information (patient identification information such as a patient ID) is to be extracted from a medical document (paper document), the knowledge information may simply be replaced according to the extraction target. The patient ID is, for example, a number.

例えば、診療科情報抽出の場合、種別抽出用の語尾辞書を「科」などを含む診療科辞書にすればよい。さらに、用語辞書は「小児」、「皮膚」などの文言を含む辞書に変更すればよい。分類辞書は必須の構成ではないが、使用する場合には分類辞書についても同様に診療科で分類を行うよう種別を「小児科」、「皮膚科」などに変更すればよい。また、本実施形態では、知識を辞書という言葉で記述したが、辞書以外の呼び方をされるものであってもよい。なお、患者情報（患者ＩＤ等）などを抽出する場合には、種別抽出用の語尾辞書を「ＩＤ」、「番号」などを含む辞書にすればよい。この場合、「ＩＤ」等の文字は領域内の末尾ではなく先頭に存在する場合が多いが、本実施形態においては説明を簡単にするために語尾辞書という文言を用いている。なお、患者情報（患者ＩＤ等）などを抽出する場合には分類を行う必要がないため用語辞書等は不要である。なお診療科情報および患者情報（患者ＩＤ等）の抽出方法の詳細については後述の第５の実施形態で述べる。 For example, for medical department extraction, the word-ending dictionary for type extraction may be replaced with a medical department dictionary containing "科" (department) and the like. The term dictionary may likewise be changed to a dictionary containing words such as "小児" (pediatric) and "皮膚" (skin). The classification dictionary is not essential, but when it is used, its types may similarly be changed to "小児科" (pediatrics), "皮膚科" (dermatology), and so on, so that classification is performed by medical department. Although knowledge is described as a "dictionary" in the present embodiment, it may be called by another name. When patient information (a patient ID or the like) is extracted, the word-ending dictionary for type extraction may be replaced with a dictionary containing "ＩＤ", "番号" (number), and the like. In this case, characters such as "ＩＤ" often appear at the beginning of a region rather than at its end, but the term "word-ending dictionary" is used in the present embodiment to simplify the description. When patient information (a patient ID or the like) is extracted, no classification is needed, so the term dictionary and the like are unnecessary. Details of the methods of extracting medical department information and patient information (a patient ID or the like) are described in the fifth embodiment below.
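
The replacement of knowledge information described above can be sketched as follows, assuming the dictionaries are passed to a single generic routine as parameters. The function name extract_by_knowledge and the dictionary layout (term mapped directly to a classification label) are hypothetical; only the entries "科", "小児", "皮膚", "小児科", and "皮膚科" are taken from the description. The sketch covers medical department extraction only and is not the method of the fifth embodiment.

```python
from typing import Optional


def extract_by_knowledge(text: str,
                         endings: list[str],
                         terms: dict[str, str]) -> Optional[str]:
    """Generic extraction: an ending check followed by a term lookup.

    `endings` plays the role of the word-ending dictionary and `terms`
    maps each term to its classification label (hypothetical layout).
    """
    if not any(ending in text for ending in endings):
        return None
    for term, label in terms.items():
        if term in text:
            return label
    return None


# Knowledge information swapped in for medical department extraction.
DEPARTMENT_ENDINGS = ["科"]
DEPARTMENT_TERMS = {"小児": "小児科", "皮膚": "皮膚科"}

print(extract_by_knowledge("小児科 外来", DEPARTMENT_ENDINGS, DEPARTMENT_TERMS))
```

Under this arrangement the routine itself is unchanged between type extraction and department extraction; only the two arguments differ.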

また、本実施形態では、医用文書の種別抽出に、文書画像を管理しやすいために種別を図６に示す分類に分けたが、これに限定されるものではなくより細かく分類することとしてもよいし、より粗く分類することとしてもよい。なお各辞書に含まれる言葉や言葉の数も図６記載の内容に限定されるものではなく任意に変更可能である。 In the present embodiment, the types used for medical document type extraction are divided into the classification shown in FIG. 6 so that document images are easy to manage. However, the classification is not limited to this and may be finer or coarser. The words contained in each dictionary, and their number, are likewise not limited to those shown in FIG. 6 and may be changed as desired.

また、本実施形態では、種別抽出用の語尾辞書、用語辞書、分類辞書を例にしたが、辞書の名称は図６記載の名称以外であってもよいし、辞書の構成を図６とは異なる構成にしてもよい。例えば、図７に示すように、用語辞書に用語及び用語と種別の関連付けのみならず、語尾との関連付けも持つようにしてもよい。この場合、語尾が見つかれば、それと組み合わせ可能な用語が含まれるかどうかのみをチェックし用語を抽出すればよい。例えば、ステップＳ１４０２では、「書」という「１０１」番号の語尾が見つかった場合、ステップＳ１４０４では、当該領域から用語辞書に含まれる用語すべてを抽出する代わりに、「１０１」番号の語尾「書」と組み合わせることが可能な用語のみを抽出する。即ち、「問診」、「説明」等だけを抽出すれば良く（「質問」を抽出しようとする必要はない）、処理の高速化を図ることが可能となる。また、図６の例に示す６０１、６０２、６０３をまとめて辞書として持っていてもよい。すなわち、辞書の形態は上記の例に限定されるものではなく他の形態とすることとしてもよい。 In the present embodiment, a word-ending dictionary, a term dictionary, and a classification dictionary for type extraction have been used as an example, but the dictionaries may have names other than those shown in FIG. 6, and their structure may differ from that of FIG. 6. For example, as shown in FIG. 7, the term dictionary may hold not only the terms and their association with types but also their association with endings. In this case, once an ending is found, it suffices to check only whether a term that can be combined with that ending is included, and to extract that term. For example, when the ending "書" with number "101" is found in step S1402, step S1404 extracts from the area only the terms that can be combined with ending "101" ("書"), instead of all terms in the term dictionary. That is, only "問診", "説明", and the like need to be extracted (there is no need to try to extract "質問"), which speeds up the processing. Further, 601, 602, and 603 shown in the example of FIG. 6 may be held together as a dictionary. That is, the form of the dictionaries is not limited to the above examples, and other forms may be used.
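
The variant in which the term dictionary also records the endings each term can be combined with may be sketched as follows. This is an illustrative sketch only. The record layout, the function name extract_terms, the ending number "102" with the ending "票", and the type numbers assigned to "問診" and "質問" are assumptions; the description states only that "問診" and "説明" can be combined with ending "101" ("書") and that "質問" cannot.

```python
# Word-ending dictionary keyed by ending number ("102"/"票" is hypothetical).
ENDINGS = {"101": "書", "102": "票"}

# Term dictionary with type number and combinable ending numbers (hypothetical layout).
TERMS = [
    {"term": "問診", "type_no": "01", "ending_nos": {"101"}},
    {"term": "説明", "type_no": "02", "ending_nos": {"101"}},
    {"term": "質問", "type_no": "01", "ending_nos": {"102"}},
]


def extract_terms(text: str) -> list[str]:
    """Return terms found in `text`, testing only those combinable with a found ending."""
    # S1402: collect the numbers of the endings that occur in the text
    found = {no for no, ending in ENDINGS.items() if ending in text}
    if not found:
        return []
    # S1404: skip every term that cannot be combined with a found ending
    return [
        entry["term"]
        for entry in TERMS
        if entry["ending_nos"] & found and entry["term"] in text
    ]


print(extract_terms("説明書"))
```

In this sketch the set intersection is evaluated before the substring test, so terms that cannot follow the detected ending are never compared with the text.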

また、本実施形態では、辞書を登録部１の内部に持たせることを例にしたが、登録部１の外部に辞書を持たせることとしてもよい。外部で定義して参照するようにしてもよい。また、本実施形態では、種別に該当する情報を見つからない文書画像において種別なしと出力するが、それ以外の出力、例えば、種別不明としてもよい。 In the present embodiment, the dictionaries are held inside the registration unit 1 as an example, but they may be held outside the registration unit 1; that is, they may be defined externally and referenced. Further, in the present embodiment, "no type" is output for a document image in which no information corresponding to a type is found, but another output, for example "type unknown", may be used.

以上、述べたように第１の実施形態によれば、紙文書から簡単に情報を自動抽出することができる。上記実施形態においてはバーコード等追加の情報を紙文書に付加する必要がないため、従来に比べて手間をかけずに文書種別等の情報を抽出することが可能となる。また、バーコード等の追加の情報を紙文書に付加する必要がないため未知のフォーマットの文書からも簡単に文書種別等の情報を抽出することが可能となる。すなわち、医用文書に人手を介する情報の付与作業が行われなくても、また、医用文書のフォーマットが予め分からなくても、文書種別等の情報を自動的に抽出できる。 As described above, according to the first embodiment, it is possible to easily and automatically extract information from a paper document. In the above embodiment, since it is not necessary to add additional information such as a barcode to the paper document, it is possible to extract the information such as the document type without taking much time as compared with the prior art. Further, since it is not necessary to add additional information such as a bar code to the paper document, it is possible to easily extract information such as the document type even from a document of unknown format. That is, information such as the document type can be extracted automatically, even if the medical document is not manually attached with information or the format of the medical document is not known in advance.

また、上記実施形態においては語尾辞書を用いて抽出領域を同定しているため、全ての領域に対して用語辞書と照らし合わせる必要がなく文書種別等の情報を高速で抽出することが可能となる。また、「問診票」など種別そのものを示す言葉を文書画像から抽出する場合には、種別を示す言葉の多さから抽出に多くの時間がかかる虞がある。しかし、本実施形態によれば語尾と用語との組み合わせを用いているため「問診票」などの種別を示す用語を抽出する時間を短縮することが可能である。ここで、医療分野においては診療科および文書の種別は病院毎に様々な呼び名があるため、本実施形態を医療分野に用いることで顕著な効果を得ることができる。 Further, in the above embodiment, since the extraction area is identified using the word end dictionary, it is not necessary to check all the areas with the term dictionary, and information such as the document type can be extracted at high speed. . In addition, in the case of extracting from the document image a word that indicates the type itself, such as the “interview sheet,” there is a risk that it takes a lot of time to extract the word that indicates the type. However, according to the present embodiment, since the combination of the ending and the term is used, it is possible to shorten the time for extracting the term indicating the type such as “interview sheet”. Here, in the medical field, the medical department and the type of the document have various names for each hospital, and therefore, it is possible to obtain a remarkable effect by using the present embodiment in the medical field.

なお、上記の例ではステップＳ１４０５において文書画像の種別を同定しているが、このステップは必須のものではなく、ステップＳ１４０４で処理を終了することとしてもよい。この場合、ステップＳ１４０４で抽出された用語を操作者が参照して分類を行うことができる。 In the above example, the type of the document image is identified in step S1405, but this step is not essential, and the process may be ended in step S1404. In this case, the operator can perform classification with reference to the terms extracted in step S1404.

（第２の実施形態） Second Embodiment
次に、本発明の第２の実施形態について説明する。 Next, a second embodiment of the present invention will be described.

上述した第１の実施形態では、文書画像の解析結果から文字領域を抽出対象の候補領域として設定した。第２の実施形態では、文書画像の解析処理によって正しい塊の領域抽出ができていない場合に領域に併合するものである。 In the first embodiment described above, the character areas obtained from the analysis result of the document image are set as the candidate areas for extraction. In the second embodiment, when the analysis of the document image fails to extract areas as correct, coherent blocks, the fragmented areas are merged.

ここで、第２の実施形態に係る情報処理システムのハードウェア構成および情報処理装置の機能構成は、図１、２と同様であるため、その説明は省略する。 Here, the hardware configuration of the information processing system and the functional configuration of the information processing apparatus according to the second embodiment are the same as those in FIGS. 1 and 2, and their description is therefore omitted.

次に、本実施形態に係る情報処理方法の処理手順の一例について説明する。 Next, an example of the processing procedure of the information processing method according to the present embodiment will be described.

図８は、第２の実施形態に係る情報処理システムによる情報処理方法の処理手順の一例を示すフローチャートである。 FIG. 8 is a flowchart illustrating an example of the processing procedure of the information processing method by the information processing system according to the second embodiment.

まず、ステップＳ’２１０において、文書画像解析部１１０は、図示しないスキャナにより得られた紙文書が電子化された文書画像を取得する。そして、文書画像解析部１１０は、紙文書の電子化された文書画像を解析し、文字領域や写真領域に分割する。本ステップはステップＳ１１０と同様である。 First, in step S ′ 210, the document image analysis unit 110 obtains a document image obtained by digitizing a paper document obtained by a scanner (not shown). Then, the document image analysis unit 110 analyzes the digitized document image of the paper document and divides it into a text region and a photo region. This step is similar to step S110.

続いて、ステップＳ’２２０において、候補領域設定部１２０は、上記文書画像の解析結果から文字領域を抽出対象の候補となる領域を設定する。具体的な処理はステップＳ１２０と同様である。 Subsequently, in step S'220, the candidate area setting unit 120 sets the character areas as candidate areas for extraction, based on the analysis result of the document image. The specific processing is the same as in step S120.

続いて、ステップＳ’２３０において、候補領域設定部１２０は、上記候補領域を補正する。この処理についての詳細は後述する。 Subsequently, in step S ′ 230, the candidate area setting unit 120 corrects the candidate area. Details of this process will be described later.

続いて、ステップＳ’２４０において、候補領域認識部１３０上記補正後の候補領域にある文字列を認識し、認識情報を記録する。本ステップはステップＳ１３０と同様である。 Subsequently, in step S ′ 240, the candidate area recognition unit 130 recognizes the character string in the corrected candidate area, and records the recognition information. This step is similar to step S130.

続いて、ステップＳ’２５０において、抽出情報同定部１４０は上記補正後の候補領域の認識結果及び、知識情報に基づき抽出対象領域を同定し、抽出対象中身を同定する。本ステップはステップＳ１４０と同様である。 Subsequently, in step S ′ 250, the extraction information identification unit 140 identifies the extraction target region based on the recognition result of the candidate region after the correction and the knowledge information, and identifies the extraction target content. This step is similar to step S140.

次に、ステップＳ’２３０における候補領域の補正処理について説明する。 Next, the correction process of the candidate area in step S ′ 230 will be described.

図９は第２の実施形態に係る図８のステップＳ’２３０における候補領域の補正処理の手順の一例を示すフローチャートである。 FIG. 9 is a flowchart showing an example of the procedure of the correction process of the candidate area in step S ′ 230 of FIG. 8 according to the second embodiment.

先ず、ステップＳ’２３０１において、ステップＳ’２２０で設定された候補領域を入力する。 First, in step S ′ 2301, the candidate area set in step S ′ 220 is input.

続いて、ステップＳ’２３０２からステップＳ’２３０６では、上記候補領域から併合すべき領域を選択し、併合する。 Subsequently, in steps S ′ 2302 to S ′ 2306, an area to be merged is selected from the above candidate areas and merged.

ステップＳ’２３０２では、候補領域設定部１２０が処理対象となる二つの候補領域間の間隔は所定の閾値Ｔ１以下であるかどうかを判断する。すなわち、候補領域設定部１２０は隣り合う二つの候補領域間の間隔を閾値Ｔ１と比較する。ここで、閾値Ｔ１は第１の閾値の一例に相当する。 In step S'2302, the candidate area setting unit 120 determines whether the interval between the two candidate areas being processed is equal to or less than a predetermined threshold T1. That is, the candidate area setting unit 120 compares the interval between two adjacent candidate areas with the threshold T1. Here, the threshold T1 corresponds to an example of a first threshold.

候補領域間の間隔は所定の閾値Ｔ１以下あれば、ステップＳ’２３０３では、候補領域設定部１２０が更に処理対象となる二つの候補領域にある文字サイズの差は所定の閾値Ｔ２以下であるかどうかを判断する。ここで、閾値Ｔ２は第２の閾値の一例に相当する。 If the interval between the candidate areas is equal to or less than the predetermined threshold T1, then in step S'2303 the candidate area setting unit 120 further determines whether the difference in character size between the two candidate areas being processed is equal to or less than a predetermined threshold T2. Here, the threshold T2 corresponds to an example of a second threshold.

候補領域にある文字サイズの差は所定の閾値Ｔ２以下であれば、ステップＳ’２３０４へ進む。ステップＳ’２３０４では、候補領域設定部１２０が更に処理対象となる一の候補領域に複数の文字が含まれている場合には、それらの文字間隔の差が所定の閾値Ｔ３以下であるかどうかを判断する。すなわち、一の候補領域に複数の文字が含まれていない場合にはステップＳ’２３０４は実行されないこととしてもよい。ここで、閾値Ｔ３は第３の閾値の一例に相当する。 If the difference in character size between the candidate areas is equal to or less than the predetermined threshold T2, the process proceeds to step S'2304. In step S'2304, when a candidate area being processed contains a plurality of characters, the candidate area setting unit 120 further determines whether the difference in character spacing is equal to or less than a predetermined threshold T3. That is, step S'2304 need not be executed when a candidate area does not contain a plurality of characters. Here, the threshold T3 corresponds to an example of a third threshold.

候補領域にある文字の間隔の差は所定の閾値Ｔ３以下であれば、ステップＳ’２３０５では、当該二つの候補領域は併合すべき領域と判断し、ステップＳ’２３０６では、候補領域設定部１２０が当該二つの候補領域同士を併合し、候補領域の情報を更新する。すなわち、候補領域設定部１２０は、第１抽出手段により抽出された領域に関する情報に基づいて第１抽出手段により抽出された領域を併合する領域併合手段の一例に相当する。また、本実施例では第２抽出手段の一例に相当する抽出情報同定部１４０は、併合された領域から第１の文字または単語を含む領域を抽出することとなる。 If the difference in character spacing between the candidate areas is equal to or less than the predetermined threshold T3, the two candidate areas are determined in step S'2305 to be areas to be merged, and in step S'2306 the candidate area setting unit 120 merges the two candidate areas and updates the candidate area information. That is, the candidate area setting unit 120 corresponds to an example of area merging means that merges the areas extracted by the first extraction means based on information on those areas. In this example, the extraction information identification unit 140, which corresponds to an example of the second extraction means, extracts an area including the first character or word from the merged areas.

続いて、ステップＳ’２３０７では、候補領域設定部１２０が未比較の領域があるかどうかを判断します。まだ未比較の領域があれば、ステップＳ’２３０２に入り、ステップＳ’２３０２からステップＳ’２３０６までの処理を繰り返して実行するが、未比較の領域がなければ、候補領域の補正処理を終了する。 Subsequently, in step S'2307, the candidate area setting unit 120 determines whether any uncompared area remains. If an uncompared area remains, the process returns to step S'2302 and repeats steps S'2302 to S'2306; if none remains, the candidate area correction process ends.
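
The correction process of steps S'2302 to S'2307 can be sketched as follows. This is a minimal sketch under stated assumptions rather than the implementation of the embodiment. The Region fields, the helper names, the definition of the interval as the larger of the horizontal and vertical bounding-box gaps, and the way the character size and character spacing of a merged area are derived are all hypothetical; the description specifies only the three threshold comparisons (T1, T2, T3) and the merge.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Region:
    left: float
    top: float
    right: float
    bottom: float
    char_size: float
    char_gap: Optional[float] = None  # None when the region holds a single character


def interval(a: Region, b: Region) -> float:
    """Gap between two bounding boxes (0 when they touch or overlap)."""
    dx = max(a.left - b.right, b.left - a.right, 0.0)
    dy = max(a.top - b.bottom, b.top - a.bottom, 0.0)
    return max(dx, dy)


def should_merge(a: Region, b: Region, t1: float, t2: float, t3: float) -> bool:
    if interval(a, b) > t1:                        # S'2302: interval vs. T1
        return False
    if abs(a.char_size - b.char_size) > t2:        # S'2303: character size vs. T2
        return False
    if a.char_gap is not None and b.char_gap is not None:
        if abs(a.char_gap - b.char_gap) > t3:      # S'2304: character spacing vs. T3
            return False
    return True                                    # S'2305: regions should be merged


def merge(a: Region, b: Region) -> Region:
    """S'2306: union of the two boxes (size/spacing bookkeeping is an assumption)."""
    if a.char_gap is not None:
        gap = a.char_gap
    elif b.char_gap is not None:
        gap = b.char_gap
    else:
        gap = interval(a, b)
    return Region(min(a.left, b.left), min(a.top, b.top),
                  max(a.right, b.right), max(a.bottom, b.bottom),
                  max(a.char_size, b.char_size), gap)


def merge_regions(regions: list[Region], t1: float, t2: float, t3: float) -> list[Region]:
    result = list(regions)
    merged_any = True
    while merged_any:                              # S'2307: repeat while pairs remain
        merged_any = False
        for i in range(len(result)):
            for j in range(i + 1, len(result)):
                if should_merge(result[i], result[j], t1, t2, t3):
                    result[i] = merge(result[i], result[j])
                    del result[j]
                    merged_any = True
                    break
            if merged_any:
                break
    return result
```

For instance, three single-character regions placed 5 units apart with equal character size would be combined into one region when T1 is 8, while a region 160 units away would be left untouched.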

次に、ステップＳ’２３０における候補領域の補正処理の一例について説明する。 Next, an example of the correction process of the candidate area in step S ′ 230 will be described.

図１０は、本発明の第２の実施形態を示し、図８のステップＳ’２３０における候補領域の補正処理の一例を示す模式図である。 FIG. 10 shows a second embodiment of the present invention, and is a schematic view showing an example of correction processing of a candidate area in step S ′ 230 of FIG.

１０００１は、ある文書画像における候補領域の設定結果例である。「同」「意」「書」は離れているため、それぞれ独立な領域として抽出されている。 Reference numeral 10001 denotes an example of setting results of candidate areas in a document image. Since "same", "mean" and "book" are separated, they are extracted as independent areas.

１０００２は、種別抽出の場合、１０００１から候補領域の補正処理の結果例である。１０００１の候補領域から領域の間隔が一定範囲Ｔ１以内、しかも、其々の領域にある文字サイズの差が一定範囲Ｔ２以内、其々の領域に複数の文字がある場合の文字列の間隔の差が一定範囲Ｔ３以内の候補領域を分断された領域として併合する。 Reference numeral 10002 denotes an example of the result of correcting the candidate areas of 10001 for type extraction. Among the candidate areas of 10001, those whose mutual interval is within a certain range T1, whose difference in character size is within a certain range T2, and, when each area contains a plurality of characters, whose difference in character-string spacing is within a certain range T3, are regarded as fragmented areas and merged.

本実施形態では、抽出対象の特性に基づき候補領域を補正し、意味のある領域にするものである。本実施形態では、候補領域の併合条件として候補領域間の間隔、候補領域にある文字サイズの差、候補領域にある文字列の間隔の差を用いたが、それ以外の条件を設定してもよい。また、候補領域が過統合場合の分割処理を例にしてもよい。なお、上記の実施例では候補領域の併合条件として候補領域間の間隔（すなわち候補領域の位置）、候補領域にある文字サイズの差、候補領域にある文字列の間隔の差の全てを用いたが、少なくとも１つを用いることとしてもよい。すなわち、領域を併合するために用いられる領域に関する情報は、第１抽出手段により抽出された領域の位置、第１抽出手段により抽出された領域に含まれる文字の少なくとも１つを示す情報である。 In the present embodiment, the candidate areas are corrected based on the characteristics of the extraction target so that they become meaningful areas. The interval between candidate areas, the difference in character size between candidate areas, and the difference in character-string spacing between candidate areas are used as the merging conditions, but other conditions may be set. Splitting of candidate areas that have been over-merged may also be performed. Although the above example uses all of the interval between candidate areas (that is, the positions of the candidate areas), the difference in character size, and the difference in character-string spacing as merging conditions, at least one of them may be used. That is, the information on the areas used for merging is information indicating at least one of the positions of the areas extracted by the first extraction means and the characters included in those areas.

第２の実施形態によれば、意味のある領域の抽出ができ、情報抽出処理の精度を向上することが可能になる。 According to the second embodiment, a meaningful area can be extracted, and the accuracy of the information extraction process can be improved.

（第３の実施形態） Third Embodiment
次に、本発明の第３の実施形態について説明する。 Next, a third embodiment of the present invention will be described.

上述した第２の実施形態では、文書画像の解析結果により意味のある領域に補正する領域にするものであった。第３の実施形態では、抽出対象の特性に基づき、候補領域を絞るものである。 In the second embodiment described above, the candidate areas are corrected into meaningful areas based on the analysis result of the document image. In the third embodiment, the candidate areas are narrowed down based on the characteristics of the extraction target.

ここで、第３の実施形態に係る情報処理システムのハードウェア構成は、図１に示す第１の実施形態に係る情報処理システムのハードウェア構成と同様であるため、その説明を省略する。また、第３の実施形態に係る情報処理システムの機能構成は、図２に示す第１の実施形態に係る情報処理システムの機能構成と同様であるため、その説明は省略する。 Here, the hardware configuration of the information processing system according to the third embodiment is the same as that of the information processing system according to the first embodiment shown in FIG. 1, and its description is therefore omitted. The functional configuration of the information processing system according to the third embodiment is likewise the same as that of the first embodiment shown in FIG. 2, and its description is also omitted.

次に、本実施形態に係る情報処理方法の処理手順について説明する。 Next, the processing procedure of the information processing method according to the present embodiment will be described.

図１１は、本発明の第３の実施形態に係る情報処理システムによる情報処理方法の処理手順の一例を示すフローチャートである。 FIG. 11 is a flowchart showing an example of the processing procedure of the information processing method by the information processing system according to the third embodiment of the present invention.

まず、ステップＳ２１０において、文書画像解析部１１０は紙文書の電子化された文書画像を解析し、文字領域や写真領域に分割する。具体的な処理はステップＳ１１０と同様である。 First, in step S210, the document image analysis unit 110 analyzes the digitized document image of the paper document, and divides it into a text region and a photograph region. The specific process is the same as step S110.

続いて、ステップＳ２２０において、候補領域設定部１２０は上記文書画像の解析結果から文字領域を抽出対象の候補となる領域を設定する。具体的な処理はステップＳ１２０と同様である。 Subsequently, in step S220, the candidate area setting unit 120 sets the character areas as candidate areas for extraction, based on the analysis result of the document image. The specific processing is the same as in step S120.

続いて、ステップＳ２３０において、候補領域認識部１３０は上記候補領域にある文字列を認識し、認識情報を記録する。具体的な処理はステップＳ１３０と同様である。 Subsequently, in step S230, the candidate area recognition unit 130 recognizes a character string in the candidate area and records recognition information. The specific process is the same as step S130.

続いて、ステップＳ２４０において、抽出情報同定部１４０は抽出対象の特性に基づき、上記候補領域を絞る。この処理の詳細については後述する。 Subsequently, in step S240, the extraction information identification unit 140 narrows down the candidate region based on the characteristics of the extraction target. Details of this process will be described later.

続いて、ステップＳ２５０において、抽出情報同定部１４０は上記候補領域の認識結果及び、知識情報に基づき抽出対象領域を同定し、抽出対象中身を同定する。具体的な処理はステップＳ１４０と同様である。 Subsequently, in step S250, the extraction information identification unit 140 identifies the extraction target region based on the recognition result of the candidate region and the knowledge information, and identifies the extraction target content. The specific process is the same as step S140.

次に、ステップＳ２４０における候補領域の絞込み処理について説明する。候補領域の絞込み処理は、以下、候補領域のフィルタリング処理とも呼ぶ。 Next, the narrowing-down process of the candidate area in step S240 will be described. The narrowing-down process of the candidate area is hereinafter also referred to as filtering process of the candidate area.

図１２は、本発明の第３の実施形態を示し、図１１のステップＳ２４０における候補領域の絞込み処理の手順の一例を示すフローチャートである。 FIG. 12 is a flowchart showing the third embodiment of the present invention, and showing an example of the procedure of narrowing-down processing of candidate areas in step S240 of FIG.

先ず、ステップＳ２４０１において、候補領域設定部１２０はステップＳ２２０で設定された候補領域を抽出情報同定部１４０に入力する。 First, in step S2401, the candidate area setting unit 120 inputs the candidate area set in step S220 to the extraction information identification unit 140.

続いて、ステップＳ２４０２からステップＳ２４０４では、抽出情報同定部１４０は上記候補領域を絞る。種別抽出の場合、種別領域は文書画像の上から一定範囲以内にある可能性が高いこと及び種別領域は複数段落の文書内に存在する可能性は低いという特性を利用して候補領域の絞込み条件として設定する。ここで、複数段落は２以上の段落でもよいし３以上の段落であってもよい。また、一定範囲内とは例えば文書画像全体の上部１／３の範囲内である。なお、一定範囲は文書画像全体の上部１／２の範囲内であってもよいし他の範囲あってもよい。また、診療科抽出または患者情報抽出の場合には絞り込みの範囲を種別抽出の場合と異なる範囲にしてもよい。すなわち、抽出対象に応じて候補領域の絞りこみ条件を変更することとしてもよい。なお、候補領域を絞るためには上記の２つの条件を使用することとしてもよいし、どちらか一方の条件を使用することとしてもよい。また、上記２つの条件に文書画像の横方向における位置等の他の条件を加えることとしてもよい。 Subsequently, in steps S2402 to S2404, the extraction information identification unit 140 narrows down the candidate areas. For type extraction, two characteristics are used as the narrowing conditions: the type area is likely to lie within a certain range from the top of the document image, and the type area is unlikely to lie within multi-paragraph text. Here, "multiple paragraphs" may mean two or more paragraphs, or three or more paragraphs. "Within a certain range" means, for example, within the upper 1/3 of the entire document image. The range may instead be the upper 1/2 of the entire document image, or some other range. For medical department extraction or patient information extraction, the narrowing range may differ from that used for type extraction. That is, the narrowing conditions for the candidate areas may be changed according to the extraction target. Both of the above conditions may be used to narrow down the candidate areas, or only one of them. Other conditions, such as the horizontal position within the document image, may also be added to the two conditions.

ステップＳ２４０２では、抽出情報同定部１４０は処理対象となる候補領域は所定の範囲以内にあるかどうかを判断する。所定の範囲以内にあれば、ステップＳ２４０３では、抽出情報同定部１４０は更に候補領域の行数は所定の閾値Ｔ以下であるかどうかを判断する。所定の閾値Ｔ以下であれば、ステップＳ２４０４では、当該候補領域を候補領域として残す。ここで、閾値Ｔは第４の閾値の一例に相当する。 In step S2402, the extraction information identification unit 140 determines whether the candidate area being processed lies within a predetermined range. If it does, then in step S2403 the extraction information identification unit 140 further determines whether the number of lines in the candidate area is equal to or less than a predetermined threshold T. If so, the candidate area is kept as a candidate area in step S2404. Here, the threshold T corresponds to an example of a fourth threshold.

ステップＳ２４０５では、所定の範囲以外にある候補領域あるいは候補領域内の文字の行数が所定の閾値Ｔ以上の候補領域を当該領域を候補領域から外す。これは文書画像の種別を示す情報は通常複数行の文書中に存在する可能性が低いことを利用したものである。上述のように、抽出情報同定部１４０は、第２抽出手段の処理対象とする領域を選択する領域選択手段の一例に相当する。 In step S2405, a candidate area that lies outside the predetermined range, or whose number of lines of characters is equal to or greater than the predetermined threshold T, is removed from the candidate areas. This exploits the fact that information indicating the type of a document image is usually unlikely to appear within multi-line text. As described above, the extraction information identification unit 140 corresponds to an example of area selection means that selects the areas to be processed by the second extraction means.

続いて、ステップＳ２４０６では、抽出情報同定部１４０は未処理の領域があるかどうかを判断します。まだ未処理の領域があれば、ステップＳ２４０２に入り、ステップＳ２４０２からステップＳ２４０５までの処理を繰り返して実行するが、未処理の領域がなければ、候補領域のフィルタリング処理を終了する。 Subsequently, in step S2406, the extraction information identification unit 140 determines whether there is an unprocessed area. If there is still an unprocessed area, the process proceeds to step S2402, and the processing from step S2402 to step S2405 is repeatedly executed, but if there is no unprocessed area, the filtering process of the candidate area is ended.
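
The filtering loop of steps S2402 to S2406 can be sketched as follows. This is an illustrative sketch only. The Candidate fields, the function name filter_candidates, the default line-count threshold of 1, and the choice to test the lower edge of an area against the upper 1/3 of the page are assumptions. The sketch keeps an area when its line count is equal to or less than the threshold, following step S2403.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    top: float        # y coordinate of the upper edge (0 = top of the page)
    bottom: float     # y coordinate of the lower edge
    line_count: int   # number of text lines recognized in the area
    text: str


def filter_candidates(candidates: list[Candidate],
                      page_height: float,
                      top_fraction: float = 1 / 3,
                      max_lines: int = 1) -> list[Candidate]:
    """Keep only areas near the top of the page that do not span many lines."""
    limit = page_height * top_fraction
    kept = []
    for c in candidates:                        # S2406: loop over unprocessed areas
        in_range = c.bottom <= limit            # S2402: within the predetermined range?
        few_lines = c.line_count <= max_lines   # S2403: line count vs. threshold T
        if in_range and few_lines:
            kept.append(c)                      # S2404: keep as a candidate area
        # otherwise the area is dropped (S2405)
    return kept


page = [
    Candidate(50, 90, 1, "説明・同意書"),
    Candidate(100, 250, 5, "本文"),
    Candidate(850, 880, 1, "フッター"),
]
print([c.text for c in filter_candidates(page, page_height=900)])
```

Changing top_fraction or max_lines corresponds to changing the narrowing conditions according to the extraction target.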

次に、ステップＳ２４０における候補領域の絞込み処理の一例について説明する。 Next, an example of the narrowing-down process of the candidate area in step S240 will be described.

図１３は、本発明の第３の実施形態を示し、図１１のステップＳ２４０における候補領域の絞込み処理の一例を示す模式図である。 FIG. 13 is a schematic view showing an example of the narrowing-down process of the candidate area in step S240 of FIG. 11 according to the third embodiment of the present invention.

１００１は、ある文書画像における候補領域の設定結果例である。枠に囲まれる領域は、候補領域として設定されるものである。 Reference numeral 1001 denotes an example of setting results of candidate areas in a certain document image. The area enclosed by the frame is set as a candidate area.

１００２は、種別抽出の場合、１００１から候補領域のフィルタリングの結果例である。１００１の候補領域から位置が一定範囲以内にある、しかも、複数行ではない枠に囲まれる領域のみが残る。これらの候補領域は同定処理の対象領域になる。 Reference numeral 1002 denotes an example of the result of filtering the candidate areas of 1001 for type extraction. Of the candidate areas of 1001, only the framed areas that lie within the certain range and do not span multiple lines remain. These candidate areas become the targets of the identification process.

本実施形態では、抽出対象の特性に基づき候補領域を絞り、残った候補領域から抽出対象を同定するものである。本実施形態では、種別抽出を例に、種別情報の特性に基づき候補領域のフィルタリングの条件を設定したが、それ以外の条件を設定してもよい。また、他の情報を抽出する場合、当該抽出情報の特性に応じてフィルタリングの条件を設定してもよい。本実施形態では候補領域の絞り込みのために、候補領域の位置（ステップＳ２４０２）および候補領域内の文字の行数（ステップＳ２４０３）を用いたが、少なくとも一つの情報を用いることとしてもよい。第３の実施形態によれば、第１の実施形態による効果に加え、情報抽出処理の効率を向上することが可能になる。 In the present embodiment, the candidate area is narrowed based on the characteristics of the extraction target, and the extraction target is identified from the remaining candidate areas. In the present embodiment, the condition of filtering of the candidate area is set based on the characteristics of the type information by taking the type extraction as an example, but other conditions may be set. Moreover, when extracting other information, you may set the conditions of filtering according to the characteristic of the said extraction information. In this embodiment, the position of the candidate area (step S2402) and the number of lines of characters in the candidate area (step S2403) are used to narrow down the candidate area, but at least one piece of information may be used. According to the third embodiment, in addition to the effects of the first embodiment, the efficiency of the information extraction process can be improved.

（第４の実施形態） Fourth Embodiment
次に、本発明の第４の実施形態について説明する。 Next, a fourth embodiment of the present invention will be described.

上述した第３の実施形態では、文書画像の解析結果から候補領域を設定し、抽出対象の特性に応じて候補領域をフィルタリングし、対象となる候補領域から抽出対象を同定するものであった。第４の実施形態では、対象となる候補領域において、抽出対象らしさの順番を付けて、その抽出対象らしさ順で抽出対象を同定していくものである。 In the third embodiment described above, candidate areas are set from analysis results of a document image, candidate areas are filtered according to the characteristics of extraction targets, and extraction targets are identified from the candidate areas to be targets. In the fourth embodiment, in the candidate area to be an object, the order of the extraction object likeness is added, and the extraction object is identified in the order of the extraction object likeness.

ここで、第４の実施形態に係る情報処理システムのハードウェア構成は、図１に示す第１の実施形態に係る情報処理システムのハードウェア構成と同様であるため、その説明を省略する。また、第４の実施形態に係る情報処理システムの機能構成は、図２に示す第１の実施形態に係る情報処理システムの機能構成と同様であるため、その説明は省略する。 Here, the hardware configuration of the information processing system according to the fourth embodiment is the same as that of the information processing system according to the first embodiment shown in FIG. 1, so its description is omitted. Likewise, the functional configuration of the information processing system according to the fourth embodiment is the same as that of the information processing system according to the first embodiment shown in FIG. 2, so its description is omitted.

図１４は、本発明の第４の実施形態に係る情報処理システムによる情報処理方法の処理手順の一例を示すフローチャートである。 FIG. 14 is a flowchart showing an example of the processing procedure of the information processing method by the information processing system according to the fourth embodiment of the present invention.

まず、ステップＳ３１０において、文書画像解析部１１０は紙文書の電子化された文書画像を解析し、文字領域や写真領域に分割する。具体的な処理はステップＳ１１０と同様である。 First, in step S310, the document image analysis unit 110 analyzes the digitized document image of the paper document, and divides the document image into a character area and a photograph area. The specific process is the same as step S110.

続いて、ステップＳ３２０において、候補領域設定部１２０は上記文書画像の解析結果から文字領域を抽出対象の候補となる領域として設定する。具体的な処理はステップＳ１２０と同様である。 Subsequently, in step S320, the candidate area setting unit 120 sets the character areas obtained from the analysis result of the document image as candidate areas for the extraction target. The specific process is the same as step S120.

続いて、ステップＳ３３０において、候補領域認識部１３０は上記候補領域にある文字列を認識し、認識情報を記録する。具体的な処理はステップＳ１３０と同様である。 Subsequently, in step S330, the candidate area recognition unit 130 recognizes a character string in the candidate area and records recognition information. The specific process is the same as step S130.

続いて、ステップＳ３４０において、抽出情報同定部１４０は抽出対象の特性に基づき、上記候補領域を絞る。具体的な処理はステップＳ２４０と同様である。 Subsequently, in step S340, the extraction information identification unit 140 narrows down the candidate region based on the characteristics of the extraction target. The specific process is the same as step S240.

続いて、ステップＳ３５０において、抽出情報同定部１４０は処理対象となる候補領域において、抽出対象らしさを計算し、抽出対象らしさの順番を付ける。すなわち、候補領域に対して処理の優先度を付与する。すなわち、抽出情報同定部１４０は第１抽出手段により抽出された領域に対して優先度を付与する付与手段の一例に相当する。抽出対象らしさの順番を付与する処理の詳細について後述する。 Subsequently, in step S350, the extraction information identification unit 140 calculates the likeness of the extraction target in the candidate region to be processed, and attaches the order of the likeness of the extraction target. That is, priority of processing is given to the candidate area. That is, the extraction information identification unit 140 corresponds to an example of an assigning unit that assigns a priority to the area extracted by the first extraction unit. Details of the process for giving the order of the extraction object likeness will be described later.

続いて、ステップＳ３６０において、抽出情報同定部１４０は上記候補領域の認識結果及び知識情報に基づき、ステップＳ３５０で決められる抽出対象らしさの順で、抽出対象領域を同定し、抽出対象中身を同定する。具体的な処理はステップＳ１４０と同様である。 Subsequently, in step S360, the extraction information identification unit 140 identifies the extraction target area, and then the content of the extraction target, in the order of extraction target likeness determined in step S350, based on the recognition results of the candidate areas and the knowledge information. The specific process is the same as step S140.

次に、ステップＳ３５０における候補領域の抽出対象らしさの計算処理方法について説明する。 Next, the calculation processing method of the candidate area extraction target likelihood in step S350 will be described.

文書の種別領域は基本的に文書画像のタイトルらしい領域に該当する。タイトルは基本的に文書の上に位置する、文字サイズが大きい、また、中心線に寄せるといった特徴を持つ。しかし、医用文書のフォーマットが多種多様なため、種別領域は必ずしも上述の特性を持つわけではない。ここで、これらの特性を用いて、以下の式で候補領域の種別らしさを総合的に求めるようにする。 The type area of a document basically corresponds to the area that looks like the title of the document image. A title is typically located near the top of the document, set in a large character size, and placed close to the center line. However, because medical document formats vary widely, the type area does not necessarily have all of these characteristics. These characteristics are therefore combined, using the following equation, to obtain an overall type likeness for each candidate area.
種別らしさ＝ｗ１＊｛文字サイズ｝＋ｗ２＊｛中心線との近さの逆数｝＋ｗ３＊｛上部にある領域数の逆数｝ （式１）
Type likeness = w1 * {character size} + w2 * {reciprocal of proximity to center line} + w3 * {reciprocal of the number of areas at the top} (Equation 1)
ここで、ｗ１、ｗ２、ｗ３は各要素の重み付けである。重要視される要素に高い数値の重みを付ける。ここで、「上部」とは例えば文書画像全体の上部１／３の範囲内を示すが、これに限定されるものではない。なお、式１に示した種別らしさを示す値は３つの項のうち少なくとも１つの項目を用いることとしてもよい。また、上部にある領域数を求めるためには候補領域の位置を利用する。すなわち付与手段の一例である抽出情報同定部１４０は、第１抽出手段により抽出された領域の位置および領域に含まれる文字の大きさの少なくとも１つに基づいて優先度を付与する。 Here, w1, w2, and w3 are the weights of the respective elements, and a larger weight is given to an element regarded as more important. "Upper part" means, for example, the upper one-third of the entire document image, but is not limited to this. The type likeness of Equation 1 may also be computed from at least one of the three terms. The position of each candidate area is used to obtain the number of areas at the top. That is, the extraction information identification unit 140, which is an example of the assigning unit, assigns a priority based on at least one of the position of the area extracted by the first extraction unit and the size of the characters included in the area.

なお、式１に示した種別らしさを示す値は３つの項により求められているが、４つ以上の項目を用いて種別らしさを算出することとしてもよい。また、例えば、上記種別らしさを示す値が大きい領域から抽出情報同定部１４０の処理対象とする。 Although the type likeness of Equation 1 is obtained from three terms, it may also be calculated using four or more terms. In addition, for example, the extraction information identification unit 140 processes the areas in descending order of their type likeness value.
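As a rough illustration of the weighted type-likeness score and the resulting processing order, the following sketch may help. It rests on several assumptions that the text does not fix: "proximity to the center line" is modeled as the horizontal distance between the region center and the page center, "number of areas at the top" is modeled as the count of candidate regions located above the region in question, and 1 is added to each denominator to avoid division by zero. The `Region` class and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical candidate area with the features used for scoring."""
    text: str
    char_size: float  # representative character height in pixels
    center_x: float   # horizontal center of the region
    top: float        # y-coordinate of the region's top edge


def type_likeness(region, regions, page_width, w1=1.0, w2=1.0, w3=1.0):
    """Weighted sum of three terms, under the assumptions stated above."""
    distance = abs(region.center_x - page_width / 2)
    regions_above = sum(1 for other in regions if other.top < region.top)
    return (w1 * region.char_size
            + w2 * (1.0 / (1.0 + distance))
            + w3 * (1.0 / (1.0 + regions_above)))


def rank_by_likeness(regions, page_width, **weights):
    """Return regions sorted so that the most title-like one is processed first."""
    return sorted(
        regions,
        key=lambda r: type_likeness(r, regions, page_width, **weights),
        reverse=True,
    )


sample = [
    Region(text="body", char_size=10, center_x=300, top=400),
    Region(text="title", char_size=24, center_x=500, top=50),
]
ranked = rank_by_likeness(sample, page_width=1000)
```

In practice the three terms have very different scales (character size in pixels versus reciprocals below 1), so the weights would need to be tuned or the terms normalized; the text leaves that choice open.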

本実施形態では、候補領域の抽出対象らしさ（優先度）を計算し、抽出対象らしさ順で抽出対象を同定するものであった。本実施形態では、種別抽出を例に、抽出対象らしさに関わる要素として文字サイズ、領域の位置、領域の数を用いたが、それ以外の特性を使ってもよい。 In the present embodiment, the extraction target likelihood (priority) of the candidate area is calculated, and the extraction target is identified in the extraction target likelihood order. In the present embodiment, the type extraction is taken as an example, and the character size, the position of the area, and the number of areas are used as elements related to the likeness to be extracted, but other characteristics may be used.

第４の実施形態によれば、第１、第３の実施形態による効果に加え、抽出対象領域の可能性の高い候補領域から処理することが可能になり、更に抽出処理の効率性を向上することができるようになる。 According to the fourth embodiment, in addition to the effects of the first and third embodiments, processing can start from the candidate areas most likely to be the extraction target area, which further improves the efficiency of the extraction process.

（第５の実施形態） Fifth Embodiment
次に、第５の実施形態について説明する。 Next, a fifth embodiment will be described.

上述した第１、第２、第３及び第４の実施形態では、医用文書から種別情報を抽出する例を主として説明した。第５の実施形態では、医療文書から診療科情報、或いは、患者情報を抽出するものである。 In the first, second, third and fourth embodiments described above, examples of extracting type information from a medical document have been mainly described. In the fifth embodiment, medical department information or patient information is extracted from a medical document.

ここで、第５の実施形態に係る情報処理システムのハードウェア構成は、図１に示す第１の実施形態に係る情報処理システムのハードウェア構成と同様であるため、その説明を省略する。また、第５の実施形態に係る情報処理システムの機能構成は、図２に示す第１の実施形態に係る情報処理システムの機能構成と同様であるため、その説明は省略する。さらに、第５の実施形態に係る情報処理方法の処理手順は、図３に示す第１の実施形態に係る情報処理方法のステップＳ１４０を除いて同様であるため、ステップＳ１１０〜１３０の説明を省略する。 Here, the hardware configuration of the information processing system according to the fifth embodiment is the same as that of the information processing system according to the first embodiment shown in FIG. 1, so its description is omitted. Likewise, the functional configuration of the information processing system according to the fifth embodiment is the same as that of the information processing system according to the first embodiment shown in FIG. 2, so its description is omitted. Furthermore, the processing procedure of the information processing method according to the fifth embodiment is the same as that of the first embodiment shown in FIG. 3 except for step S140, so the description of steps S110 to S130 is omitted.

種別抽出処理は基本的に種別領域の同定の後に、種別領域の中身による種別分類が必要なため、語尾情報による種別領域の同定、種別領域にある種別用語の抽出、種別同定の３ステップで処理される。診療科抽出は基本的に診療科名を抽出するためのものなので、診療科領域の同定、診療科領域にある診療科名の抽出の２ステップで処理する。患者情報の抽出は診療科抽出と同様である。 Because type extraction requires classifying the type from the contents of the type area after that area has been identified, it is processed in three steps: identification of the type area from word-ending information, extraction of the type terms in the type area, and identification of the type. Medical department extraction basically only needs to extract the department name, so it is processed in two steps: identification of the medical department area and extraction of the department name in that area. Extraction of patient information is performed in the same way as medical department extraction.

ここで、本実施形態のステップＳ１４０における診療科抽出の同定処理について説明する。 Here, the identification processing of the medical department extraction in step S140 of this embodiment will be described.

図１５は、本発明の第５の実施形態を示し、図３のステップＳ１４０における診療科の抽出処理の手順の一例を示すフローチャートである。 FIG. 15 shows the fifth embodiment of the present invention and is a flowchart showing an example of the procedure of the medical department extraction process in step S140 of FIG. 3.

先ず、ステップＳ４４０１では、候補領域設定部１２０は抽出情報同定部１４０に候補領域情報を入力する。 First, in step S4401, the candidate area setting unit 120 inputs candidate area information to the extraction information identification unit 140.

続いて、ステップＳ４４０２では、抽出情報同定部１４０は処理対象となる候補領域に診療科語尾辞書にある語尾があるかどうかを判断する。 Subsequently, in step S4402, the extraction information identification unit 140 determines whether the candidate area to be processed has a word end in the medical department word end dictionary.

語尾がある場合、ステップＳ４４０３では、抽出情報同定部１４０は当該候補領域を診療科領域として同定する。そして、ステップＳ４４０４では、抽出情報同定部１４０は当該領域に診療科用語辞書にある用語を診療科名として抽出する。 If there is an end, in step S4403, the extraction information identification unit 140 identifies the candidate area as a medical department area. Then, in step S4404, the extraction information identification unit 140 extracts the term in the medical treatment department term dictionary in the relevant area as a medical department name.

語尾がない場合、ステップＳ４４０５では、未処理の候補領域があるかどうかを判断する。未処理の候補領域があれば、上記ステップＳ４４０２からステップＳ４４０４までの処理を繰り返して実行する。未処理の候補領域がなければ、候補領域のなかから診療科に該当する領域がないとし、診療科情報がないと判断する。 If there is no ending, it is determined in step S4405 whether there is an unprocessed candidate area. If there is an unprocessed candidate area, the processing from step S4402 to step S4404 is repeatedly executed. If there is no unprocessed candidate area, it is determined that there is no area corresponding to the medical department from the candidate areas, and it is determined that there is no medical department information.
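The flow of steps S4402 to S4405 might be sketched as below. This is a simplified illustration under stated assumptions, not the method of the extraction information identification unit 140 itself: candidate areas are represented only by their recognized text, "has a word ending" is approximated by substring containment, and the dictionary contents are hypothetical samples. What happens when an area contains a dictionary ending but no dictionary term is not specified in the text; the sketch simply moves on to the next candidate.

```python
def extract_department(candidate_texts, ending_dictionary, term_dictionary):
    """Return the first department name found, or None if there is none.

    candidate_texts: recognized strings of the candidate areas, in processing order.
    ending_dictionary / term_dictionary: stand-ins for the medical department
    word-ending dictionary and term dictionary.
    """
    for text in candidate_texts:  # loop closed by the check in S4405
        # S4402: does the area contain an ending from the ending dictionary?
        if not any(ending in text for ending in ending_dictionary):
            continue
        # S4403: the area is treated as the medical department area.
        # S4404: extract a dictionary term found in that area.
        for term in term_dictionary:
            if term in text:
                return term
    # No candidate area matched: no medical department information.
    return None


endings = ["科"]                      # hypothetical sample entries
terms = ["内科", "外科", "眼科"]       # hypothetical sample entries
found = extract_department(["患者氏名 山田太郎", "診療科: 眼科"], endings, terms)
missing = extract_department(["申込書"], endings, terms)
```

Swapping in a different pair of dictionaries is all that is needed to reuse the same loop for patient information.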

本実施形態では、種別抽出の他、文書画像から診療科情報、或いは、患者情報を抽出するものであった。抽出対象に応じて、知識情報を置き換えればよい。 In the present embodiment, in addition to type extraction, medical department information or patient information is extracted from a document image. The knowledge information may be replaced according to the extraction target.

第５の実施形態によれば、第１、第２、第４の実施形態による効果に加え、種別情報以外の情報抽出も可能になる。 According to the fifth embodiment, in addition to the effects of the first, second, and fourth embodiments, extraction of information other than type information is also possible.

（第６の実施形態） Sixth Embodiment
次に、第６の実施形態について説明する。 Next, a sixth embodiment will be described.

上述した第１、第２、第３、第４及び第５の実施形態では、種別、診療科、患者情報のうち１種類の情報のみを抽出する例を主として説明した。第６の実施形態では、文書画像から複数の情報を抽出する場合を説明する。 In the first, second, third, fourth and fifth embodiments described above, an example in which only one type of information is extracted among the type, medical department, and patient information has been mainly described. In the sixth embodiment, the case of extracting a plurality of pieces of information from a document image will be described.

ここで、第６の実施形態に係る情報処理システムのハードウェア構成は、図１に示す第１の実施形態に係る情報処理システムのハードウェア構成と同様であるため、その説明を省略する。また、第６の実施形態に係る情報処理システムの機能構成は、図２に示す第１の実施形態に係る情報処理システムの機能構成と同様であるため、その説明は省略する。 Here, the hardware configuration of the information processing system according to the sixth embodiment is the same as that of the information processing system according to the first embodiment shown in FIG. 1, so its description is omitted. Likewise, the functional configuration of the information processing system according to the sixth embodiment is the same as that of the information processing system according to the first embodiment shown in FIG. 2, so its description is omitted.

次に、本実施形態に係る情報処理システムによる情報処理方法の処理手順について説明する。 Next, the processing procedure of the information processing method by the information processing system according to the present embodiment will be described.

図１６は、本発明の第６の実施形態に係る情報処理システムによる情報処理方法の処理手順の一例を示すフローチャートである。 FIG. 16 is a flowchart showing an example of the processing procedure of the information processing method by the information processing system according to the sixth embodiment of the present invention.

まず、ステップＳ５１０では、文書画像解析部１１０は紙文書の電子化された文書画像を分割する。具体的な処理はステップＳ１１０と同様である。 First, in step S510, the document image analysis unit 110 divides the digitized document image of the paper document. The specific process is the same as step S110.

続いて、ステップＳ５２０では、候補領域設定部１２０は上記領域分割の結果から抽出対象の候補領域を設定する。具体的な処理はステップＳ１２０と同様である。 Subsequently, in step S520, the candidate area setting unit 120 sets a candidate area to be extracted from the result of the area division. The specific process is the same as step S120.

続いて、ステップＳ５３０では、候補領域認識部１３０は上記候補領域にある文字列を認識し、認識情報を記録する。具体的な処理はステップＳ１３０と同様である。 Subsequently, in step S530, the candidate area recognition unit 130 recognizes the character string in the candidate area and records the recognition information. The specific process is the same as step S130.

続いて、ステップＳ５４０では、抽出情報同定部１４０は、図１７に示す情報を参照することで抽出対象が構造上の特性があるかどうかを判断する。 Subsequently, in step S540, the extraction information identification unit 140 determines whether the extraction target has a structural characteristic by referring to the information shown in FIG. 17.

特性があると判断される場合、ステップＳ５５０では、抽出情報同定部１４０は抽出対象の特性に基づき候補領域を絞る。例えば、構造上の特性を有する種別情報を抽出する場合には抽出情報同定部１４０は候補領域を文書画像の上部に存在する候補領域に絞り込む。具体的な処理はステップＳ３４０と同様である。ここで、「上部」とは例えば文書画像全体の上部１／３の範囲内を示すが、これに限定されるものではない。 If it is determined that there is a characteristic, in step S550, the extraction information identification unit 140 narrows the candidate area based on the characteristic of the extraction target. For example, when extracting type information having structural characteristics, the extraction information identification unit 140 narrows down the candidate areas to candidate areas present at the top of the document image. The specific process is the same as step S340. Here, “upper part” indicates, for example, the range of the upper one-third of the entire document image, but is not limited thereto.

続いて、ステップＳ５６０では、抽出情報同定部１４０は図１７に示す情報に基づいて抽出対象に応じて知識情報を切り替える。 Subsequently, in step S560, the extraction information identification unit 140 switches the knowledge information according to the extraction target, based on the information shown in FIG. 17.

続いて、ステップＳ５７０では、抽出情報同定部１４０は上記候補領域の認識結果及び知識情報に基づき抽出対象を同定する。具体的な処理はステップＳ１４０と同様である。なお、操作者が抽出対象を示す情報を登録部１に入力することで登録部１が抽出対象を把握できるようにしてもよいし、登録部１が所定の順序で抽出対象を自動的に切換えることで登録部１が抽出対象を把握することとしてもよい。 Subsequently, in step S570, the extraction information identification unit 140 identifies an extraction target based on the recognition result of the candidate area and the knowledge information. The specific process is the same as step S140. The operator may input information indicating the extraction target to the registration unit 1 so that the registration unit 1 can grasp the extraction target, or the registration unit 1 automatically switches the extraction target in a predetermined order. Thus, the registration unit 1 may grasp the extraction target.

次に、抽出対象の構造上の特性有無、抽出対象の知識管理の一例について説明する。 Next, an example of presence / absence of structural characteristics of the extraction target and knowledge management of the extraction target will be described.

図１７は、本発明の第６の実施形態を示し、図１６に関わる抽出対象の構造上の特性有無、抽出対象の知識管理の一例を示す模式図である。 FIG. 17 shows the sixth embodiment of the present invention and is a schematic view showing an example of the presence or absence of structural characteristics of the extraction targets, and of the knowledge management for the extraction targets, relating to FIG. 16.

１４０１は抽出対象の構造上の特性有無の管理表で、抽出対象は構造上の特性があるかどうかを記録するものである。種別情報は基本的に文書画像の上部にあるので、構造上の特性があるものとする。診療科情報と患者情報は文書画像のどこにも記述される可能性があるので、構造上の特性がないものとする。 Reference numeral 1401 denotes a management table for the presence or absence of structural characteristics of the extraction target. The extraction target records whether or not there is a structural characteristic. Since the type information is basically at the top of the document image, it is assumed that there is a structural characteristic. Since medical department information and patient information may be described anywhere in the document image, there is no structural feature.

１４０２は抽出対象の知識管理表で、抽出対象の抽出に必要な知識を管理するものである。種別抽出に種別抽出用の語尾辞書１、用語辞書１、更に分類に必要となる分類辞書１を用いる。診療科抽出に診療科抽出用の語尾辞書２、用語辞書２を用いる。患者情報抽出に患者情報抽出用の語尾辞書３を用いる。 Reference numeral 1402 denotes a knowledge management table for the extraction targets, which manages the knowledge needed to extract each target. Type extraction uses ending dictionary 1 and term dictionary 1 for type extraction, plus classification dictionary 1 needed for classification. Medical department extraction uses ending dictionary 2 and term dictionary 2 for medical department extraction. Patient information extraction uses ending dictionary 3 for patient information extraction.
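The two management tables 1401 and 1402 can be pictured as simple lookup tables that drive steps S540 to S560. The sketch below is illustrative only: the key names (`"type"`, `"department"`, `"patient"`), the dictionary identifiers written as strings, and the `plan_extraction` function are assumptions introduced for the example.

```python
# Stand-in for table 1401: does the extraction target have a structural characteristic?
STRUCTURAL_CHARACTERISTIC = {
    "type": True,         # type information is basically at the top of the page
    "department": False,  # may appear anywhere in the document image
    "patient": False,     # may appear anywhere in the document image
}

# Stand-in for table 1402: which knowledge each extraction target needs.
KNOWLEDGE = {
    "type": {
        "ending": "ending_dictionary_1",
        "term": "term_dictionary_1",
        "classification": "classification_dictionary_1",
    },
    "department": {
        "ending": "ending_dictionary_2",
        "term": "term_dictionary_2",
    },
    "patient": {
        "ending": "ending_dictionary_3",
    },
}


def plan_extraction(target):
    """Decide whether to narrow candidates (S540/S550) and which knowledge to load (S560)."""
    return {
        "narrow_candidates": STRUCTURAL_CHARACTERISTIC[target],
        "knowledge": KNOWLEDGE[target],
    }


type_plan = plan_extraction("type")
patient_plan = plan_extraction("patient")
```

Keeping the two tables separate mirrors the description; merging them into one table keyed by extraction target would work equally well.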

本実施形態では、複数の情報を抽出する場合、抽出対象の情報に応じて構造情報による候補領域の設定処理、抽出対象の同定処理に用いる知識情報を切り替えて行うものである。また、本実施形態では、抽出対象は構造上に特性がある場合、抽出対象の構造上の特性に基づき候補領域の絞込み処理を行うが、更に抽出対象の構造上の特性に基づき抽出対象らしさを計算し順位付け処理を行ってもよい。また、本実施形態では、複数の抽出情報の知識を別々に管理するものであったが、知識をまとめて管理してもよい。 In the present embodiment, when a plurality of pieces of information are extracted, the structure-based candidate area setting and the knowledge information used for identifying the extraction target are switched according to the information to be extracted. When the extraction target has a structural characteristic, the candidate areas are narrowed down based on that characteristic; in addition, the extraction target likeness may be calculated from the same characteristic and used to rank the candidate areas. Although the knowledge for the different extraction targets is managed separately in the present embodiment, it may also be managed collectively.

第６の実施形態によれば、第１、第２、第３、第５の実施形態による効果に加え、複数の情報を抽出する場合、情報の特性を考慮する情報抽出の効率化が実現可能になる。 According to the sixth embodiment, in addition to the effects of the first, second, third, and fifth embodiments, information extraction can be made more efficient when a plurality of pieces of information are extracted, because the characteristics of each piece of information are taken into account.

なお、上述した第１、第２、第３、第４、第５及び第６の実施形態では、文書画像の解析結果から文字領域を抽出対象の候補領域として設定するであった。しかし、文字領域のみならず、所定範囲以内でその他の属性領域を抽出対象の候補領域として広く設定してもよい。また、上述した第１、第２、第３、第４及び第６の実施形態では、候補領域の文字認識及び知識に基づき抽出対象領域を同定し、抽出情報を同定するものであったが、候補領域の文字認識の結果を補正し、補正情報及び知識に基づき抽出対象を同定してもよい。 In the first, second, third, fourth, fifth and sixth embodiments described above, the character area is set as a candidate area to be extracted from the analysis result of the document image. However, not only the character area but also other attribute areas may be widely set as candidate areas to be extracted within a predetermined range. In the first, second, third, fourth and sixth embodiments described above, the extraction target area is identified based on the character recognition and knowledge of the candidate area, and the extraction information is identified. The character recognition result of the candidate area may be corrected, and the extraction target may be identified based on the correction information and the knowledge.

（第７の実施形態） Seventh Embodiment
次に、第７の実施形態について説明する。 Next, a seventh embodiment will be described.

上述した第１、第２、第３、第４、第５及び第６の実施形態では、文書画像の解析により抽出対象となる情報を抽出するものであった。第７の実施形態では、院内システム（例えば、電子カルテシステム）に格納される診療情報及び文書画像の両方を解析し情報を抽出するものである。 In the first, second, third, fourth, fifth and sixth embodiments described above, the information to be extracted is extracted by analyzing the document image. In the seventh embodiment, both medical information and document images stored in a hospital system (for example, an electronic medical record system) are analyzed to extract information.

ここで、第７の実施形態に係る情報処理システムのハードウェア構成は、図１に示す第１の実施形態に係る情報処理システムのハードウェア構成と同様であるため、その説明を省略する。また、第７の実施形態に係る情報処理システムの機能構成は、図２に示す第１の実施形態に係る情報処理システムの機能構成と同様であるため、その説明は省略する。 Here, the hardware configuration of the information processing system according to the seventh embodiment is the same as that of the information processing system according to the first embodiment shown in FIG. 1, so its description is omitted. Likewise, the functional configuration of the information processing system according to the seventh embodiment is the same as that of the information processing system according to the first embodiment shown in FIG. 2, so its description is omitted.

図１８は、本発明の第７の実施形態に係る情報処理システムによる情報処理方法の処理手順の一例を示すフローチャートである。 FIG. 18 is a flowchart showing an example of the processing procedure of the information processing method by the information processing system according to the seventh embodiment of the present invention.

まず、ステップＳ６１０では、抽出情報同定部１４０は文書画像から患者番号を抽出する。患者番号の抽出処理は上記第５の実施形態を使用することができる。 First, in step S610, the extraction information identification unit 140 extracts a patient number from the document image. The patient number extraction process can use the fifth embodiment.

続いて、ステップＳ６２０では、抽出情報同定部１４０は電子カルテシステムから当該患者の関連情報を取り出す。関連情報は種別分類に関わるものとする。関連情報の詳細については後述する。 Subsequently, in step S620, the extraction information identification unit 140 extracts the relevant information of the patient from the electronic medical record system. Relevant information shall be relevant to classification. Details of the related information will be described later.

続いて、ステップＳ６３０では、抽出情報同定部１４０は種別分類の関連情報があるかどうかを確認する。関連情報があれば、ステップＳ６４０では、関連情報を用いて種別分類を絞る。関連情報がなければ、ステップＳ６５０に入る。 Subsequently, in step S630, the extraction information identification unit 140 confirms whether there is related information of type classification. If there is related information, in step S640, the type classification is narrowed down using the related information. If there is no relevant information, step S650 is entered.

続いて、ステップＳ６５０では、種別分類から種別を同定する。種別の抽出処理は上記第１、第２、第４の実施形態の何れかを使用することができる。 Subsequently, in step S650, the type is identified from the type classification. The type extraction process can use any one of the first, second and fourth embodiments.

次に、本実施形態に係る情報処理システムによる情報処理の一例について説明する。 Next, an example of information processing by the information processing system according to the present embodiment will be described.

図１９は、本発明の第７の実施形態を示し、図１８の情報処理の一例を示す模式図である。 FIG. 19 shows the seventh embodiment of the present invention and is a schematic view showing an example of the information processing of FIG. 18.

１６０１は、電子カルテシステムにおける診療情報の構造情報の記述例である。基本情報に患者情報、診察日、初診か再診を含む。また、診療情報としてＳ（主訴）Ｏ（所見）Ａ（検査）Ｐ（計画）が含まれる。 Reference numeral 1601 is an example of description of structure information of medical care information in the electronic medical record system. Basic information includes patient information, examination date, and first visit or reexamination. Moreover, S (main complaint) O (finding) A (examination) P (plan) is included as medical treatment information.

１６０２は、電子カルテの診療情報に含まれる種別分類に関わる関連情報例である。基本情報の中に、例えば、初診、或いは、再診といった用語が挙げられる。また、診療情報の中に、例えば、手術予定、或いは、入院治療といった用語が挙げられる。 Reference numeral 1602 denotes an example of related information related to type classification included in medical care information of an electronic medical record. The basic information includes, for example, terms such as a first visit or a revisit. Further, among the medical care information, for example, terms such as scheduled surgery or hospitalization treatment may be mentioned.

１６０３は、本来種別抽出処理に用いる分類辞書である。 Reference numeral 1603 denotes a classification dictionary originally used for the type extraction process.

基本情報から種別分類に関わる用語を抽出し、種別分類候補を絞込む処理例では、先ず、１６０１から「初診」という関連情報が抽出される。「初診」の場合、文書画像が同意書や記録・報告などの種別の可能性がないので、それを種別候補から除外する。そして、「初診」と関連付け可能な種別番号「０１」、「１０」から種別を判定し、分類する。 In the processing example of extracting terms relating to type classification from basic information and narrowing down type classification candidates, first, related information of “first visit” is extracted from 1601. In the case of "first visit", the document image has no possibility of the type such as a written consent or a record / report, so it is excluded from the type candidates. Then, the type is determined from the type numbers “01” and “10” that can be associated with the “first visit” and classified.
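The narrowing of type classification candidates by related information (steps S630 to S650) could look roughly like the following. The contents of classification dictionary 1603 are not given in this passage, so every label below is a hypothetical placeholder; only the association of 「初診」 with type numbers "01" and "10" comes from the example above. Taking the union of the codes when several related terms are found is also an assumption.

```python
# Hypothetical stand-in for classification dictionary 1603 (labels are placeholders).
CLASSIFICATION_DICTIONARY = {
    "01": "type A (placeholder)",
    "02": "consent form (placeholder code)",
    "03": "record/report (placeholder code)",
    "10": "type B (placeholder)",
}

# Related term -> type numbers it can be associated with.
RELATED_TERM_TO_CODES = {
    "初診": {"01", "10"},  # from the example; other terms would be added similarly
}


def narrow_classifications(related_terms,
                           dictionary=CLASSIFICATION_DICTIONARY,
                           mapping=RELATED_TERM_TO_CODES):
    """Return the classification entries still possible given the related terms."""
    allowed = set()
    for term in related_terms:
        allowed |= mapping.get(term, set())
    if not allowed:
        # S630: no related information, so S650 works on the full dictionary.
        return dict(dictionary)
    # S640: keep only the classifications associated with the related terms.
    return {code: label for code, label in dictionary.items() if code in allowed}


first_visit = narrow_classifications(["初診"])
no_info = narrow_classifications([])
```

The type identification of step S650 would then run against the returned subset instead of the whole dictionary.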

また、診療情報から種別分類に関わる用語を抽出する場合は、上記と同様に、抽出される関連用語に対応する範囲の種別分類から文書画像の種別を同定する。 Further, when extracting a term relating to type classification from medical care information, the type of document image is identified from the type classification of the range corresponding to the extracted related term in the same manner as described above.

本実施形態では、電子カルテシステムから抽出情報と関連する内容を取り出し、抽出情報候補を絞るものである。本実施形態では、電子カルテシステムの利用を例にしたが、それ以外の関連システムと連携してもよい。また、本実施形態では、種別抽出に関連する情報を例に挙げたが、それ以外の関連情報を設定してもよい。また、本実施形態では、種別抽出を例に説明したが、診療科抽出、或いは、それ以外の情報抽出にしてもよい。さらに、本実施形態では、関連情報により種別分類候補を絞り、可能性のある種別分類から種別を同定するものであった。しかし、第１、第２、第３、第４、第５の実施例のように、種別分類を先に同定し、関連情報から絞った種別分類で抽出結果の確認を行う処理方法にしてもよい。 In the present embodiment, content related to the extraction information is retrieved from the electronic medical record system and used to narrow down the extraction information candidates. Although the electronic medical record system is used as an example, the system may cooperate with other related systems. Information related to type extraction is given as an example, but other related information may be set. Type extraction is described as an example, but the same approach may be applied to medical department extraction or to the extraction of other information. Furthermore, in the present embodiment the type classification candidates are first narrowed down using the related information, and the type is then identified from the remaining classifications. Alternatively, as in the first to fifth embodiments, the type classification may be identified first, and the extraction result may then be checked against the type classifications narrowed down from the related information.

第７の実施形態によれば、第１、第２、第３、第４、第６の実施形態による効果に加え、関連システムと連携した情報抽出仕組みの実現が可能になる。 According to the seventh embodiment, in addition to the effects of the first, second, third, fourth and sixth embodiments, it is possible to realize an information extraction mechanism in cooperation with a related system.

（第８の実施形態） Eighth Embodiment
次に、第８の実施形態について説明する。 Next, an eighth embodiment will be described.

上述した第１、第２、第３、第４、第５、第６及び第７の実施形態では、医用向け非定型文書を対象に種別情報等を自動的に情報を抽出するものであった。第８の実施形態では、一般分野の非定型文書における情報抽出に関するものである。 In the first to seventh embodiments described above, information such as type information is automatically extracted from non-standard-format medical documents. The eighth embodiment relates to information extraction from non-standard-format documents in general fields.

例えば、銀行の場合は、口座開設をはじめ、融資取組や、住宅ローンなどの業務に関連するドキュメントとデータのキャプチャは、基本的は手作業で行うのが現状である。例えば、米ドル建ての外国送金の場合では、米国ＯＦＡＣ規制により、取引の関係当事者の所在地に禁止取引国、また、問題のある法人・個人等が含まれているかどうかを確認する作業は非常に手間がかかるため、業務の効率化のサポートが必要である。 For example, at banks, documents and data related to operations such as account opening, lending, and mortgages are currently captured basically by hand. In the case of US dollar-denominated foreign remittances, for instance, US OFAC regulations require checking whether the locations of the parties to a transaction include a prohibited country and whether any problematic corporations or individuals are involved. This check is very time-consuming, so support for making the work more efficient is needed.

ここで、業務効率の向上に、様々なフォーマットを有するドキュメントから必要な情報を自動的に抽出し、ドキュメントを分類する第８の実施形態として挙げる。第８の実施形態に係る情報処理システムのハードウェア構成は、図１に示す第１の実施形態に係る情報処理システムのハードウェア構成と同様であるため、その説明を省略する。また、第８の実施形態に係る情報処理システムの機能構成は、図２に示す第１の実施形態に係る情報処理システムの機能構成と同様であるため、その説明は省略する。また、第８の実施形態に係る情報処理方法の処理手順は、図３に示す第１の実施形態に係る情報処理方法のステップＳ１４０を除いて同様であるため、ステップＳ１１０〜１３０の説明は省略する。 Here, to improve work efficiency, the eighth embodiment automatically extracts the necessary information from documents in various formats and classifies the documents. The hardware configuration of the information processing system according to the eighth embodiment is the same as that of the information processing system according to the first embodiment shown in FIG. 1, so its description is omitted. Likewise, the functional configuration of the information processing system according to the eighth embodiment is the same as that of the information processing system according to the first embodiment shown in FIG. 2, so its description is omitted. The processing procedure of the information processing method according to the eighth embodiment is also the same as that of the first embodiment shown in FIG. 3 except for step S140, so the description of steps S110 to S130 is omitted.

Next, the knowledge-based identification of extraction targets in step S140 will be described.

FIG. 20 shows the eighth embodiment of the present invention and is a flowchart showing an example of the procedure, in step S140 of FIG. 3, for identifying extraction targets based on knowledge and supporting the work of checking whether a transaction is subject to regulation.

First, in step S7401, the candidate area setting unit 120 inputs candidate area information to the extraction information identification unit 140.

Subsequently, in steps S7402 to S7406, the extraction information identification unit 140 checks whether the candidate information matches the contents of the basic extraction items and judges whether the transaction is subject to regulation. The details are described below.

In step S7402, the extraction information identification unit 140 takes out a basic extraction item n. Then, in step S7403, it takes out a content m corresponding to basic extraction item n.

Then, in step S7404, it is checked whether the candidate information contains anything that matches content m of basic extraction item n. If there is a match, the document is judged to need closer scrutiny, and processing proceeds to step S7407. If there is no match, processing proceeds to step S7405, where it is checked whether all contents of basic extraction item n have been checked. If unchecked contents remain, processing returns to step S7403, and steps S7403 to S7404 are repeated. Once all contents of basic extraction item n have been checked, it is checked in step S7406 whether all basic extraction items have been checked. If unchecked basic extraction items remain, processing returns to step S7402, and steps S7402 to S7406 are repeated. If nothing matches any content of any basic extraction item, processing proceeds to step S7412, and the document image is judged not to be subject to regulation.
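
The screening loop of steps S7402 to S7406 can be sketched in Python as follows. This is only an illustrative sketch, not the implementation of the embodiment: the function name `needs_scrutiny`, the dictionary layout of the knowledge, and matching by exact string comparison after case folding are all assumptions.

```python
from typing import Dict, Iterable, List


def needs_scrutiny(candidates: Iterable[str],
                   basic_items: Dict[str, List[str]]) -> bool:
    """Return True if any candidate matches any content of any basic extraction item.

    candidates  -- recognized strings from the candidate areas (hypothetical input format)
    basic_items -- maps a basic extraction item n to its contents m (hypothetical layout)
    """
    # Normalize once so that matching ignores case and surrounding whitespace (an assumption).
    normalized = {c.strip().casefold() for c in candidates}
    for item, contents in basic_items.items():            # S7402 / S7406: loop over items n
        for content in contents:                          # S7403 / S7405: loop over contents m
            if content.strip().casefold() in normalized:  # S7404: a match was found
                return True                               # go on to the scrutiny of S7407
    return False                                          # no match at all: S7412


print(needs_scrutiny(["USD", "1,000"], {"remittance currency": ["USD"]}))
```

Returning as soon as one content matches mirrors the flowchart, in which a single match in step S7404 is enough to send the document to scrutiny.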

Steps S7407 to S7413 are the scrutiny processing performed when step S7404 finds a match with the contents of a basic extraction item. The details are described below.

In step S7407, an extraction item n' is taken out. Then, in step S7408, a content m' corresponding to extraction item n' is taken out.

Then, in step S7409, it is checked whether the candidate information contains anything that matches content m' of extraction item n'. If there is a match, processing proceeds to step S7413, and the document is judged to be subject to regulation. If there is no match, processing proceeds to step S7410, where it is checked whether all contents of extraction item n' have been checked. If unchecked contents remain, processing returns to step S7408, and steps S7408 to S7409 are repeated. Once all contents of extraction item n' have been checked, it is checked in step S7411 whether all extraction items have been checked. If unchecked extraction items remain, processing returns to step S7407, and steps S7407 to S7411 are repeated. If nothing matches any content of any extraction item, processing proceeds to step S7412, and the document image is judged not to be subject to regulation.
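
The overall two-stage decision of FIG. 20, screening followed by scrutiny, can be sketched as below. This is a sketch under stated assumptions rather than the embodiment's implementation: the names `classify_document` and `_matches`, the result strings, the dictionary layout, and the placeholder entries "Country X" and "Company Y" are all hypothetical.

```python
from typing import Dict, Iterable, List, Set

Knowledge = Dict[str, List[str]]  # item name -> list of contents (hypothetical layout)


def _matches(normalized: Set[str], items: Knowledge) -> bool:
    """True if any content of any item appears among the normalized candidates."""
    return any(content.strip().casefold() in normalized
               for contents in items.values()
               for content in contents)


def classify_document(candidates: Iterable[str],
                      basic_items: Knowledge,
                      extraction_items: Knowledge) -> str:
    """Two-stage check: basic extraction items first, then extraction items."""
    normalized = {c.strip().casefold() for c in candidates}
    if not _matches(normalized, basic_items):
        return "not regulated"   # S7412, reached from the screening stage
    if _matches(normalized, extraction_items):
        return "regulated"       # S7413
    return "not regulated"       # S7412, reached from the scrutiny stage


# Placeholder knowledge; the entries are invented for illustration only.
basic = {"remittance currency": ["USD"]}
extraction = {"country name": ["Country X"], "party": ["Company Y"]}
print(classify_document(["USD", "Country X"], basic, extraction))
```

The cheap screening stage runs first so that the longer lists of the scrutiny stage are consulted only for documents that pass it.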

FIG. 21 shows the eighth embodiment of the present invention and is a schematic diagram showing an example of the information processing of FIG. 20.

1801 is an example of a form used for overseas remittance. The items to be checked for transaction regulation are the remittance currency, the country name, and the parties to the transaction, each circled with an ellipse.

1802 is an example of the knowledge used to check whether a transaction is subject to regulation. The knowledge information consists of basic extraction items 18030, extraction items 18040, a content list 18031 for the items included in the basic extraction items, and content lists 18041 and 18042 for the items included in the extraction items. For example, basic extraction item 01 "remittance currency" in 18030 has the content number "0101", and its content is set to "USD". Extraction item 11 "country name" in 18040, for example, has multiple corresponding contents, which are listed in order. Descriptions in languages other than Japanese are also recorded in association with each content.
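
One way to hold the knowledge of 1802 in memory is sketched below. It is a minimal sketch, not the data format of the embodiment: the class names `Content` and `Item`, the language codes, the content numbers "1101" and "1102", and the placeholder country names are assumptions. Only item numbers "01" and "11", the content number "0101", and the content "USD" come from the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Content:
    number: str             # content number, e.g. "0101"
    texts: Dict[str, str]   # language code -> written form (codes are an assumption)


@dataclass
class Item:
    number: str             # item number, e.g. "01"
    name: str               # item name, e.g. "remittance currency"
    basic: bool             # True for a basic extraction item, False for an extraction item
    contents: List[Content] = field(default_factory=list)

    def surface_forms(self) -> List[str]:
        """All written forms of all contents, in list order, across languages."""
        return [text for content in self.contents for text in content.texts.values()]


# Basic extraction item 01 with content 0101 = "USD", as in the description.
remittance_currency = Item("01", "remittance currency", True,
                           [Content("0101", {"en": "USD"})])

# Extraction item 11 with several contents; the numbers and names are placeholders.
country_name = Item("11", "country name", False, [
    Content("1101", {"ja": "国名A", "en": "Country A"}),
    Content("1102", {"ja": "国名B", "en": "Country B"}),
])

print(country_name.surface_forms())
```

Keeping the written forms of each content per language lets a matcher compare a recognized string against every recorded description of the same content.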

In the information processing described above, a document that contains the content "USD" of the basic extraction item "remittance currency" becomes a target for scrutiny. It is then further checked whether the document matches the list of names of countries with which transactions are prohibited, or the list of problematic corporations and individuals, both of which are set as extraction items.

The present embodiment uses knowledge of financial operations to extract information automatically from financial forms. Overseas remittance was taken as the example of a financial operation, but the embodiment may also be applied to other operations that involve document images. In the example of automating overseas remittance, the items to be checked were managed separately as basic extraction items and extraction items, but they may be managed together, or some other structure may be used.

According to the eighth embodiment, the proposed architecture can be applied to operations other than medical ones by replacing the knowledge needed for information extraction with that of the targeted field.

The first through eighth embodiments described above extract information from scanned document images, but images captured by a camera may also be used as the target of the information extraction processing. In that case, image correction processing for camera input images may be added.

(Other Embodiments)
It goes without saying that the object of the present invention is also achieved by supplying a system or apparatus with a computer-readable storage medium on which the program code of software that implements the functions of the embodiments described above is recorded. It likewise goes without saying that the object is achieved when the computer (or CPU or MPU) of the system or apparatus reads out and executes the program code stored in the storage medium.

In this case, the program code read out from the storage medium itself implements the functions of the embodiments described above, and the storage medium storing the program code constitutes the present invention.

As the storage medium for supplying the program code, for example, a flexible disk, a hard disk, an optical disk, a magneto-optical disk, a CD-ROM, a CD-R, a non-volatile memory card, or a ROM can be used.

The functions of the embodiments described above are implemented when the computer executes the program code it has read out. It goes without saying that this also covers the case in which an OS (operating system) or the like running on the computer performs part or all of the actual processing based on the instructions of the program code, and the embodiments described above are implemented by that processing.

The embodiments described above may also be combined.

110 document image analysis unit
120 candidate area setting unit
130 candidate area recognition unit
140 extraction information identification unit
150 registration unit

Claims

1. An information processing apparatus comprising:
first extraction means for extracting a plurality of areas from imaged medical document data;
second extraction means for extracting an area including a first character from the plurality of areas; and
third extraction means for extracting medical department information of the medical document data from the area extracted by the second extraction means,
wherein the first character includes "family".

2. The information processing apparatus according to claim 1, further comprising classifying means for classifying the medical document data by using the information extracted by the third extraction means.

3. The information processing apparatus according to claim 1 or claim 2, further comprising area selecting means for selecting an area to be processed by the second extraction means based on information on the areas extracted by the first extraction means,
wherein the second extraction means extracts an area including the first character from the selected area.

4. The information processing apparatus according to claim 3, wherein the information on the areas is information indicating at least one of the position of an area extracted by the first extraction means and the characters included in the area extracted by the first extraction means, and
the area selecting means selects an area extracted by the first extraction means that lies within a predetermined range in the document data and in which the number of lines of the included characters is equal to or less than a fourth threshold.

5. The information processing apparatus according to any one of claims 1 to 4, further comprising assigning means for assigning a priority to each area extracted by the first extraction means,
wherein the second extraction means extracts an area including the first character in an order based on the priority.

6. The information processing apparatus according to claim 5, wherein the assigning means assigns the priority based on at least one of the position of the area extracted by the first extraction means and the size of the characters included in the area.

7. The information processing apparatus according to claim 2, wherein the classifying means classifies the medical document data by using information obtained from an electronic medical record of the same patient as the medical document data and the information extracted by the third extraction means.

8. The information processing apparatus according to claim 7, wherein the information obtained from the electronic medical record is information indicating whether or not the visit is a first visit.

9. An information processing method comprising:
a first extraction step of extracting a plurality of areas from imaged medical document data;
a second extraction step of extracting an area including a first character from the plurality of areas; and
a third extraction step of extracting medical department information of the medical document data from the area extracted in the second extraction step,
wherein the first character includes "family".

10. A program causing a computer to execute each step of the information processing method according to claim 9.

11. A storage medium storing the program according to claim 10.