JP2007034871A

JP2007034871A - Character input apparatus and character input apparatus program

Info

Publication number: JP2007034871A
Application number: JP2005219916A
Authority: JP
Inventors: Akira Nakamura; 明中村
Original assignee: Sanyo Electric Co Ltd
Current assignee: Sanyo Electric Co Ltd
Priority date: 2005-07-29
Filing date: 2005-07-29
Publication date: 2007-02-08

Abstract

PROBLEM TO BE SOLVED: To provide a document input apparatus and a program for it that can present the character strings a user wants to input, such as words and expressions, near the top of the prediction candidates, so that the user can select the desired words and the like from the prediction candidates without complicated operations. SOLUTION: A context dictionary generation part 202 generates context dictionaries on the basis of the documents stored in a document set and outputs them to a context dictionary set 203. An input part 204 determines whether a character string has been input by the user. A context dictionary selection part 205 selects the most suitable context dictionary from the context dictionary set 203 according to the content or context of the document being input. A character string conversion processing part 206 predicts the character string the user wants to input on the basis of the character string input by the user, and extracts prediction candidates from the selected context dictionary. An output part 207 displays the prediction candidates on a display 104. COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a character input device and a character input device program for creating documents by entering characters, for example with the word processor function of a personal computer.

In recent years, opportunities to create documents with the word processor functions of personal computers and similar devices have been increasing. For example, in addition to conventional document work such as business documents and papers, electronic medical records have been introduced at hospitals and other medical institutions, so doctors and nurses now create documents more often.

Under these circumstances, it is desirable to reduce the user's input burden, for example by allowing even people who are not skilled at keyboard operation to enter information accurately with few keystrokes.

As one solution to this need, a method has been proposed in which, when the beginning of a character string's “reading” is entered, that reading is used as a key to search a dictionary for character strings the user is likely to want, and the results are presented, thereby reducing the user's input burden.

However, this method merely extracts and presents, from the words and expressions registered in a simple dictionary, the character strings that begin with that “reading,” for example. As the number of registered words and expressions grows, a large number of prediction candidates are presented, and finding and selecting the desired candidate among them becomes more cumbersome rather than less.

One method for solving this problem is described in Patent Document 1 below. To reduce the burden of character input when composing mail with a mobile phone's mail function or entering characters in other applications, Patent Document 1 proposes providing a plurality of dictionaries and switching the dictionary in use according to the application running on the mobile phone or the attribute of the position where characters are entered (for example, the address field when composing mail, in which case a name dictionary is given priority). However, this method still cannot solve the above problem when, for example, a document is created by entering characters in a single application, such as the word processor function of a personal computer.

Patent Document 1: JP 2001-325252 A

Accordingly, an object of the present invention is to provide a document input device and a program for it that, when a document is created using the word processor function of a personal computer or the like, can present the character strings the user wants to enter, such as words and expressions, near the top of the prediction candidates in accordance with the content of the document being created, so that the user can select the desired word or the like from the prediction candidates without complicated operations.

The document input device according to the present invention is a character input device that performs character input by converting input information entered by a user into a character string. It includes one or more context dictionaries used when converting the input information into a character string, and context dictionary selection means selects the context dictionary to be used from the one or more context dictionaries based on the content of the document being entered. Character string conversion means then converts the input information into a character string based on the selected context dictionary.

When there is only one context dictionary, the context dictionary selection means necessarily selects that one dictionary.

According to the present invention, the context dictionary selection means selects the context dictionary to be used based on the content of the character strings whose input has already been confirmed in the document being entered. Because a context dictionary suited to the document being entered is selected, the user can easily enter the desired character string, and the user's input burden is reduced.

Furthermore, because an appropriate context dictionary is selected automatically from the content of the document being entered, the user does not need to select a suitable context dictionary manually.

In the document input device according to the present invention, context dictionary generation means generates one or more context dictionaries based on a document set consisting of one or more documents.

According to the present invention, the context dictionary generation means generates, based on the document set, the one or more context dictionaries used when converting input information from the user into character strings. It is therefore unnecessary to prepare a plurality of context dictionaries separately in advance when the document input device is introduced. In addition, because the context dictionaries are generated from the set of documents created and saved so far, they suit the characteristics of the documents the user creates frequently. By using such context dictionaries, the user can easily enter the desired character string, and the user's input burden is further reduced.

In the character input device according to the present invention, the character string conversion means includes character string candidate display means that, based on the input information, extracts from the selected context dictionary one or more character string candidates the user may wish to enter and displays them.

According to the present invention, the user can select the desired character string from the candidates displayed by the character string candidate display means. The user can therefore easily enter the desired character string, and the user's input burden is further reduced.

In the document input device according to the present invention, the context dictionary generation means includes: document classification means that generates one or more clusters by performing clustering on the document set and classifies each document into one of the clusters; first linguistic feature extraction means that extracts, from the one or more documents belonging to each cluster, linguistic features reflecting the content of those documents; and context dictionary output means that outputs a context dictionary for each cluster based on the extracted features.

In the document input device according to the present invention, the context dictionary selection means includes: second linguistic feature extraction means that extracts linguistic features reflecting the content of the document being entered; similarity calculation means that calculates the similarity between the extracted linguistic features and the linguistic features of the context dictionary corresponding to each cluster; and context dictionary extraction means that extracts the context dictionary of the cluster with the highest calculated similarity and outputs it as the context dictionary to be used.

According to the two aspects of the invention above, the document set is clustered appropriately according to the characteristics of the documents, and a context dictionary suited to each cluster is generated. When the user creates a document, a context dictionary suited to character input for that document is selected based on the document's characteristics. With the context dictionary selected in this way, the user can easily enter the desired character string.

In the character input device according to the present invention, the character string candidate display means extracts character string candidates from the selected context dictionary according to the linguistic features of that context dictionary and displays them.

According to the present invention, the character string candidate display means extracts and displays character string candidates according to the linguistic features of the context dictionary, for example the appearance frequency of the character strings stored in it. With this configuration, the character string the user wants is displayed near the top of the candidates. The user can therefore easily select and enter the desired character string, and the user's input burden is further reduced.

In the document input device according to the present invention, context dictionary update means within the context dictionary generation means regenerates the context dictionaries at a predetermined timing.

According to the present invention, the context dictionaries are updated at predetermined intervals, so the most suitable context dictionary can be used.

In addition, as documents with new content accumulate, context dictionaries reflecting that content are generated. The device can therefore easily handle, for example, documents in a field or of a type that was not anticipated at the outset.

The document input device program according to the present invention is a character input device program that causes a computer to provide the functions of the means described in any one of claims 1 to 5.

According to the present invention, it is possible to provide a document input device and a program for it that, when a document is created using the word processor function of a personal computer or the like, can present the character strings the user wants to enter, such as words and expressions, near the top of the prediction candidates in accordance with the content of the document being created, so that the user can select the desired word or the like from the prediction candidates without complicated operations.

The significance and effects of the present invention will become clearer from the following description of the embodiment.

However, the following embodiment is only one embodiment of the present invention, and the meanings of the terms of the present invention and of its constituent elements are not limited to those described in the embodiment.

(Example 1)
An embodiment in which the present invention is applied to input prediction processing in an electronic medical record system used at medical institutions such as hospitals and clinics is described below with reference to the drawings.

FIG. 1 shows the configuration of one embodiment of the character input device according to the present invention.

As shown in FIG. 1, the character input device 100 consists of a storage device 101, a CPU 102, an input device 103, and a display device 104.

The storage device 101 stores the program for executing the functions of the character input device 100 described later, together with the various tables and electronic document files (hereinafter simply called documents) that are referenced when those functions are executed. The CPU 102 executes processing according to the character input device program stored in the storage device 101. The input device 103 is used by the user to enter the character strings and other information needed to run the character input device program. The display device 104 displays the character strings and other information entered by the user and the results of the CPU 102 executing the character input device program.

Accordingly, in the character input device 100 of the present invention, when the user enters information such as a character string from the input device 103, the CPU 102 executes the character input device program stored in the storage device 101, displays the result on the display device 104, and saves the document created as a result in the storage device 101.

FIG. 2 shows a functional block diagram of the document input device 100. A functional block diagram divides the document input device 100 by function and represents it as the blocks that implement each function.

In FIG. 2, the character input device 100 comprises: a document set 201 consisting of one or more documents created so far; a context dictionary generation unit 202 that generates the context dictionaries described later based on the document set 201; a context dictionary set 203 consisting of one or more context dictionaries; an input unit 204 that acquires the input character string entered from the input device 103; a context dictionary selection unit 205 that determines and selects which context dictionary in the context dictionary set 203 to use; a character string conversion unit 206 that refers to the selected context dictionary and predicts, from the input character string, the character string the user wants; and an output unit 207 that displays the predicted candidates on the display device 104 or saves the confirmed input character strings to the document set 201 as a document.

In this embodiment, the document set 201 consists of electronic files of the medical records of a plurality of patients (hereinafter called medical record documents) created at medical institutions such as hospitals and clinics.

The context dictionary generation unit 202 generates the context dictionaries described later based on a plurality of medical record documents.

Specifically, suppose for example that N medical record documents (N = 1, 2, 3, ...) initially exist in the document set 201. The context dictionary generation unit 202 first classifies these N medical record documents into, for example, C clusters (C ≤ N, C = 1, 2, 3, ...). As a result, clusters that roughly correspond to the content of the records are formed.

Existing document classification techniques can be applied to this clustering process. For example, each document can be morphologically analyzed to generate a document vector, and a clustering algorithm such as the k-means method or Ward's method can then be applied to these document vectors to classify the documents into clusters.
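The clustering step can be illustrated with a minimal sketch. It assumes the documents have already been split into words by a morphological analyzer (not shown), uses raw word counts as document vectors, and runs a plain k-means with hand-picked initial centroids. The function names, the toy documents, and the choice of initial centroids are all assumptions made for illustration and do not come from the patent.

```python
from collections import Counter


def to_vector(tokens, vocabulary):
    """Turn a tokenized document into a word-count vector over the vocabulary."""
    counts = Counter(tokens)
    return [float(counts[word]) for word in vocabulary]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, initial_centroids, iterations=10):
    """Plain k-means; returns one cluster label per input vector."""
    centroids = [list(c) for c in initial_centroids]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(v, centroids[k]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical, already-tokenized medical record documents.
documents = [
    ["arrhythmia", "electrocardiogram", "palpitation"],
    ["arrhythmia", "electrocardiogram"],
    ["gastric_mucosa", "inflammation", "abdominal_pain"],
    ["inflammation", "abdominal_pain"],
]
vocabulary = sorted({word for doc in documents for word in doc})
vectors = [to_vector(doc, vocabulary) for doc in documents]

# C = 2 clusters, seeded with the first and third documents (an assumption).
labels = kmeans(vectors, initial_centroids=[vectors[0], vectors[2]])
print(labels)
```

In practice a library implementation with a better initialization strategy would likely be preferable; the sketch only shows how word-count vectors lead to content-based groups.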

FIG. 3 shows a state in which the N medical record documents have been classified by content into C clusters, one per disease group such as heart disease, digestive disease, and respiratory disease. The number of clusters C may be set in advance, or the context dictionary generation unit 202 may determine it automatically according to the content of the medical record documents.

In practice, clusters are not necessarily formed strictly per disease group as described above, but this poses no problem as long as the classification result is such that most of the medical record documents in each cluster have related content.

Next, the context dictionary generation unit 202 extracts linguistic features reflecting document content from the medical record documents belonging to each cluster, and generates a context dictionary (described later) for each cluster.
Here, linguistic features reflecting document content are, for example:
(1) the appearance frequency or appearance probability of each word in each cluster
(2) the co-occurrence count or co-occurrence probability between words in each cluster
(3) the appearance frequency or appearance probability of each character in each cluster
(4) the co-occurrence count or co-occurrence probability between characters in each cluster

In this embodiment, the appearance frequency of each word in each cluster is used as the linguistic feature reflecting document content.

To extract the linguistic features, the context dictionary generation unit 202 first divides each medical record document into words (morphological analysis). If morphological analysis was already performed during cluster classification, it is not performed again. Next, to obtain the appearance frequency of each word in the medical record documents belonging to each cluster, the unit refers to the morphological analysis result of each medical record document and counts the words that appear in it and how many times each appears. Then, for each of the C clusters, it totals the appearance counts per word over the medical record documents belonging to that cluster. This yields a word appearance frequency list for each cluster.

For example, among the C clusters in FIG. 3, let cluster 1 be the cluster consisting mainly of heart disease medical record documents and cluster 2 be the cluster consisting mainly of digestive disease medical record documents.

FIGS. 4 and 5 show the word appearance frequency lists generated by the above procedure for cluster 1 and cluster 2, respectively. Function words such as particles, auxiliary verbs, and conjunctions, which are considered to reflect almost none of the content of a medical record document, are excluded from the lists; to reduce dictionary size efficiently, only nouns, which are considered to reflect the content best, are included. As shown in FIGS. 4 and 5, each list consists of a “No” column giving the word's serial number, a “notation” column giving the word's written form, a “reading” column giving the word's reading, a “part of speech” column giving the word's part of speech, and an “appearance frequency” column giving the word's appearance frequency. Each list contains only words whose appearance frequency is at or above a predetermined threshold (2 or more in FIGS. 4 and 5), sorted in descending order of appearance frequency. As a result, cluster 1 yields a list of n₁ words such as “arrhythmia” and “electrocardiogram,” as shown in FIG. 4, and cluster 2 yields a list of n₂ words such as “gastric mucosa” and “inflammation,” as shown in FIG. 5.
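The construction of one such word appearance frequency list can be sketched as follows. The sketch assumes each document has already been morphologically analyzed into (notation, reading, part of speech) tuples; the function name `build_context_dictionary`, the dictionary keys, and the sample tokens are hypothetical, and the threshold of 2 follows the value stated for FIGS. 4 and 5.

```python
from collections import Counter


def build_context_dictionary(cluster_documents, min_frequency=2):
    """Build the word appearance frequency list for one cluster.

    cluster_documents: list of documents, each a list of
    (notation, reading, part_of_speech) tuples from a morphological analyzer.
    """
    counts = Counter()
    for tokens in cluster_documents:
        for notation, reading, pos in tokens:
            if pos == "noun":  # function words are left out of the list
                counts[(notation, reading, pos)] += 1

    # Keep only words at or above the threshold.
    entries = [
        {"notation": n, "reading": r, "pos": p, "frequency": f}
        for (n, r, p), f in counts.items()
        if f >= min_frequency
    ]
    # Descending frequency; ties broken by notation so the order is stable.
    entries.sort(key=lambda e: (-e["frequency"], e["notation"]))
    for number, entry in enumerate(entries, start=1):
        entry["no"] = number  # serial number column
    return entries


# Hypothetical analyzed documents for a heart-disease cluster.
cluster_1 = [
    [("不整脈", "ふせいみゃく", "noun"), ("が", "が", "particle"),
     ("心電図", "しんでんず", "noun")],
    [("不整脈", "ふせいみゃく", "noun"), ("心電図", "しんでんず", "noun"),
     ("動悸", "どうき", "noun")],
    [("不整脈", "ふせいみゃく", "noun")],
]
dictionary_1 = build_context_dictionary(cluster_1)
for entry in dictionary_1:
    print(entry["no"], entry["notation"], entry["reading"], entry["frequency"])
```

Here 不整脈 (arrhythmia) and 心電図 (electrocardiogram) pass the threshold, while 動悸 (palpitation), which appears only once, and the particle are dropped.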

The context dictionary generation unit 202 generates word appearance frequency lists for the remaining clusters in the same way.

The context dictionary generation unit 202 outputs the C word appearance frequency lists generated as described above as the context dictionaries corresponding to the respective clusters, and saves them in the context dictionary set 203.

Next, the input prediction processing performed by the context dictionary selection unit 205 and the character string conversion processing unit 206 is described.

Input prediction processing is the series of steps that selects a context dictionary according to the confirmed character strings of the medical record document being created, recognizes the character string the user is currently typing and predicts the character string the user wants to enter, extracts prediction candidates from the selected context dictionary, and displays them on the display device 104.

FIG. 6 shows the medical record input screen displayed on the display device 104 during the examination of a patient who came to the hospital complaining of abdominal pain and vomiting.

In FIG. 6, the user has entered “Abdominal pain since last night. Vomited this morning after breakfast. Abdominal pain still continues.” in the [Chief complaint / history of present illness] field of the medical record input screen, “Past history: H8 gastric ulcer” in the [Initial-visit information] field, and “Occupation: company employee; drinking history: 25 years” in the [Lifestyle information] field. After these entries were confirmed, the user began to enter “suspected acute gastritis” in the [Findings] field. The figure shows the state at the point where the user has typed the reading “きゅうせ” (kyuse) on the keyboard (not shown) of the input device 103 and six words are displayed as prediction candidates: “acute gastritis” (急性胃炎), “acute peritonitis” (急性腹膜炎), “acute appendicitis” (急性虫垂炎), “acute colitis” (急性大腸炎), “acute hepatitis” (急性肝炎), and “acute pancreatitis” (急性膵炎).

Each time an input character string from the user is confirmed, the context dictionary selection unit 205 morphologically analyzes the character strings confirmed so far and extracts the nouns and their appearance counts. In the case shown in FIG. 6, the context dictionary selection unit 205 extracts “last night,” “abdominal pain” (twice), “this morning,” “breakfast,” “vomiting,” “past history,” “gastric ulcer,” “occupation,” “company employee,” and “drinking history” from the character strings confirmed before the unconfirmed string “きゅうせ.” This is equivalent to generating, from the medical record document being entered, the following vector whose weights are word appearance frequencies (hereinafter called a document vector).

(last night, abdominal pain, this morning, breakfast, vomiting, past history, gastric ulcer, occupation, company employee, drinking history)
= (1, 2, 1, 1, 1, 1, 1, 1, 1, 1)

Next, the context dictionary selection unit 205 compares the extracted words and their counts against the words stored in the “notation” column and the frequencies stored in the “appearance frequency” column of the context dictionary for each cluster. Specifically, for each context dictionary, the vector that takes the frequency in the “appearance frequency” column as the weight of the corresponding word in the “notation” column is regarded as the document vector representing that cluster, and the cosine similarity S between the document vector of the medical record document being entered and the document vector of each cluster is computed. That is, as shown in the following equation (Equation 1), the appearance frequencies of the words that appear in both are multiplied and summed, and the sum is divided by the product of the norms of the two vectors.

$$S(d, D_i) = \frac{\sum_{w \in (d \cap D_i)} W(d, w)\, W(D_i, w)}{\lVert d \rVert \, \lVert D_i \rVert} \qquad \text{(Equation 1)}$$

d: document vector representing the document currently being entered

Di: document vector representing the i-th cluster

(d∩Di): set of words that appear in both d and Di

W(d,w): weight (appearance frequency) of word w in vector d

W(Di,w): weight (appearance frequency) of word w in vector Di

The context dictionary selection unit 205 calculates the similarity S with each of the C context dictionaries, extracts the context dictionary with the maximum similarity S, and selects it as the context dictionary of the cluster whose word usage is most similar. In the case of FIG. 6, cluster 2, which consists mainly of digestive disease medical record documents, is selected as the most similar cluster.
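The selection step described above can be sketched in a few lines. The document vector uses the nouns and counts given for FIG. 6; the cluster vectors, their frequencies, and the function names are hypothetical values chosen only to illustrate the calculation, and the zero-norm guard is an added assumption.

```python
import math
from collections import Counter


def cosine_similarity(d, D_i):
    """Cosine similarity S between two word -> frequency mappings."""
    shared_words = set(d) & set(D_i)  # words appearing in both vectors
    dot = sum(d[w] * D_i[w] for w in shared_words)
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    norm_D = math.sqrt(sum(v * v for v in D_i.values()))
    if norm_d == 0 or norm_D == 0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_d * norm_D)


def select_cluster(d, cluster_vectors):
    """Return the name of the cluster whose vector is most similar to d."""
    return max(cluster_vectors,
               key=lambda name: cosine_similarity(d, cluster_vectors[name]))


# Nouns extracted from the confirmed text in FIG. 6 (English glosses).
confirmed_nouns = [
    "last night", "abdominal pain", "this morning", "breakfast", "vomiting",
    "abdominal pain", "past history", "gastric ulcer", "occupation",
    "company employee", "drinking history",
]
d = Counter(confirmed_nouns)  # document vector with frequency weights

# Hypothetical cluster vectors taken from two context dictionaries.
cluster_vectors = {
    "cluster 1": {"arrhythmia": 12, "electrocardiogram": 9},
    "cluster 2": {"gastric mucosa": 10, "inflammation": 8,
                  "abdominal pain": 7, "vomiting": 5, "gastric ulcer": 4},
}
selected = select_cluster(d, cluster_vectors)
print(selected)
```

With these made-up frequencies, cluster 1 shares no words with the document and scores zero, so cluster 2 is chosen, which matches the outcome described for FIG. 6.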

The context dictionary can be selected each time a newly entered character string is confirmed, to the extent that the computer in use (a PC or the like) can handle the processing. However, when very little confirmed text exists, such as immediately after a new document is created, it is difficult to select a context dictionary accurately. Therefore, when the confirmed input is shorter than a certain number of characters (or a certain number of words), selection of the context dictionary may be deliberately skipped.

The character string conversion processing unit 206 recognizes the reading string being entered and predicts the character string that the user intends to input. It then extracts prediction candidates from the context dictionary selected by the context dictionary selection unit 205 and has the output unit 207 display them.

In the case of FIG. 6, the character string conversion processing unit 206 extracts, from the context dictionary of cluster 2 selected by the context dictionary selection unit 205, the words whose readings prefix-match the reading string 「きゅうせ」 being entered, namely 「急性胃炎」 (acute gastritis), 「急性腹膜炎」 (acute peritonitis), 「急性虫垂炎」 (acute appendicitis), and so on, and displays them on the medical record entry screen via the output unit 207 as prediction candidates in descending order of appearance frequency.
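
The prefix matching and frequency ordering described above can be sketched as follows. This is only an illustration: the "reading" field, the function name `predict`, and all frequency values are assumptions and are not taken from the specification.

```python
def predict(reading_prefix, context_dictionary, limit=5):
    """Return notations whose reading starts with reading_prefix,
    ordered by descending appearance frequency."""
    matches = [e for e in context_dictionary
               if e["reading"].startswith(reading_prefix)]
    matches.sort(key=lambda e: e["frequency"], reverse=True)
    return [e["notation"] for e in matches[:limit]]

# Hypothetical excerpt of the context dictionary for cluster 2.
cluster2_dictionary = [
    {"notation": "急性虫垂炎", "reading": "きゅうせいちゅうすいえん", "frequency": 11},
    {"notation": "胃潰瘍", "reading": "いかいよう", "frequency": 35},
    {"notation": "急性胃炎", "reading": "きゅうせいいえん", "frequency": 42},
    {"notation": "急性腹膜炎", "reading": "きゅうせいふくまくえん", "frequency": 17},
]

candidates = predict("きゅうせ", cluster2_dictionary)
print(candidates)
```

A linear scan is enough for a sketch; a practical dictionary would more likely index readings in a trie or a sorted list so that prefix lookups stay fast.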

Details of general input prediction methods are disclosed in, for example, Japanese Patent Laid-Open Nos. H07-334499 and H09-274613.

FIG. 7 shows a flowchart of the character input device program executed by the CPU 102.

In step S301, the context dictionary generation unit 202 generates context dictionaries from the medical record documents stored in the document set, as described above, and outputs them to the context dictionary set 203.

In step S302, the input unit 204 determines whether a character string has been input by the user. If there is character input, the process proceeds to step S303; otherwise, the process waits until there is character input.

In step S303, the context dictionary selection unit 205 selects the optimum context dictionary from the context dictionary set 203 according to the content and context of the document being input.

In step S304, the character string conversion processing unit 206 predicts the character string the user wants to input on the basis of the character string input by the user, and extracts prediction candidates from the selected context dictionary.

In step S305, the output unit 207 displays the prediction candidates for the character string the user wants to input on the display device 104.

As described above, compared with a method that extracts words from a single context dictionary and forms prediction candidates in order of frequency, the character input device 100 of the present invention can present words suited to the context at the top of the candidate list, which greatly improves operability. For example, even among medical terms beginning with 「急性」 (acute) alone, more than 200 words have readings that prefix-match 「きゅうせ」. With a single dictionary, the word the user wants therefore rarely appears near the top, and the user must either scroll through the prediction candidate list to find it or type a longer reading to narrow down the candidates.

In addition, because the optimum context dictionary is determined automatically, no operation for selecting a dictionary is needed, in contrast to a method in which multiple dictionaries are prepared in advance for each field and the user switches among them by specifying the field of the input document.

Furthermore, because the context dictionaries are generated dynamically from existing medical record documents, the device avoids a problem that arises when every conceivable context dictionary is prepared in the system in advance, namely having to keep holding context dictionaries that are hardly ever needed.

In the above example, the term frequency TF (Term Frequency) of each word is used as the word weighting measure in the context dictionaries and document vectors. However, the weighting measure is not limited to this, and various existing measures can be used, such as IDF (Inverse Document Frequency), which represents the degree to which a word appears only in a small number of specific documents, or the product of TF and IDF.
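
As an illustration of the TF and IDF weighting mentioned above, the following sketch computes TF-IDF weights for a small set of documents. The IDF form log(N / df) is one common variant chosen here as an assumption; the specification does not fix a particular formula, and the sample documents are invented.

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """documents: list of word lists. Returns one {word: tf * idf} dict per document."""
    n = len(documents)
    df = Counter()                       # number of documents containing each word
    for words in documents:
        df.update(set(words))
    vectors = []
    for words in documents:
        tf = Counter(words)              # raw count of each word in this document
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors

# Invented sample documents, already split into words.
docs = [
    ["abdominal pain", "vomiting", "medical history"],
    ["chest pain", "palpitations", "medical history"],
    ["abdominal pain", "abdominal pain", "stomach ulcer", "medical history"],
]
vectors = tf_idf_vectors(docs)
print(vectors[2])
```

A word such as "medical history", which appears in every document, receives weight 0 under this variant, so it no longer influences the similarity between a document and a cluster.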

In addition to (or instead of) per-word appearance frequencies, co-occurrence probabilities or connection probabilities between words can also be adopted. The co-occurrence probability between words is a value expressing, as a probability, how often two (or three or more) words appear together in the same document; in general, the higher the co-occurrence probability, the more closely related the words are. The connection probability is the probability that two (or three or more) words appear consecutively; the larger this value, the more likely the words are to appear one after another. Constructing the context dictionaries with these statistics as well can improve the accuracy of context dictionary selection and enables more accurate prediction processing that reflects how readily words connect to one another.
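
The two statistics described above can be estimated from word-segmented documents as in the following sketch, restricted to word pairs. The normalization choices (co-occurrence divided by the number of documents, connection probability as a conditional relative frequency) are assumptions for illustration, as are the function names and sample data.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_probability(documents):
    """Fraction of documents in which each unordered word pair appears together."""
    n = len(documents)
    pair_counts = Counter()
    for words in documents:
        for pair in combinations(sorted(set(words)), 2):
            pair_counts[pair] += 1
    return {pair: count / n for pair, count in pair_counts.items()}

def connection_probability(documents):
    """Estimate P(next | prev): how often `next` directly follows `prev`."""
    bigram_counts = Counter()
    prev_counts = Counter()
    for words in documents:
        for prev, nxt in zip(words, words[1:]):
            bigram_counts[(prev, nxt)] += 1
            prev_counts[prev] += 1
    return {(prev, nxt): count / prev_counts[prev]
            for (prev, nxt), count in bigram_counts.items()}

# Invented sample documents, already split into words.
docs = [
    ["abdominal pain", "vomiting", "stomach ulcer"],
    ["abdominal pain", "vomiting"],
    ["abdominal pain", "fever"],
]
cooc = cooccurrence_probability(docs)
conn = connection_probability(docs)
print(cooc[("abdominal pain", "vomiting")], conn[("abdominal pain", "vomiting")])
```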

Furthermore, when prediction processing is performed with the selected context dictionary, the top k context dictionaries (k < C) may in some cases be used together instead of only the single context dictionary with the maximum similarity (for example, if several context dictionaries have a similarity to the medical record document being entered that is at or above a certain threshold, these are used together). In addition, to avoid missing prediction candidates, the prediction candidates may be formed by adding words extracted from the context dictionaries of all clusters to the words extracted from the matching context dictionary.
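
A minimal sketch of combining several context dictionaries by a similarity threshold is shown below. Ranking merged candidates by frequency weighted with the cluster similarity is an assumption made here for illustration; the specification only states that the dictionaries may be used together. All names and values are hypothetical.

```python
def merged_candidates(reading_prefix, dictionaries, similarities,
                      threshold=0.5, limit=5):
    """Merge prefix-matching candidates from every context dictionary whose
    similarity is at or above threshold, ranked by similarity-weighted frequency."""
    scores = {}
    for cid, entries in dictionaries.items():
        s = similarities.get(cid, 0.0)
        if s < threshold:
            continue                      # skip clusters that are not similar enough
        for notation, reading, freq in entries:
            if reading.startswith(reading_prefix):
                scores[notation] = scores.get(notation, 0.0) + s * freq
    return sorted(scores, key=lambda n: scores[n], reverse=True)[:limit]

# Hypothetical dictionaries: (notation, reading, appearance frequency).
dictionaries = {
    1: [("急性心筋梗塞", "きゅうせいしんきんこうそく", 50)],
    2: [("急性胃炎", "きゅうせいいえん", 40),
        ("急性虫垂炎", "きゅうせいちゅうすいえん", 10)],
    3: [("急性気管支炎", "きゅうせいきかんしえん", 30),
        ("急性胃炎", "きゅうせいいえん", 5)],
}
similarities = {1: 0.2, 2: 0.8, 3: 0.6}   # invented similarity values

merged = merged_candidates("きゅうせ", dictionaries, similarities)
print(merged)
```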

(Example 2)
The following describes the case where the context dictionaries stored in the context dictionary set 203 are updated in the electronic medical record system described in Example 1.

In Example 1, the N medical record documents that already existed when the electronic medical record system was introduced were classified into C clusters, and C context dictionaries corresponding to disease groups such as heart disease, digestive system disease, respiratory disease, and blood disease were generated. Suppose that afterwards, for example, the medical institution hires a physician specializing in diabetes treatment, the number of diabetic patients increases, and the tendency of the content of the medical record documents changes. In such a case, the context dictionaries from the initial introduction cannot provide sufficiently accurate prediction suited to the content and context of the document being created. Re-clustering the medical record documents therefore yields C' clusters reflecting the latest tendency of the medical record documents, and C' corresponding context dictionaries are reconstructed. If the number of medical record documents for diabetic patients is not negligible compared with the other disease groups, a cluster consisting mainly of diabetes medical record documents is obtained as one of the C' clusters, and a corresponding context dictionary is generated. As a result, prediction accuracy improves even for input to the medical record documents of diabetic patients, for which the context dictionaries from the initial introduction could not make appropriate predictions.

In actual operation, in addition to the case described above, the context dictionaries may be rebuilt at predetermined timings such as the following.
(1) After a predetermined time (number of days) has elapsed since the previous context dictionary generation
(2) After the user has accessed medical record documents (created, edited, viewed, received, etc.) a predetermined number of times
The timing may also be set in two stages: at the first timing, only the addition of updated medical record documents to the already classified clusters and the generation of the context dictionaries are performed, and at the second timing, everything is redone starting from clustering. This takes into account the computational cost of clustering.

Furthermore, when not enough documents can be secured to generate a context dictionary, similar medical record documents may be retrieved over the network, using already clustered medical record documents as queries, and used for context dictionary generation.

Moreover, because clustering a large number of documents generally incurs a high computational cost, a configuration is also possible in which, for example, medical record documents are clustered only once a month, while once a week the medical record documents updated during that week are added to one of the previously generated clusters and only the context dictionaries are regenerated to reflect the newly added content.
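
The two-stage update timing described above could be decided as in the following sketch. The function name, the return labels, and the 30-day and 7-day intervals (standing in for "once a month" and "once a week") are assumptions for illustration.

```python
from datetime import date, timedelta

def choose_update(today, last_clustering, last_dictionary_update,
                  recluster_every=timedelta(days=30),
                  refresh_every=timedelta(days=7)):
    """Decide which update to run today.

    "recluster": redo clustering and regenerate every context dictionary.
    "refresh":   add updated documents to existing clusters and regenerate
                 only the context dictionaries.
    "none":      nothing is due yet.
    """
    if today - last_clustering >= recluster_every:
        return "recluster"
    if today - last_dictionary_update >= refresh_every:
        return "refresh"
    return "none"

print(choose_update(date(2005, 7, 20), date(2005, 7, 1), date(2005, 7, 11)))
```

Checking the expensive condition first means that a full re-clustering, when due, subsumes the cheaper dictionary refresh.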

Furthermore, in this embodiment the context dictionaries are dynamically reconstructed from the medical record documents. Therefore, even when a context dictionary with new content that could not have been anticipated at the time of introduction becomes necessary (for example, when a new infectious disease spreads, as with SARS (severe acute respiratory syndrome) several years ago, or when the possibility of drug-induced infection comes to light and examinations surge, as with hepatitis C), the newly required context dictionary can easily be generated from the medical record documents of new content and types stored in the document set.

(Example 3)
A case where the present invention is used for context processing (post-character recognition processing) in a handwritten character input device will be described.

In this case, the input entered by the user is a handwritten character string, and the input prediction processing performed by the context dictionary selection unit 205 and the character string conversion processing unit 206 corresponds to processing in which the handwritten character string is recognized by a character recognition means and appropriate character string candidates are then output from the resulting group of character recognition candidates on the basis of linguistic knowledge.

As described in, for example, Japanese Patent Laid-Open No. 2000-90201, "Bigram dictionary, method for reducing its size, and method and apparatus for recognizing handwritten characters" (Tokyo University of Agriculture and Technology), and Japanese Patent Laid-Open No. H09-282420, "Character pattern recognition apparatus" (Hitachi), post-processing in character recognition generally obtains the optimum combination of candidate characters using the connection probabilities between characters. An outline is given below.

For a handwritten input character string X = {X₁, X₂, ..., X_N}, the character string evaluation value L(C | X) of a character string candidate C = {C₁, C₂, ..., C_N} is obtained by the following equation (Equation 2), and the character string candidate that maximizes this value is taken as the first-ranked candidate.

Here, X₁, X₂, ..., X_N are the individual handwritten character patterns that make up the handwritten input character string X, and C₁, C₂, ..., C_N are the individual characters that make up a given character string candidate C. S(X_i, C_i) is the similarity (recognition score) of the recognition candidate character C_i obtained from the handwritten character pattern X_i, P(C_{i+1} | C_i) is the connection probability from character C_i to character C_{i+1}, and w is a weight constant determined experimentally.

FIG. 8 is a diagram showing recognition processing, based on connection probabilities, of a character string entered by handwriting and recognized by the character recognition means.

In FIG. 8, the four characters 「本」, 「日」, 「は」, and 「晴」 are entered by handwriting, and three candidate characters are obtained for each: for example, 「本」, 「古」, and 「布」 as recognition candidates for 「本」, and 「目」, 「日」, and 「月」 as recognition candidates for 「日」. The character recognition post-processing referred to here is processing that finds the optimum candidate string, on the basis of the connection probabilities between characters, from among the 81 candidate strings (3 × 3 × 3 × 3) obtained by combining the three candidates for each of the four characters.
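
The search over the 81 candidate strings can be sketched by brute force as below. Equation 2 itself is not reproduced in this text, so the scoring used here (sum of log recognition scores plus w times the sum of log connection probabilities) is an assumed form built from the symbols S, P, and w defined above, not the equation of the specification. The candidates for 「は」 and 「晴」, all scores, and all probabilities are invented for illustration.

```python
import math
from itertools import product

def best_string(candidates, bigram_prob, w=1.0, floor=1e-6):
    """candidates: one {character: recognition score in (0, 1]} dict per pattern.
    bigram_prob: {(prev_char, next_char): connection probability}.
    Unseen character pairs fall back to a small floor probability."""
    best, best_score = None, -math.inf
    for chars in product(*(c.keys() for c in candidates)):
        # Recognition term: how well each character matches its handwritten pattern.
        score = sum(math.log(candidates[i][ch]) for i, ch in enumerate(chars))
        # Context term: how readily adjacent characters connect.
        score += w * sum(math.log(bigram_prob.get((a, b), floor))
                         for a, b in zip(chars, chars[1:]))
        if score > best_score:
            best, best_score = "".join(chars), score
    return best, best_score

candidates = [
    {"本": 0.6, "古": 0.3, "布": 0.1},
    {"目": 0.5, "日": 0.4, "月": 0.1},   # the top recognition result is wrong here
    {"は": 0.7, "ほ": 0.2, "け": 0.1},
    {"晴": 0.6, "睛": 0.3, "暗": 0.1},
]
bigram_prob = {("本", "日"): 0.3, ("日", "は"): 0.2, ("は", "晴"): 0.1}

best, score = best_string(candidates, bigram_prob)
print(best)
```

Exhaustive enumeration is workable for 81 combinations; for longer strings a dynamic programming search such as the Viterbi algorithm would normally replace it.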

In a general character recognition post-processing method, the connection probabilities between characters (character bigrams) used for post-processing are obtained in advance from a large amount of text.

In the present invention, by contrast, the N existing documents are classified into C clusters at the time of introduction, as in Example 1. C character bigram dictionaries are then generated from the documents belonging to the respective clusters and serve as the context dictionaries corresponding to those clusters. The context dictionary selection unit 205 selects the optimum context dictionary from the content of the document being entered, and the character string conversion processing unit 206 then performs character recognition post-processing using the selected context dictionary. This yields post-processing results that are more accurate and better suited to the context than those obtained with a single general-purpose character bigram dictionary.

In addition, as in Example 2, the context dictionaries are reconstructed at predetermined timings, so appropriate character recognition post-processing can be performed even if the tendency of the input documents changes.

In this case, in the context dictionary selection processing of the context dictionary selection unit 205, instead of using per-word appearance frequencies as in Example 1, character bigrams can also be extracted from the document being entered and the optimum dictionary selected according to the degree of match with the character bigram dictionary of each cluster. Alternatively, the word appearance frequencies of each cluster can be obtained separately and the context dictionary selected on the basis of word occurrence tendencies, as in Example 1.
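
Selection by character bigram match can be sketched as follows. The "degree of match" is not defined in the text; the coverage measure used here (the share of the bigrams in the document being entered that also occur in a cluster's dictionary) is an assumption, as are the function names and the sample texts.

```python
from collections import Counter

def char_bigrams(text):
    """Count adjacent character pairs in text."""
    return Counter(zip(text, text[1:]))

def build_bigram_dictionaries(clusters):
    """clusters: {cluster_id: [document text, ...]} -> {cluster_id: bigram Counter}."""
    dictionaries = {}
    for cid, texts in clusters.items():
        counts = Counter()
        for text in texts:
            counts.update(char_bigrams(text))
        dictionaries[cid] = counts
    return dictionaries

def select_by_bigram_match(current_text, dictionaries):
    """Return the cluster id whose bigram dictionary covers the most bigrams
    of the document being entered."""
    doc = char_bigrams(current_text)
    total = sum(doc.values())

    def coverage(cid):
        if total == 0:
            return 0.0
        hit = sum(n for bigram, n in doc.items() if bigram in dictionaries[cid])
        return hit / total

    return max(dictionaries, key=coverage)

# Invented sample documents for two clusters.
clusters = {
    1: ["胸痛あり。心電図異常。"],
    2: ["昨夜より腹痛あり。今朝嘔吐。", "胃潰瘍の既往あり。"],
}
dictionaries = build_bigram_dictionaries(clusters)
selected = select_by_bigram_match("腹痛と嘔吐あり", dictionaries)
print(selected)
```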

The present invention is not limited to the above embodiments, and various modifications are possible within the technical scope described in the claims.

FIG. 1 is a diagram showing the configuration of a document input device that is one embodiment of the present invention.
FIG. 2 is a diagram showing the functional blocks of a document input device that is one embodiment of the present invention.
FIG. 3 is a diagram showing the cluster classification of medical record documents by the context dictionary generation unit 202.
FIG. 4 is a diagram showing an example of a word appearance frequency list generated by the context dictionary generation unit 202.
FIG. 5 is a diagram showing an example of a word appearance frequency list generated by the context dictionary generation unit 202.
FIG. 6 is a diagram showing the input screen for a medical record document.
FIG. 7 is a diagram showing a flowchart of the character input device program.
FIG. 8 is a diagram showing recognition processing of a handwritten character string by the character recognition means.

Explanation of symbols

100 Character input device
201 Document set
202 Context dictionary generation unit
203 Context dictionary set
204 Input unit
205 Context dictionary selection unit
206 Character string conversion processing unit
207 Output unit

Claims

1. A character input device for inputting characters by converting input information into a character string, comprising:
one or more context dictionaries used when converting input information into a character string;
context dictionary selecting means for selecting a context dictionary to be used from the one or more context dictionaries based on the contents of a document during character input; and
character string conversion means for converting input information into a character string based on the selected context dictionary.

2. The character input device according to claim 1, further comprising context dictionary generating means for generating the one or more context dictionaries based on a document set including one or more documents.

3. The character input device according to claim 1, wherein the character string conversion means is character string candidate display means for extracting, from the selected context dictionary, and displaying one or more character string candidates that the user desires to input based on the input information.

4. The character input device according to claim 2, wherein the context dictionary generating means includes:
document classification means for generating one or more clusters by performing a clustering process on the document set and classifying each document into one of the one or more clusters;
first linguistic feature extracting means for extracting linguistic features reflecting the contents of each document from the one or more documents belonging to each cluster; and
context dictionary output means for outputting a context dictionary for each cluster based on the extracted features.

5. The character input device according to claim 4, wherein the context dictionary selecting means includes:
second linguistic feature extracting means for extracting linguistic features reflecting the contents of the document during character input;
similarity calculation means for calculating the similarity between the extracted linguistic features and the linguistic features of each context dictionary corresponding to each cluster; and
context dictionary extracting means for extracting the context dictionary of a cluster whose calculated similarity is equal to or greater than a predetermined value and outputting it as the context dictionary to be used.

6. The character input device according to claim 4, wherein the character string candidate display means extracts and displays the character string candidates from the selected context dictionary according to the linguistic features of the context dictionary.

7. The character input device according to claim 2, wherein the context dictionary generating means includes context dictionary updating means for regenerating the one or more context dictionaries at a predetermined timing.

8. A character input device program that causes a computer to provide the functions of the means according to any one of claims 1 to 7.