JP6973782B2 - Standard item name setting device, standard item name setting method and standard item name setting program

Info

Publication number: JP6973782B2
Application number: JP2017186661A
Authority: JP
Inventors: 一也谷川; 健太鈴木
Original assignee: Milabo Co Ltd
Current assignee: Milabo Co Ltd
Priority date: 2017-09-27
Filing date: 2017-09-27
Publication date: 2021-12-01
Anticipated expiration: 2037-09-27
Also published as: JP2019061550A; WO2019065775A1

Description

The present invention relates to a standard item name setting device, a standard item name setting method, and a standard item name setting program.

Local governments, companies, and similar organizations use a large number of forms. Forms are generally paper documents, but it is desirable to reduce the cost of managing them by using input forms that are digitized versions of the paper forms.

For example, Patent Document 1 below discloses a system that determines the type of a form and processes its acceptance using an input form corresponding to that type.

Japanese Unexamined Patent Publication No. 2004-126910

However, even for forms of the same type, the names of corresponding items (item names) may differ between local governments, companies, and so on. Consequently, when item names are to be standardized across many types of forms, the list of item names becomes enormous, and organizing it manually requires an extremely large amount of labor.

The present invention has been made in view of the above problem, and its object is to provide a standard item name setting device, a standard item name setting method, and a standard item name setting program capable of setting a standard item name for item names that correspond to one another across a plurality of forms.

According to the standard item name setting device of the present invention, the above problem is solved by providing: an item name acquisition unit that acquires a plurality of item names described in a plurality of forms; a clustering unit that classifies the plurality of item names into one of a plurality of clusters; a type determination unit that determines, from among a plurality of types, the type of a cluster of interest among the plurality of clusters based on the item names classified into that cluster; an item name candidate generation unit that, based on a rule corresponding to the type determined by the type determination unit, subdivides the item names classified into the cluster of interest into subclusters each consisting of item names that share a common character string, and generates, for the item names belonging to each subcluster, a plurality of item name candidates based on the character strings other than the common character string; and a standard item name setting unit that sets, as the standard item name corresponding to the subcluster, an item name candidate selected from the plurality of item name candidates of the subcluster based on a predetermined criterion.

According to the standard item name setting method of the present invention, the above problem is solved by a computer: acquiring a plurality of item names described in a plurality of forms; classifying the plurality of item names into one of a plurality of clusters; determining, from among a plurality of types, the type of a cluster of interest among the plurality of clusters based on the item names classified into that cluster; based on a rule corresponding to the determined type, subdividing the item names classified into the cluster of interest into subclusters each consisting of item names that share a common character string, and generating, for the item names belonging to each subcluster, a plurality of item name candidates based on the character strings other than the common character string; and setting, as the standard item name corresponding to the subcluster, an item name candidate selected from the plurality of item name candidates of the subcluster based on a predetermined criterion.

According to the standard item name setting program of the present invention, the above problem is solved by causing a computer to function as: an item name acquisition unit that acquires a plurality of item names described in a plurality of forms; a clustering unit that classifies the plurality of item names into one of a plurality of clusters; a type determination unit that determines, from among a plurality of types, the type of a cluster of interest among the plurality of clusters based on the item names classified into that cluster; an item name candidate generation unit that, based on a rule corresponding to the type determined by the type determination unit, subdivides the item names classified into the cluster of interest into subclusters each consisting of item names that share a common character string, and generates, for the item names belonging to each subcluster, a plurality of item name candidates based on the character strings other than the common character string; and a standard item name setting unit that sets, as the standard item name corresponding to the subcluster, an item name candidate selected from the plurality of item name candidates of the subcluster based on a predetermined criterion.

According to the above standard item name setting device, standard item name setting method, and standard item name setting program, one standard item name can be set for item names that correspond to one another across a plurality of forms. This reduces the labor of consolidating a plurality of corresponding item names into one item name.

Preferably, the above standard item name setting device includes a feature vector generation unit that generates a feature vector for each of the plurality of item names, and the clustering unit classifies the plurality of item names into the plurality of clusters based on the similarity of their feature vectors.
In this way, one standard item name can be set for a plurality of mutually similar item names described in the forms. This reduces the labor of consolidating a plurality of similar item names into one item name.

Preferably, the above standard item name setting device includes a learning model storage unit that stores a learning model obtained by machine learning of the words appearing in one or more forms used as learning data, and the feature vector generation unit generates the feature vector of an item name by combining the vectors, based on the learning model, of the words into which the item name is decomposed.
This improves the accuracy with which similar item names are classified together.

Preferably, the above standard item name setting device includes a matching pattern storage unit that stores, for each of the plurality of types, an associated matching pattern including at least one of a keyword and a regular expression, and the type determination unit determines the type of the cluster of interest from among the plurality of types based on the matching patterns that match the item names classified into the cluster of interest.
This improves the accuracy of determining the cluster type.
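
The type determination described above might be sketched as follows. This is only an illustration under stated assumptions: the table `MATCHING_PATTERNS`, its entries, and the function `determine_cluster_type` are hypothetical names, and choosing the type whose patterns match the most item names is an assumed decision rule that the text does not specify.

```python
import re
from collections import Counter

# Hypothetical matching pattern table: type -> regular expressions.
# The entries are illustrative assumptions, not the patent's actual table.
MATCHING_PATTERNS = {
    "name": [re.compile(r"氏名|名前|おなまえ")],
    "date": [re.compile(r"年月日|日付")],
    "address": [re.compile(r"住所|所在地")],
}


def determine_cluster_type(item_names):
    """Return the type whose patterns match the most item names, or None."""
    hits = Counter()
    for name in item_names:
        for type_name, patterns in MATCHING_PATTERNS.items():
            if any(p.search(name) for p in patterns):
                hits[type_name] += 1
    if not hits:
        return None  # no pattern matched any item name in the cluster
    return hits.most_common(1)[0][0]


cluster_of_interest = ["妊婦の氏名", "おなまえ", "世帯主の名前", "届出年月日"]
print(determine_cluster_type(cluster_of_interest))
```

A majority vote keeps the sketch tolerant of a few stray item names that were clustered together with items of a different type.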

Preferably, in the above standard item name setting device, the item name candidate generation unit sets the common character string from the portion of the item names classified into the cluster of interest other than the character strings that match the matching pattern stored in association with the type of the cluster of interest.
This makes it easy to define the criterion for dividing one cluster into one or more subclusters.

Preferably, in the above standard item name setting device, the standard item name setting unit sets, as the standard item name, the item name candidate that appears most frequently in the subcluster among the plurality of item name candidates for the subcluster.
In this way, the standard item name can be set based on the most commonly used expression among the item names classified into the same subcluster.

Preferably, the above standard item name setting device sets a plurality of standard item names by executing the processing of the type determination unit, the item name candidate generation unit, and the standard item name setting unit for each of the plurality of clusters.
In this way, standard item names can be set for the diverse item names of the forms. This reduces the labor of setting standard item names from forms.

According to the present invention, a standard item name can be set for item names that correspond to one another across a plurality of forms.

FIG. 1 is a diagram showing the overall configuration of the information processing system.
FIG. 2 is a diagram showing the relationship between item names in a plurality of forms.
FIG. 3 is a diagram explaining the flow of the standard item name setting process.
FIG. 4 is a functional block diagram of the standard item name setting device.
FIG. 5 is a diagram explaining the configuration of the neural network.
FIG. 6 is a diagram explaining the feature vector of an item name.
FIG. 7 is a diagram showing an example of the matching pattern table.
FIG. 8 is a diagram showing an example of the item name candidate table.
FIG. 9 is a flow diagram of the item name clustering process.
FIG. 10 is a flow diagram of the standard item name setting process.

Hereinafter, an information processing system 1 including a standard item name setting device 10 according to an embodiment of the present invention (hereinafter, the present embodiment) will be described with reference to FIGS. 1 to 10.
The embodiment described below is merely an example to facilitate understanding of the present invention and does not limit it. That is, the system configuration, data, processing, and so on described below may be changed or improved without departing from the spirit of the present invention, and the present invention includes their equivalents.

[Configuration of information processing system 1]
As shown in FIG. 1, the information processing system 1 includes a standard item name setting device 10 and a form processing device 30. The two devices are communicably connected via a network (not shown) such as the Internet or an intranet.

The form processing device 30 is connected to a scanner 40.
The scanner 40 is a device that captures image information by optically scanning a paper medium. In the present embodiment, the scanner 40 outputs a scanned image (image information) of a form P to the form processing device 30.
A form P is a standardized document such as a ledger, a slip, or an application form. In the present embodiment, many types of forms P are captured by the scanner 40 and output to the form processing device 30.

The form processing device 30 is a computer that processes the forms P captured by the scanner 40. Specifically, it executes OCR (optical character recognition) on a form P to acquire the character strings described in the form. It also analyzes the table structure of the form P based on the arrangement of ruled lines and character strings. More specifically, it divides the form P into item fields, input fields, and fill-in-the-blank input fields, and analyzes the item name information described in the item fields (and also in the fill-in-the-blank input fields).
An item field is an area in which a character string serving as an item name is written. An input field is an area in which no character string is written and into which information corresponding to an item field is entered. A fill-in-the-blank input field is an area in which character strings are written and information is entered between them.

In the present embodiment, the form processing device 30 outputs the information on the plurality of types of forms P that it has analyzed to the standard item name setting device 10. The standard item name setting device 10 then identifies the item names that correspond to one another across the plurality of types of forms P and sets a standard item name that standardizes them.

Next, the configuration of the standard item name setting device 10 will be described.
As shown in FIG. 1, the standard item name setting device 10 is a computer that includes, as hardware, a processor 11, a storage device 12, and a communication interface 13.

The processor 11 includes, for example, a central processing unit (CPU). It executes various arithmetic processes based on the programs and data stored in the storage device 12 and controls each part of the standard item name setting device 10.

The storage device 12 includes, for example, a memory and a magnetic disk device. It stores various programs and data and also functions as the work memory of the processor 11.

The communication interface 13 includes a network interface such as a network interface card (NIC) and connects to the network through it. The communication interface 13 communicates with devices such as the form processing device 30 via the network.

[Outline of processing executed by the standard item name setting device 10]
An outline of the processing executed by the standard item name setting device 10 will now be described with reference to FIGS. 2 and 3.

FIG. 2 shows the correspondence between item names in a plurality of forms (a first form PA, a second form PB, and a third form PC). Here, the first form PA, the second form PB, and the third form PC are forms for the same procedure but have different formats.
The first item name 50A, the second item name 50B, and the third item name 50C should therefore be treated as the same item, yet the item names differ: "氏名" (full name), "おなまえ" (name, written in hiragana), and "名前" (name). These item names need to be standardized.
The standard item name setting device 10 according to the present embodiment therefore sets a standard item name that standardizes the item names of the forms by the following procedure. Only an outline of the processing is given here; the details are described later.

As shown in FIG. 3, the standard item name setting device 10 first acquires, from the form processing device 30, an item name list 60 containing the item names described in a plurality of forms P.

Next, the standard item name setting device 10 generates a feature vector (feature vectors V_1 to V_N) for each of the item names (item names I_1 to I_N) included in the item name list 60. Here, N (an integer of 2 or more) denotes the number of item names.

Next, the standard item name setting device 10 clusters the feature vectors (feature vectors V_1 to V_N) and classifies them into a plurality of clusters (clusters C_1 to C_M). Here, the number of clusters is M (an integer of 2 or more).

Next, the standard item name setting device 10 determines the type of each of the plurality of clusters (clusters C_1 to C_M). A plurality of types, such as "name", "date", and "address", are defined in advance, and the standard item name setting device 10 determines the type corresponding to each cluster.

Next, the standard item name setting device 10 divides the item names classified into a cluster into subclusters (for example, C1a, C1b, and C1c) according to a subdivision rule based on the cluster type.
For example, a subcluster is generated for each modifier that modifies the noun of the type.

Next, the standard item name setting device 10 generates item name candidates for each noun modified by the modifier of the subcluster. For example, when the modifier of the subcluster is "妊婦の" ("pregnant woman's"), the standard item name corresponding to the subcluster has the form "妊婦の[name]". If the item names of the subcluster contain "氏名", "御名前", and "名前" as candidates for [name], the candidate that appears most frequently (for example, "氏名") is selected for the standard item name.
The standard item name setting device 10 executes the above processing for each cluster and each subcluster to set the standard item names.
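
The subdivision and selection steps of this outline could be sketched roughly as below. This is a simplified illustration, not the device's actual rule set: `NAME_PATTERN` and `set_standard_names` are hypothetical names, the pattern entries are assumed, and the sketch assumes that the modifier (the common character string) is simply the text that precedes the type noun.

```python
import re
from collections import Counter, defaultdict

# Assumed matching pattern for the "name" type; illustrative only.
NAME_PATTERN = re.compile(r"氏名|御名前|名前")


def set_standard_names(item_names, type_pattern):
    """Group item names by modifier, then pick the most frequent type noun."""
    subclusters = defaultdict(Counter)
    for item in item_names:
        match = type_pattern.search(item)
        if match is None:
            continue  # no type noun found; such items are skipped in this sketch
        modifier = item[:match.start()]  # common string, e.g. "妊婦の"
        subclusters[modifier][match.group()] += 1  # count each candidate noun
    return {
        modifier: modifier + candidates.most_common(1)[0][0]
        for modifier, candidates in subclusters.items()
    }


cluster = ["妊婦の氏名", "妊婦の氏名", "妊婦の御名前", "妊婦の名前", "世帯主の名前"]
print(set_standard_names(cluster, NAME_PATTERN))
```

With this input, the subcluster for "妊婦の" contains three candidates and "氏名" occurs most often, so it is the one kept for that subcluster.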

[Functions provided in the standard item name setting device 10]
The functions provided in the standard item name setting device 10 to realize the processing described above are explained below.

FIG. 4 shows a functional block diagram of the standard item name setting device 10. As shown in FIG. 4, the standard item name setting device 10 includes, as functions, a learning model storage unit 20, an item name acquisition unit 21, a feature vector generation unit 22, a clustering unit 23, a matching pattern storage unit 24, a type determination unit 25, an item name candidate generation unit 26, and a standard item name setting unit 27.

The functions of the above units of the standard item name setting device 10 are executed by the processor 11 operating each part of the device according to a program (the standard item name setting program) stored in the storage device 12. The standard item name setting device 10 may acquire this program through the communication interface 13 via a communication network, or may read it from a storage medium on which the program is stored.
The standard item name setting method according to the present invention is realized by the processor 11 operating according to the standard item name setting program.
The functions of the above units are described in detail below.

[Description of the learning model storage unit 20]
The learning model storage unit 20 stores a learning model obtained by machine learning of the words appearing in one or more forms used as learning data.

"Learning data" is the data set used to make the learning model perform machine learning. In other words, it is the set of sample data used for training the learning model.
For example, the analysis data of the forms P is one example of learning data. Specifically, the analysis data includes the item names obtained from the forms P by optical character recognition and the data obtained by decomposing those item names into words by morphological analysis.

"Words appearing in a form" are the words written in a form used as learning data. In other words, they are the words that make up the item names obtained by optical character recognition from such a form. A "word" is the smallest linguistic unit that has meaning or function.

"Machine learning" means inputting a reasonably large set of sample data, analyzing it, extracting useful regularities, rules, knowledge representations, decision criteria, and the like from the data, and thereby developing an algorithm.
In the present embodiment, an unsupervised learning method is used for machine learning.

A "learning model" is the mathematical model that is trained by machine learning on the learning data. A neural network is one example of such a learning model. Neural networks include two-layer perceptrons, three-layer hierarchical neural networks, and multilayer neural networks with four or more layers (deep neural networks).

FIG. 5 shows an example of a neural network 70 serving as the learning model. As shown in FIG. 5, the neural network 70 has an input layer 71, a hidden layer 72, and an output layer 73.

The input layer 71 consists of input nodes whose number of dimensions (number of elements) equals the number of words extracted from the forms used as learning data.
The hidden layer 72 is a network of one or more layers that relays between the input layer 71 and the output layer 73.
The output layer 73 consists of output nodes whose number of dimensions (number of elements) equals the number of words extracted from the forms used as learning data.

In machine learning using the learning data, the weight matrix from the input layer 71 to the hidden layer 72 and the weight matrix from the hidden layer 72 to the output layer 73 are learned.

The learning model storage unit 20 is realized mainly by the storage device 12 of the standard item name setting device 10.
Specifically, the storage device 12 stores, as the learning model data, the configuration of the input layer 71, the hidden layer 72, and the output layer 73 constituting the neural network 70, together with the weight matrix from the input layer 71 to the hidden layer 72 and the weight matrix from the hidden layer 72 to the output layer 73.
The machine learning process may be executed by the standard item name setting device 10 itself, or the standard item name setting device 10 may acquire, via the communication interface 13, the learning model data obtained by machine learning executed on another computer.

[Description of the item name acquisition unit 21]
The item name acquisition unit 21 acquires a plurality of item names described in a plurality of forms.

The "plurality of forms" are the forms for which standard item names are to be set. That is, a plurality of types of forms with different formats correspond to the "plurality of forms". The "plurality of forms" may be the same as the forms used as learning data.

The "plurality of item names" are the item names described in the "plurality of forms" to be processed. For example, all the item names contained in the "plurality of forms" to be processed correspond to the "plurality of item names".

The item name acquisition unit 21 is realized mainly by the processor 11, the storage device 12, and the communication interface 13 of the standard item name setting device 10.
Specifically, the processor 11 acquires, from the form processing device 30 via the communication interface 13, the analysis results of the plurality of forms to be processed. The analysis results include character string data for one or more item names obtained from each form by optical character recognition.

[Description of the feature vector generation unit 22]
The feature vector generation unit 22 generates a feature vector for each of the plurality of item names.

A "feature vector" represents the features of an item name as a vector. For example, a feature vector is a vector whose number of dimensions is the number of all words contained in the forms used as learning data. The feature vector of an item name is the vector obtained by combining the vectors of the words that make up the item name.

Specifically, the feature vector generation unit 22 generates the feature vector of an item name by combining the vectors, based on the learning model, of the words into which the item name is decomposed.
As a result of machine learning, the learning model holds vector information for each word. Specifically, the word vector information is contained in the weight matrix from the input layer 71 to the hidden layer 72. The feature vector generation unit 22 acquires the vector of each word constituting the item name from the learning model and generates the feature vector of the item name.
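
To illustrate why a word vector can be read out of the input-to-hidden weight matrix, the sketch below assumes that each word is fed to the input layer as a one-hot vector (an assumption suggested by the input dimension equaling the number of words, but not stated in the text). Under that assumption, the hidden-layer activation for a word is exactly one row of the matrix. The vocabulary, the matrix values, and all function names are made up for illustration.

```python
# Toy vocabulary and input-to-hidden weight matrix (values are made up).
VOCAB = ["受給者", "変更前", "住所", "の"]
W_INPUT_HIDDEN = [
    [0.9, 0.1, 0.0],  # row for "受給者"
    [0.2, 0.7, 0.1],  # row for "変更前"
    [0.0, 0.3, 0.8],  # row for "住所"
    [0.1, 0.1, 0.1],  # row for "の"
]


def one_hot(word):
    """Encode a word as a vector with a single 1.0 at its vocabulary index."""
    return [1.0 if w == word else 0.0 for w in VOCAB]


def hidden_activation(word):
    """Multiply the one-hot input by the input-to-hidden weight matrix."""
    x = one_hot(word)
    dims = len(W_INPUT_HIDDEN[0])
    return [
        sum(x[i] * W_INPUT_HIDDEN[i][j] for i in range(len(VOCAB)))
        for j in range(dims)
    ]


def word_vector(word):
    """Read the word vector directly as the matching row of the matrix."""
    return W_INPUT_HIDDEN[VOCAB.index(word)]


print(hidden_activation("住所") == word_vector("住所"))
```

Reading the row directly avoids the full matrix product, which is why the stored weight matrix alone is enough to look up word vectors.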

ここで、図６を参照しながら、項目名の特徴ベクトルの生成処理について説明する。
図６の（ａ）には、項目名の一例（「受給者の変更前の住所」）を示した。これに対し、特徴ベクトル生成部２２は、項目名に対して形態素解析を実行し、図６（ｂ）に示されるように、項目名を単語に分解する。この例では、「受給者の変更前の住所」は、「受給者」、「変更前」、「住所」、及び「の」に分解される。 Here, the process of generating the feature vector of the item name will be described with reference to FIG.
FIG. 6(a) shows an example of an item name, "recipient's address before change" (「受給者の変更前の住所」). The feature vector generation unit 22 performs morphological analysis on the item name and decomposes it into words, as shown in FIG. 6(b). In this example, "recipient's address before change" is decomposed into "recipient" (受給者), "before change" (変更前), "address" (住所), and the particle "no" (の).

ここで、「受給者」、「変更前」、「住所」、及び「の」のそれぞれの単語ベクトル（ｖ１〜ｖ４）は、学習モデル記憶部２０に記憶される学習モデルの重み行列から得られる。
そして、「受給者」、「変更前」、「住所」、及び「の」のそれぞれの単語ベクトル（ｖ１〜ｖ４）を合成することで、「受給者の変更前の住所」の特徴ベクトルが生成される。 Here, the word vectors (v1 to v4) of "recipient", "before change", "address", and "no" are obtained from the weight matrix of the learning model stored in the learning model storage unit 20.
Then, the feature vector of "recipient's address before change" is generated by combining the word vectors (v1 to v4) of "recipient", "before change", "address", and "no".

特徴ベクトル生成部２２は、主に標準項目名設定装置１０のプロセッサ１１及び記憶装置１２により実現される。
具体的には、プロセッサ１１は、項目名取得部２１により取得したそれぞれの項目名について以下のように特徴ベクトルを生成する。まず、プロセッサ１１は、項目名に対して形態素解析を実行し、項目名を単語に分解する。次に、プロセッサ１１は、学習モデル記憶部２０に記憶される学習モデルに基づいて、項目名を構成する単語のベクトルを合成した項目名の特徴ベクトルを生成する。 The feature vector generation unit 22 is mainly realized by the processor 11 and the storage device 12 of the standard item name setting device 10.
Specifically, the processor 11 generates a feature vector for each item name acquired by the item name acquisition unit 21 as follows. First, the processor 11 performs morphological analysis on the item name and decomposes the item name into words. Next, the processor 11 generates a feature vector of the item name by synthesizing the vector of the words constituting the item name based on the learning model stored in the learning model storage unit 20.
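The combining step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the values in `WORD_VECTORS` are invented, the item name is assumed to have already been split into words by morphological analysis, and element-wise summation is assumed as the combining operation because the text does not specify one.

```python
from typing import Dict, List

# Hypothetical word vectors standing in for rows of the learned
# input-layer-to-hidden-layer weight matrix (values are invented).
WORD_VECTORS: Dict[str, List[float]] = {
    "受給者": [0.9, 0.1, 0.0],
    "変更前": [0.0, 0.8, 0.2],
    "住所": [0.1, 0.0, 0.9],
    "の": [0.0, 0.1, 0.1],
}


def item_name_vector(words: List[str], vectors: Dict[str, List[float]]) -> List[float]:
    """Combine word vectors into one item-name vector by element-wise sum.

    Words missing from the model are skipped (an assumption; the text
    does not say how unknown words are handled).
    """
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for word in words:
        vec = vectors.get(word)
        if vec is None:
            continue
        total = [t + v for t, v in zip(total, vec)]
    return total


# Words of 「受給者の変更前の住所」 as listed for FIG. 6(b) (v1 to v4).
words = ["受給者", "変更前", "住所", "の"]
feature = item_name_vector(words, WORD_VECTORS)
```

Averaging instead of summing would leave cosine similarity unchanged, so either choice would behave the same in the clustering step that follows.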

［クラスタリング部２３の説明］
クラスタリング部２３は、複数の項目名を複数のクラスタのいずれかに分類する。 [Explanation of clustering unit 23]
The clustering unit 23 classifies a plurality of item names into one of a plurality of clusters.

「クラスタ」とは、複数の項目名を振り分ける先の分類である。換言すれば、複数の項目名のうち特徴ベクトルが類似するもの同士をグループ化した場合の各グループに相当する。 The "cluster" is a classification to which a plurality of item names are distributed. In other words, it corresponds to each group in the case where items having similar feature vectors among a plurality of item names are grouped together.

クラスタリング部２３は、複数の項目名のそれぞれの特徴ベクトルの類似度に基づいて、複数の項目名を複数のクラスタに分類する。 The clustering unit 23 classifies the plurality of item names into a plurality of clusters based on the similarity of the feature vectors of the plurality of item names.

「特徴ベクトルの類似度」とは、２つの特徴ベクトルの類似性を表す指標である。具体的には、２つの特徴ベクトルのコサイン類似度が上記の「特徴ベクトルの類似度」の一例に相当する。また、ピアソンの相関係数や偏差パターン類似度等も上記の「特徴ベクトルの類似度」の一例に相当する。 The "similarity of feature vectors" is an index of how similar two feature vectors are. Specifically, the cosine similarity between two feature vectors is one example of the "similarity of feature vectors". Pearson's correlation coefficient, deviation pattern similarity, and the like are other examples of the "similarity of feature vectors".

クラスタリング部２３は、主に標準項目名設定装置１０のプロセッサ１１及び記憶装置１２により実現される。
具体的には、プロセッサ１１は、特徴ベクトル生成部２２により生成した複数の項目名の特徴ベクトルの類似度を計算し、計算した類似度に基づいて、複数の項目名の特徴ベクトルを複数のクラスタに分類する。
なお、クラスタリングには、最短距離法、最長距離法、群平均法、ウォード法等の公知の手法を用いることができる。 The clustering unit 23 is mainly realized by the processor 11 and the storage device 12 of the standard item name setting device 10.
Specifically, the processor 11 calculates the similarity between the feature vectors of the plurality of item names generated by the feature vector generation unit 22 and, based on the calculated similarity, classifies the feature vectors of the plurality of item names into a plurality of clusters.
For clustering, known methods such as the shortest distance method (single linkage), the longest distance method (complete linkage), the group average method, and Ward's method can be used.
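As an illustration of similarity-based clustering, the following Python sketch pairs cosine similarity with the shortest distance method (single linkage). The `threshold` stopping rule and the sample vectors are assumptions made for this example; the text does not state how the number of clusters is decided.

```python
import math
from typing import List


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between u and v (0.0 if either is a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def single_linkage_clusters(vectors: List[List[float]], threshold: float) -> List[List[int]]:
    """Agglomerative clustering with the shortest distance (single linkage) rule.

    The two clusters whose closest members are most similar are merged
    repeatedly, until that similarity drops below `threshold`
    (a hypothetical stopping rule).
    """
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best = None  # (similarity, index_a, index_b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = max(
                    cosine_similarity(vectors[i], vectors[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or sim > best[0]:
                    best = (sim, a, b)
        if best is None or best[0] < threshold:
            break
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Four toy feature vectors: the first two and the last two point the same way.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = single_linkage_clusters(vectors, threshold=0.9)
```

Replacing `max` with `min` in the inner expression would give the longest distance method (complete linkage) instead.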

［マッチングパターン記憶部２４の説明］
マッチングパターン記憶部２４は、複数のタイプごとに、キーワード、正規表現のうち少なくとも一方を含むマッチングパターンを対応付けて記憶する。 [Explanation of matching pattern storage unit 24]
The matching pattern storage unit 24 stores matching patterns including at least one of a keyword and a regular expression in association with each other for each of a plurality of types.

「タイプ」とは、クラスタの特性である。換言すれば、「タイプ」とは、クラスタに属する項目名の特徴を示すデータである。より具体的には、１つのクラスタに属する項目名が、複数のタイプのそれぞれに定められたマッチングパターンのうちいずれを満足するかに基づいて、そのクラスタのタイプが決定される。 A "type" is a characteristic of a cluster. In other words, the "type" is data indicating the characteristics of the item names belonging to the cluster. More specifically, the type of the cluster is determined based on which of the matching patterns defined for each of the plurality of types the item name belonging to one cluster satisfies.

「キーワード」とは、予め指定された文字列である。文字列は、１つの単語からなってもよいし、複数の単語からなってもよい。
「正規表現」とは、文字列の集合を一つの文字列で表現する方法の一つである。具体的には、「正規表現」は、文字とワイルドカード等の条件が定められた記号列との組み合わせにより表される。 A "keyword" is a character string specified in advance. The character string may consist of one word or a plurality of words.
A "regular expression" is a way of expressing a set of character strings with a single character string. Specifically, a "regular expression" is written as a combination of literal characters and symbols, such as wildcards, that specify matching conditions.

「マッチングパターン」とは、クラスタのタイプに対応して定められた文字列のマッチング条件である。「マッチングパターン」には、１以上の正規表現と、１以上のキーワードを含むこととしてよい。 The "matching pattern" is a matching condition for a character string defined according to the type of cluster. The "matching pattern" may include one or more regular expressions and one or more keywords.

マッチングパターン記憶部２４は、主に標準項目名設定装置１０の記憶装置１２により実現される。具体的には、標準項目名設定装置１０の記憶装置１２は、マッチングパターンを格納したマッチングパターンテーブルＴ１を記憶する。 The matching pattern storage unit 24 is mainly realized by the storage device 12 of the standard item name setting device 10. Specifically, the storage device 12 of the standard item name setting device 10 stores the matching pattern table T1 in which the matching pattern is stored.

図７には、マッチングパターンテーブルＴ１の一例を示す。図７に示されるように、マッチングパターンテーブルＴ１は、項目タイプ（タイプ）、キーワード、正規表現を関連付けて格納する。例えば、「項目タイプ」には、「ｎａｍｅ（名前）」、「ｄａｔｅ（日付）」、「ａｄｄｒｅｓｓ（住所）」等が含まれることとし、各項目タイプに対してキーワード、正規表現が関連付けられる。 FIG. 7 shows an example of the matching pattern table T1. As shown in FIG. 7, the matching pattern table T1 stores an item type (type), a keyword, and a regular expression in association with each other. For example, the "item type" includes "name (name)", "date (date)", "address (address)", and the like, and a keyword and a regular expression are associated with each item type.

具体的には、項目タイプが「ｎａｍｅ」である場合には、例えばキーワードや正規表現として「氏名」、「ふりがな」、「名称」等を関連付けることとしてよい。
また、項目タイプが「ｄａｔｅ」である場合には、例えばキーワードや正規表現として「＊年＊月＊日」、「＊月＊日」等を関連付けることとしてよい。ここで、「＊」はワイルドカード（任意の文字列）である。
また、項目タイプが「ａｄｄｒｅｓｓ」である場合には、例えばキーワードや正規表現として「住所」、「所在地」、「＊先」等を関連付けることとしてよい。 Specifically, when the item type is "name", keywords or regular expressions such as 「氏名」 (full name), 「ふりがな」 (furigana reading), and 「名称」 (designation) may be associated with it.
When the item type is "date", keywords or regular expressions such as 「＊年＊月＊日」 (* year * month * day) and 「＊月＊日」 (* month * day) may be associated with it. Here, "＊" is a wildcard (an arbitrary character string).
When the item type is "address", keywords or regular expressions such as 「住所」 (address), 「所在地」 (location), and 「＊先」 (* destination) may be associated with it.

また、マッチングパターン記憶部２４は、項目タイプに対して、正規表現とキーワードのうちいずれか一方を記憶するようにしても構わない。 Further, the matching pattern storage unit 24 may store either a regular expression or a keyword for the item type.
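A table like T1 could be represented and queried as in the Python sketch below. The split of the example strings into `keyword` and `regex` columns, the helper names, and the choice to match wildcard patterns against the whole item name are assumptions made for illustration; FIG. 7 may organize the table differently.

```python
import re
from typing import Dict, List

# Illustrative stand-in for matching pattern table T1.
MATCHING_PATTERNS: Dict[str, Dict[str, List[str]]] = {
    "name": {"keyword": ["氏名", "ふりがな", "名称"], "regex": []},
    "date": {"keyword": [], "regex": ["＊年＊月＊日", "＊月＊日"]},
    "address": {"keyword": ["住所", "所在地"], "regex": ["＊先"]},
}


def wildcard_to_regex(pattern: str) -> "re.Pattern":
    """Translate the "＊" wildcard notation into a compiled regular expression."""
    parts = [re.escape(part) for part in pattern.split("＊")]
    return re.compile(".*".join(parts))


def matching_types(item_name: str) -> List[str]:
    """Return every item type whose keyword or wildcard pattern matches."""
    matched = []
    for item_type, entry in MATCHING_PATTERNS.items():
        by_keyword = any(k in item_name for k in entry["keyword"])
        by_regex = any(
            wildcard_to_regex(p).fullmatch(item_name) for p in entry["regex"]
        )
        if by_keyword or by_regex:
            matched.append(item_type)
    return matched
```

For example, `matching_types("届出年月日")` yields the "date" type through the pattern 「＊年＊月＊日」, and `matching_types("連絡先")` yields "address" through 「＊先」.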

［タイプ決定部２５の説明］
タイプ決定部２５は、複数のクラスタのうちの注目クラスタに分類された項目名に基づいて、当該注目クラスタのタイプを複数のタイプの中から決定する。 [Explanation of type determination unit 25]
The type determination unit 25 determines the type of the attention cluster from the plurality of types based on the item names classified into the attention cluster among the plurality of clusters.

「注目クラスタ」とは、複数のクラスタのうちから選択される、処理対象とする１つのクラスタである。換言すれば、「注目クラスタ」は、複数のクラスタのうちから選択される任意のクラスタである。すなわち、複数のクラスタの全てが「注目クラスタ」となり得る。例えば、複数のクラスタの中から注目クラスタを順次選択することで、複数のクラスタの各々を注目クラスタとして処理することが可能である。 The “cluster of interest” is one cluster to be processed, which is selected from a plurality of clusters. In other words, a "cluster of interest" is any cluster selected from a plurality of clusters. That is, all of the plurality of clusters can be "attention clusters". For example, by sequentially selecting a cluster of interest from a plurality of clusters, it is possible to process each of the plurality of clusters as a cluster of interest.

「複数のタイプ」とは、予め定められた複数のタイプである。具体的には、「ｎａｍｅ」、「ｄａｔｅ」、「ａｄｄｒｅｓｓ」等が上記の「複数のタイプ」に相当する。 The "plurality of types" are a plurality of predetermined types. Specifically, "name", "date", "address", and the like correspond to the above "plurality of types".

タイプ決定部２５は、注目クラスタに分類された項目名にマッチするマッチングパターンに基づいて、複数のタイプのうちから注目クラスタのタイプを決定する。 The type determination unit 25 determines the type of the attention cluster from the plurality of types based on the matching pattern matching the item name classified into the attention cluster.

「項目名にマッチするマッチングパターン」とは、複数のタイプにそれぞれ関連付けられた複数のマッチングパターンのうち、注目クラスタに分類された項目名が満足するマッチングパターンである。
例えば、マッチングパターンがキーワードと正規表現を複数含んでいる場合には、注目クラスタに含まれる所定数の項目名が、上記のマッチングパターンに含まれるキーワードと正規表現のうちいずれかを満足する場合に、注目クラスタに属する項目名が、上記のマッチングパターンを満足しているものと判定される。 A "matching pattern that matches the item name" is, among the plurality of matching patterns respectively associated with the plurality of types, the matching pattern that the item names classified into the cluster of interest satisfy.
For example, when a matching pattern contains multiple keywords and regular expressions, the item names belonging to the cluster of interest are determined to satisfy that matching pattern if a predetermined number of item names in the cluster satisfy any one of the keywords or regular expressions included in the pattern.

タイプ決定部２５は、主に標準項目名設定装置１０のプロセッサ１１及び記憶装置１２により実現される。
具体的には、プロセッサ１１は、複数のクラスタの中から１つのクラスタを注目クラスタとして設定し、以下の処理を実行する。
まず、プロセッサ１１は、マッチングパターン記憶部２４に記憶されるマッチングパターンテーブルＴ１に記憶される項目タイプに関連付けられるマッチングパターンのうち、マッチングパターンが注目クラスタに分類された項目名とマッチするものを検索する。そして、プロセッサ１１は、上記検索されたマッチングパターンに対応する項目タイプを、注目クラスタの項目タイプとする。 The type determination unit 25 is mainly realized by the processor 11 and the storage device 12 of the standard item name setting device 10.
Specifically, the processor 11 sets one cluster out of a plurality of clusters as a cluster of interest, and executes the following processing.
First, among the matching patterns associated with the item types stored in the matching pattern table T1 in the matching pattern storage unit 24, the processor 11 searches for a matching pattern that matches the item names classified into the cluster of interest. Then, the processor 11 sets the item type corresponding to the retrieved matching pattern as the item type of the cluster of interest.
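The type determination for one cluster of interest might look like the following Python sketch. It assumes a keyword-only excerpt of the table, takes the first type that reaches the threshold, and uses a hypothetical parameter `min_matches` for the "predetermined number" of item names; none of these details are fixed by the text.

```python
from typing import Dict, List, Optional

# Hypothetical keyword-only excerpt of matching pattern table T1.
KEYWORDS_BY_TYPE: Dict[str, List[str]] = {
    "name": ["氏名", "ふりがな", "名称"],
    "address": ["住所", "所在地"],
}


def determine_cluster_type(
    item_names: List[str],
    keywords_by_type: Dict[str, List[str]],
    min_matches: int,
) -> Optional[str]:
    """Return the first type whose keywords match at least `min_matches` item names.

    Returns None when no type reaches the threshold.
    """
    for item_type, keywords in keywords_by_type.items():
        hits = sum(1 for name in item_names if any(k in name for k in keywords))
        if hits >= min_matches:
            return item_type
    return None


# A cluster of interest in which three of four item names carry a "name" keyword.
cluster = ["申請者の氏名", "妊婦の氏名", "申請者のふりがな", "受給者の変更前の住所"]
cluster_type = determine_cluster_type(cluster, KEYWORDS_BY_TYPE, min_matches=2)
```

Choosing the type with the most hits rather than the first one above the threshold would be an equally plausible reading.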

［項目名候補生成部２６の説明］
項目名候補生成部２６は、タイプ決定部により決定したタイプに対応するルールに基づいて、注目クラスタに分類された項目名を共通の文字列を有する項目名からなるサブクラスタに細分化するとともに、当該サブクラスタに属する項目名について前記共通の文字列以外の文字列に基づく複数の項目名候補を生成する。 [Explanation of item name candidate generation unit 26]
The item name candidate generation unit 26 subdivides the item names classified into the attention cluster into subclusters consisting of item names having a common character string based on the rules corresponding to the types determined by the type determination unit. For item names belonging to the subcluster, a plurality of item name candidates based on character strings other than the common character string are generated.

「タイプに対応するルール」とは、タイプに対応付けて定められた注目クラスタの細分化ルールである。例えば、上記の細分化ルールは、マッチングパターンに定められたキーワードや正規表現に該当する文字列以外の文字列（例えば修飾語）ごとに、注目クラスタを細分化するというルールとしてよい。
具体的には、タイプ「ｎａｍｅ」の注目クラスタに分類された項目名において、「氏名」、「ふりがな」、「名称」といったキーワード以外の文字列として、修飾語「１」（申請者の＊）、修飾語「２」（妊婦の＊）、修飾語「３」（受給者の変更前の＊）という３つの修飾語が抽出されたとする。この場合に、項目名候補生成部２６は、注目クラスタを修飾語「１」〜修飾語「３」をそれぞれ含む項目名からなるサブクラスタ「１」〜サブクラスタ「３」に細分化する。 The "rule corresponding to the type" is a subdivision rule of the cluster of interest defined in association with the type. For example, the above subdivision rule may be a rule that subdivides the attention cluster for each character string (for example, a modifier) other than the character string corresponding to the keyword or regular expression defined in the matching pattern.
Specifically, suppose that, in the item names classified into a cluster of interest of type "name", three modifiers are extracted as character strings other than keywords such as 「氏名」, 「ふりがな」, and 「名称」: modifier "1" ("applicant's *"), modifier "2" ("pregnant woman's *"), and modifier "3" ("recipient's * before change"). In this case, the item name candidate generation unit 26 subdivides the cluster of interest into subcluster "1" to subcluster "3", which consist of the item names containing modifier "1" to modifier "3", respectively.

「共通の文字列」とは、注目クラスタに分類された項目名において、注目クラスタのタイプを表す文字列以外の文字列であって、複数の項目名において共通する文字列である。
例えば、クラスタに分類された項目名における修飾語が「共通の文字列」に相当する。 The "common character string" is a character string other than the character string representing the type of the attention cluster in the item name classified into the attention cluster, and is a character string common to a plurality of item names.
For example, the modifier in the item name classified into the cluster corresponds to the "common character string".

「サブクラスタ」とは、上記の細分化ルールによって細分化された注目クラスタである。すなわち、注目クラスタを細分化した集合の各々が上記の「サブクラスタ」となる。 The "sub-cluster" is a cluster of interest subdivided by the above subdivision rule. That is, each of the subdivided sets of the clusters of interest is the above-mentioned "subcluster".

「サブクラスタに属する項目名」とは、サブクラスタに振り分けられた項目名である。すなわち、集合としてのサブクラスタの要素が上記の「サブクラスタに属する項目名」となる。 The "item name belonging to the subcluster" is an item name assigned to the subcluster. That is, the element of the subcluster as a set becomes the above-mentioned "item name belonging to the subcluster".

「共通の文字列以外の文字列」とは、サブクラスタに振り分けられた項目名において、サブクラスタの細分化の際に用いられた文字列（例えば修飾語）以外の文字列である。
例えば、修飾語により修飾される被修飾語が上記の「共通の文字列以外の文字列」に相当する。 The "character string other than the common character string" is a character string other than the character string (for example, a modifier) used when subdividing the subcluster in the item name assigned to the subcluster.
For example, the modified word modified by the modifier corresponds to the above-mentioned "character string other than the common character string".

「項目名候補」とは、サブクラスタに分類された項目名を、第１文字列（共通の文字列）、第２文字列（共通の文字列以外の文字列）として構成した場合における、第１文字列と第２文字列の組み合わせの候補である。すなわち、サブクラスタに分類された項目名の中に、第１文字列が共通し、第２文字列が異なる項目名があるとすると、これらの組み合わせが、上記の「項目名候補」となる。なお、第２文字列のバリエーションを、上記の「項目名候補」としてもよい。 An "item name candidate" is a candidate combination of a first character string and a second character string, where each item name classified into a subcluster is regarded as consisting of a first character string (the common character string) and a second character string (a character string other than the common character string). That is, if the item names classified into a subcluster include item names that share the first character string but differ in the second character string, these combinations are the "item name candidates". The variations of the second character string alone may also be treated as the "item name candidates".

項目名候補生成部２６は、注目クラスタに分類された項目名のうち、注目クラスタのタイプに対応付けて記憶されたマッチングパターンに該当する文字列以外から共通の文字列を設定する。 The item name candidate generation unit 26 sets a common character string from the item names classified into the attention clusters other than the character strings corresponding to the matching pattern stored in association with the type of the attention cluster.

「注目クラスタのタイプに対応付けて記憶されたマッチングパターンに該当する文字列」とは、注目クラスタのタイプに対応するマッチングパターンが満足する文字列である。すなわち、マッチングパターンが複数のキーワードを含む場合には、それらの複数のキーワードが上記の「注目クラスタのタイプに対応付けて記憶されたマッチングパターンに該当する文字列」に相当する。 The "character string corresponding to the matching pattern stored in association with the type of the attention cluster" is a character string to which the matching pattern corresponding to the type of the attention cluster is satisfied. That is, when the matching pattern includes a plurality of keywords, those plurality of keywords correspond to the above-mentioned "character string corresponding to the matching pattern stored in association with the type of the cluster of interest".

項目名候補生成部２６は、主に標準項目名設定装置１０のプロセッサ１１及び記憶装置１２により実現される。
具体的には、プロセッサ１１は、注目クラスタに含まれる項目名を、注目クラスタのタイプに応じたマッチングパターンに該当する文字列以外の修飾語に応じてサブクラスタにグループ化する。すなわち、プロセッサ１１は、注目クラスタに含まれる項目名を、修飾語ごとにサブクラスタに分ける。
次に、プロセッサ１１は、サブクラスタに分けられた項目名について、サブクラスタに対応する修飾語以外で相違する文字列を抽出し、当該抽出した文字列に基づいて項目名候補を生成する。 The item name candidate generation unit 26 is mainly realized by the processor 11 and the storage device 12 of the standard item name setting device 10.
Specifically, the processor 11 groups the item names included in the attention cluster into subclusters according to modifiers other than the character string corresponding to the matching pattern according to the type of the attention cluster. That is, the processor 11 divides the item names included in the cluster of interest into subclusters for each modifier.
Next, for the item names assigned to each subcluster, the processor 11 extracts the character strings that differ among them, excluding the modifier corresponding to the subcluster, and generates item name candidates based on the extracted character strings.
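The subdivision and candidate generation can be sketched in Python as below. The sketch assumes that the modifier is whatever precedes a trailing keyword of the cluster's type, which is a simplification introduced here; the function names are hypothetical, and a real implementation would likely normalize variants such as 「申請者氏名」 and 「申請者の氏名」 before grouping.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

NAME_KEYWORDS = ["氏名", "ふりがな", "名称"]  # keywords given for type "name"


def split_item_name(item_name: str, keywords: List[str]) -> Optional[Tuple[str, str]]:
    """Split an item name into (modifier, keyword) when it ends with a keyword."""
    for keyword in keywords:
        if item_name.endswith(keyword):
            return item_name[: -len(keyword)], keyword
    return None


def subdivide(item_names: List[str], keywords: List[str]) -> Dict[str, List[str]]:
    """Group item names into subclusters keyed by their modifier."""
    subclusters: Dict[str, List[str]] = defaultdict(list)
    for name in item_names:
        parts = split_item_name(name, keywords)
        if parts is not None:
            subclusters[parts[0]].append(name)
    return dict(subclusters)


def candidates(subcluster: List[str]) -> List[str]:
    """List the distinct item names of a subcluster as item name candidates."""
    return sorted(set(subcluster))


cluster = ["申請者の氏名", "申請者の名称", "申請者の氏名", "妊婦の氏名", "受給者の変更前の氏名"]
sub = subdivide(cluster, NAME_KEYWORDS)
applicant_candidates = candidates(sub["申請者の"])
```

Here the cluster splits into three subclusters keyed by 「申請者の」, 「妊婦の」, and 「受給者の変更前の」, and the first of these yields two candidates that differ only in the keyword part.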

［標準項目名設定部２７の説明］
標準項目名設定部２７は、サブクラスタの複数の項目名候補の中から所定の基準に基づいて選択した項目名候補を、サブクラスタに対応する標準項目名として設定する。 [Explanation of standard item name setting unit 27]
The standard item name setting unit 27 sets an item name candidate selected based on a predetermined criterion from a plurality of item name candidates of the subcluster as a standard item name corresponding to the subcluster.

「所定の基準」とは、サブクラスタの複数の項目名候補の評価基準である。例えば、上記の評価基準は、サブクラスタにおける出現頻度としてよい。すなわち、標準項目名設定部２７は、サブクラスタの複数の項目名候補のうち、サブクラスタにおいて最も出現頻度が高い項目名候補を、サブクラスタに対応する標準項目名として設定することとする。 The "predetermined standard" is an evaluation standard for a plurality of item name candidates of a subcluster. For example, the above evaluation criteria may be the frequency of occurrence in subclusters. That is, the standard item name setting unit 27 sets the item name candidate having the highest frequency of appearance in the subcluster as the standard item name corresponding to the subcluster among the plurality of item name candidates of the subcluster.

「サブクラスタに対応する標準項目名」とは、サブクラスタから導き出された標準項目名である。すなわち、サブクラスタに対応する第１文字列（共通文字列）と、サブクラスタにおいて最も出現頻度が大きい第２文字列とを組み合わせた項目名が、サブクラスタに対応する標準項目名となる。
ここで、標準項目名とは、種類の異なる帳票において意味、用途が対応する項目名について、標準的に用いる項目名である。 The "standard item name corresponding to the subcluster" is the standard item name derived from the subcluster. That is, the item name that combines the first character string (common character string) corresponding to the subcluster with the second character string that appears most frequently in the subcluster is the standard item name corresponding to the subcluster.
Here, the standard item name is an item name used as standard for an item name corresponding to the meaning and use in different types of forms.

標準項目名設定部２７は、サブクラスタについての複数の項目名候補のうち、サブクラスタにおける出現頻度が最も高い項目名候補を標準項目名として設定する。 The standard item name setting unit 27 sets the item name candidate having the highest frequency of appearance in the subcluster as the standard item name among the plurality of item name candidates for the subcluster.

「出現頻度」とは、サブクラスタに属する項目名のうち、項目名候補の文字列を含む項目名の数である。 The "occurrence frequency" is the number of item names including the character string of the item name candidate among the item names belonging to the subcluster.

標準項目名設定部２７は、主に標準項目名設定装置１０のプロセッサ１１、記憶装置１２及び通信用インターフェース１３により実現される。
具体的には、プロセッサ１１は、サブクラスタについて項目名候補生成部２６で生成した項目名候補のそれぞれの出現頻度を計数し、出現頻度が最大の項目名候補を、標準項目名として設定する。以下、プロセッサ１１による標準項目名設定処理の具体例について、図８を参照しながら説明する。 The standard item name setting unit 27 is mainly realized by the processor 11, the storage device 12, and the communication interface 13 of the standard item name setting device 10.
Specifically, the processor 11 counts the appearance frequency of each of the item name candidates generated by the item name candidate generation unit 26 for the subcluster, and sets the item name candidate having the highest appearance frequency as the standard item name. Hereinafter, a specific example of the standard item name setting process by the processor 11 will be described with reference to FIG.

図８には、プロセッサ１１により生成される項目名候補テーブルＴ２の一例を示した。図８に示されるように、プロセッサ１１は、項目名候補（ここでは項目名候補「１」〜項目名候補「３」とする）のそれぞれの出現頻度を計数する。
そして、プロセッサ１１は、出現頻度が最大の項目名候補「１」を標準項目名に設定する。図８に示す例では、項目名候補「１」について標準項目フラグを真（Ｔ）にそれ以外の候補については標準項目フラグを偽（Ｆ）に設定する。
なお、プロセッサ１１は、項目名候補「３」及び「２」についても、標準項目名を代替する第１候補、第２候補に設定してもよい。 FIG. 8 shows an example of the item name candidate table T2 generated by the processor 11. As shown in FIG. 8, the processor 11 counts the appearance frequency of each of the item name candidates (here, item name candidate “1” to item name candidate “3”).
Then, the processor 11 sets the item name candidate "1" having the highest appearance frequency as the standard item name. In the example shown in FIG. 8, the standard item flag is set to true (T) for the item name candidate “1”, and the standard item flag is set to false (F) for the other candidates.
The processor 11 may also set the item name candidates "3" and "2" as the first candidate and the second candidate that substitute for the standard item name.
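A table like T2 could be built as in the Python sketch below. It follows the definition of appearance frequency given above (the number of item names in the subcluster that contain the candidate's character string); the row layout, the field names, and the tie-breaking by first appearance are assumptions made for illustration.

```python
from typing import Dict, List


def build_candidate_table(subcluster: List[str]) -> List[Dict[str, object]]:
    """Count each candidate and flag the most frequent one as the standard item name."""
    distinct = list(dict.fromkeys(subcluster))  # distinct names, first-seen order
    rows: List[Dict[str, object]] = [
        {
            "candidate": c,
            # Appearance frequency: item names that contain the candidate string.
            "frequency": sum(1 for name in subcluster if c in name),
        }
        for c in distinct
    ]
    # Stable sort, so candidates with equal frequency keep first-seen order.
    rows.sort(key=lambda row: row["frequency"], reverse=True)
    for rank, row in enumerate(rows):
        row["standard"] = rank == 0  # true (T) for the top row, false (F) otherwise
    return rows


subcluster = [
    "申請者の氏名", "申請者の名称", "申請者の氏名",
    "申請者の氏名", "申請者のふりがな", "申請者の名称",
]
table = build_candidate_table(subcluster)
```

The rows after the first are already ordered by frequency, so they can serve directly as the first and second alternative candidates.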

また、標準項目名設定装置１０は、複数のクラスタのそれぞれについて、タイプ決定部２５と、項目名候補生成部２６と、標準項目名設定部２７による処理を実行して、複数の標準項目名を設定する。
すなわち、標準項目名設定装置１０は、複数のクラスタうちから１つを注目クラスタに順次設定して、設定した注目クラスタに基づき標準項目名の設定処理を実行することとする。 Further, the standard item name setting device 10 sets a plurality of standard item names by executing the processing of the type determination unit 25, the item name candidate generation unit 26, and the standard item name setting unit 27 for each of the plurality of clusters.
That is, the standard item name setting device 10 sequentially sets one of the plurality of clusters to the attention cluster, and executes the standard item name setting process based on the set attention cluster.

［標準項目名設定装置１０による処理の流れ］
次に、図９及び図１０を参照しながら、標準項目名設定装置１０により実行される処理の流れについて説明する。 [Process flow by standard item name setting device 10]
Next, the flow of processing executed by the standard item name setting device 10 will be described with reference to FIGS. 9 and 10.

［項目名のクラスタリング処理］
まず、図９に示すフロー図を参照しながら、項目名のクラスタリング処理の流れについて説明する。なお、以下においては、標準項目名設定装置１０は、項目名Ｉ_１〜項目名Ｉ_Ｎを含む項目名リスト６０を帳票処理装置３０から取得していることとする。なお、Ｎは項目名の数とする。 [Item name clustering process]
First, the flow of the item name clustering process will be described with reference to the flowchart shown in FIG. 9. In the following, it is assumed that the standard item name setting device 10 has acquired, from the form processing device 30, an item name list 60 containing item names I_1 to I_N, where N is the number of item names.

図９に示されるように、標準項目名設定装置１０は、変数ｉを１に初期化して（Ｓ１）、項目名Ｉ_ｉを、形態素解析により単語に分解する（Ｓ２）。
そして、標準項目名設定装置１０は、項目名Ｉ_ｉを構成する単語のベクトルを合成して、項目名Ｉ_ｉの特徴ベクトルＶ_ｉを生成する（Ｓ３）。 As shown in FIG. 9, the standard item name setting device 10 initializes the variable i to 1 (S1) and decomposes the item name I_i into words by morphological analysis (S2).
The standard item name setting device 10 then combines the vectors of the words constituting the item name I_i to generate the feature vector V_i of the item name I_i (S3).

ここで、変数ｉがＮに達していない場合には（Ｓ４：Ｎｏ）、標準項目名設定装置１０は変数ｉに１を加算して（Ｓ５）、Ｓ２に戻る。 Here, when the variable i has not reached N (S4: No), the standard item name setting device 10 adds 1 to the variable i (S5) and returns to S2.

一方、変数ｉがＮに達している場合には（Ｓ４：Ｙｅｓ）、標準項目名設定装置１０は、特徴ベクトルＶ_１〜特徴ベクトルＶ_Ｎの類似度を計算する（Ｓ６）。
そして、標準項目名設定装置１０は計算した類似度に基づいて、特徴ベクトルＶ_１〜特徴ベクトルＶ_ＮをクラスタＣ_１〜Ｃ_Ｍにクラスタリングする（Ｓ７）。ここで、Ｍはクラスタ数とする。
以上の処理により、項目名Ｉ_１〜項目名Ｉ_Ｎを意味のまとまりに基づいて分類することができる。 On the other hand, when the variable i has reached N (S4: Yes), the standard item name setting device 10 calculates the similarities among the feature vectors V_1 to V_N (S6).
Then, based on the calculated similarities, the standard item name setting device 10 clusters the feature vectors V_1 to V_N into clusters C_1 to C_M (S7). Here, M is the number of clusters.
Through the above processing, the item names I_1 to I_N can be classified into groups of related meaning.

［標準項目名設定処理］
次に、図１０に示すフロー図を参照しながら、標準項目名設定処理の流れについて説明する。以下に説明する処理は、上記説明したクラスタリング処理に続いて行われる処理である。 [Standard item name setting process]
Next, the flow of the standard item name setting process will be described with reference to the flow chart shown in FIG. The process described below is a process performed after the clustering process described above.

図１０に示されるように、標準項目名設定装置１０は、変数ｊと変数ｌをそれぞれ１に初期化して（Ｓ１１）、クラスタＣ_ｊを注目クラスタに設定する（Ｓ１２）。 As shown in FIG. 10, the standard item name setting device 10 initializes the variables j and l to 1 (S11) and sets the cluster C_j as the cluster of interest (S12).

次に、標準項目名設定装置１０は、注目クラスタであるクラスタＣ_ｊの項目タイプを決定する（Ｓ１３）。このクラスタＣ_ｊの項目タイプの決定処理は、タイプ決定部２５により実行されるものである。 Next, the standard item name setting device 10 determines the item type of the cluster C_j, which is the cluster of interest (S13). This item type determination for the cluster C_j is executed by the type determination unit 25.

次に、標準項目名設定装置１０は、クラスタＣ_ｊを項目タイプに応じてサブクラスタＣ_ｊ１〜Ｃ_ｊＬに細分化する（Ｓ１４）。ここで、Ｌは、クラスタＣ_ｊのサブクラスタの数とする。 Next, the standard item name setting device 10 subdivides the cluster C_j into subclusters C_j1 to C_jL according to the item type (S14). Here, L is the number of subclusters of the cluster C_j.

次に、標準項目名設定装置１０は、サブクラスタＣ_ｊｌに分類される項目名に基づいて、項目名候補を生成する（Ｓ１５）。そして、標準項目名設定装置１０は、項目名候補についての出現頻度を計数し（Ｓ１６）、出現頻度が最も高い項目名候補を標準項目名に設定する（Ｓ１７）。 Next, the standard item name setting device 10 generates item name candidates based on the item names classified into the subcluster C_jl (S15). Then, the standard item name setting device 10 counts the appearance frequency of each item name candidate (S16) and sets the item name candidate with the highest appearance frequency as the standard item name (S17).

ここで、変数ｌがＬに達していない場合には（Ｓ１８：Ｎｏ）、標準項目名設定装置１０は、変数ｌに１を加算して（Ｓ１９）、Ｓ１５に戻る。 Here, when the variable l has not reached L (S18: No), the standard item name setting device 10 adds 1 to the variable l (S19) and returns to S15.

一方、変数ｌがＬに達している場合には（Ｓ１８：Ｙｅｓ）、標準項目名設定装置１０は、さらに変数ｊがＭに達しているか否かを判定する（Ｓ２０）。
ここで、変数ｊがＭに達していない場合には（Ｓ２０：Ｎｏ）、標準項目名設定装置１０は、変数ｊに１を加算して（Ｓ２１）、Ｓ１２に戻る。 On the other hand, when the variable l has reached L (S18: Yes), the standard item name setting device 10 further determines whether or not the variable j has reached M (S20).
Here, when the variable j has not reached M (S20: No), the standard item name setting device 10 adds 1 to the variable j (S21) and returns to S12.

一方、変数ｊがＭに達している場合には（Ｓ２０：Ｙｅｓ）、標準項目名設定装置１０は、以上の処理で設定した標準項目名のデータ（標準項目データ）を出力して（Ｓ２２）、処理を終了する。
Ｓ２２において、標準項目名設定装置１０は、例えば標準項目名とその代替候補の情報を纏めたデータを上記の標準項目データとして、帳票処理装置３０に送信することとしてよい。 On the other hand, when the variable j has reached M (S20: Yes), the standard item name setting device 10 outputs the data of the standard item names set in the above processing (standard item data) (S22) and ends the process.
In S22, the standard item name setting device 10 may transmit, for example, data summarizing the information of the standard item name and its alternative candidate to the form processing device 30 as the above standard item data.
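The control flow of S11 to S22 can be outlined in Python as a pair of nested loops. The three callables are hypothetical stand-ins for the type determination unit 25, the item name candidate generation unit 26, and the standard item name setting unit 27. The sketch also assumes that the subcluster loop starts over for every cluster, which is one reading of the flow.

```python
from collections import Counter
from typing import Callable, Dict, List


def set_standard_item_names(
    clusters: List[List[str]],
    determine_type: Callable[[List[str]], str],
    subdivide: Callable[[List[str], str], List[List[str]]],
    select_standard: Callable[[List[str]], str],
) -> Dict[int, List[str]]:
    """Walk clusters C_1..C_M and their subclusters, following S11 to S22."""
    standard_item_data: Dict[int, List[str]] = {}
    for j, cluster in enumerate(clusters, start=1):   # S12, S20, S21
        item_type = determine_type(cluster)           # S13
        subclusters = subdivide(cluster, item_type)   # S14
        names = []
        for subcluster in subclusters:                # S18, S19
            names.append(select_standard(subcluster)) # S15 to S17
        standard_item_data[j] = names
    return standard_item_data                         # S22


# Toy stand-ins: one cluster, subdivided by a leading "a" or "b" prefix.
result = set_standard_item_names(
    clusters=[["a x", "a x", "a y", "b x"]],
    determine_type=lambda cluster: "name",
    subdivide=lambda cluster, item_type: [
        [n for n in cluster if n.startswith(p)] for p in ("a", "b")
    ],
    select_standard=lambda sub: Counter(sub).most_common(1)[0][0],
)
```

Passing the three steps in as functions keeps the loop structure visible without committing to any particular implementation of each unit.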

［まとめ］
標準項目名設定装置１０は、複数の帳票に記載された複数の項目名を取得する項目名取得部２１と、複数の項目名を複数のクラスタのいずれかに分類するクラスタリング部２３と、複数のクラスタのうちの注目クラスタに分類された項目名に基づいて、当該注目クラスタのタイプを複数のタイプの中から決定するタイプ決定部２５と、タイプ決定部２５により決定したタイプに対応するルールに基づいて、注目クラスタに分類された項目名を共通の文字列を有する項目名からなるサブクラスタに細分化するとともに、当該サブクラスタに属する項目名について共通の文字列以外の文字列（例えば修飾語）に基づく複数の項目名候補を生成する項目名候補生成部２６と、サブクラスタの複数の項目名候補の中から所定の基準に基づいて選択した項目名候補を、サブクラスタに対応する標準項目名として設定する標準項目名設定部２７と、を備える。 [summary]
The standard item name setting device 10 includes: an item name acquisition unit 21 that acquires a plurality of item names described in a plurality of forms; a clustering unit 23 that classifies the plurality of item names into one of a plurality of clusters; a type determination unit 25 that determines the type of a cluster of interest among the plurality of clusters from among a plurality of types, based on the item names classified into that cluster of interest; an item name candidate generation unit 26 that, based on a rule corresponding to the type determined by the type determination unit 25, subdivides the item names classified into the cluster of interest into subclusters each consisting of item names having a common character string, and generates, for the item names belonging to each subcluster, a plurality of item name candidates based on character strings other than the common character string (for example, a modifier); and a standard item name setting unit 27 that sets an item name candidate, selected from the plurality of item name candidates of the subcluster based on a predetermined criterion, as the standard item name corresponding to the subcluster.

標準項目名設定装置１０によれば、複数の帳票において対応する項目名に対して１つの標準的な項目名を設定することができる。これにより、対応する複数の項目名を１つの項目名にまとめる労力を軽減できる。 According to the standard item name setting device 10, one standard item name can be set for the corresponding item name in a plurality of forms. As a result, it is possible to reduce the labor of combining a plurality of corresponding item names into one item name.

標準項目名設定装置１０では、複数の項目名のそれぞれの特徴ベクトルを生成する特徴ベクトル生成部２２を備え、クラスタリング部２３は、複数の項目名のそれぞれの特徴ベクトルの類似度に基づいて、複数の項目名を複数のクラスタに分類する。
こうすることで、帳票に記載の互いに類似する複数の項目名に対して１つの標準的な項目名を設定することができる。これにより、複数の類似する項目名を１つの項目名にまとめる労力を軽減できる。 The standard item name setting device 10 includes a feature vector generation unit 22 that generates a feature vector for each of the plurality of item names, and the clustering unit 23 classifies the plurality of item names into the plurality of clusters based on the similarity of the feature vectors of the plurality of item names.
By doing so, one standard item name can be set for a plurality of item names similar to each other described in the form. As a result, it is possible to reduce the labor of combining a plurality of similar item names into one item name.

標準項目名設定装置１０では、学習データとしての１以上の帳票に出現する単語を機械学習した学習モデルを記憶する学習モデル記憶部２０を備え、特徴ベクトル生成部２２は、項目名を分解した各単語の学習モデルに基づくベクトルを合成して、項目名の特徴ベクトルを生成する。
こうすることで、類似する項目名をまとめて分類する精度を向上できる。 The standard item name setting device 10 includes a learning model storage unit 20 that stores a learning model obtained by machine learning of words appearing in one or more forms used as learning data, and the feature vector generation unit 22 generates the feature vector of an item name by combining vectors, based on the learning model, of the words obtained by decomposing the item name.
By doing so, it is possible to improve the accuracy of classifying similar item names together.

標準項目名設定装置１０では、複数のタイプごとに、キーワード、正規表現のうち少なくとも一方を含むマッチングパターンを対応付けて記憶したマッチングパターン記憶部２４を備え、タイプ決定部２５は、注目クラスタに分類された項目名にマッチするマッチングパターンに基づいて、複数のタイプのうちから注目クラスタのタイプを決定する。
こうすることで、クラスタのタイプの判定精度を向上できる。 The standard item name setting device 10 includes a matching pattern storage unit 24 that stores, for each of a plurality of types, an associated matching pattern including at least one of a keyword and a regular expression, and the type determination unit 25 determines the type of the cluster of interest from among the plurality of types based on the matching pattern that matches the item names classified into the cluster of interest.
By doing so, the accuracy of determining the type of cluster can be improved.

標準項目名設定装置１０では、項目名候補生成部２６は、注目クラスタに分類された項目名のうち、注目クラスタのタイプに対応付けて記憶されたマッチングパターンに該当する文字列以外から共通の文字列を設定する。
こうすることで、１つのクラスタを１以上のサブクラスタに分類する基準を簡易に定めることができる。 In the standard item name setting device 10, the item name candidate generation unit 26 sets the common character string from the portions of the item names classified into the cluster of interest other than the character strings corresponding to the matching pattern stored in association with the type of the cluster of interest.
By doing so, it is possible to easily determine the criteria for classifying one cluster into one or more subclusters.

標準項目名設定装置１０では、標準項目名設定部２７は、サブクラスタについての複数の項目名候補のうち、サブクラスタにおける出現頻度が最も高い項目名候補を標準項目名として設定する。
こうすることで、同一のサブクラスタに分類された項目名のうち、最も良く使用されている表現に基づいて標準項目名を設定できる。 In the standard item name setting device 10, the standard item name setting unit 27 sets the item name candidate having the highest frequency of appearance in the subcluster as the standard item name among the plurality of item name candidates for the subcluster.
By doing so, the standard item name can be set based on the most commonly used expression among the item names classified into the same subcluster.

The standard item name setting device 10 sets a plurality of standard item names by executing the processing of the type determination unit 25, the item name candidate generation unit 26, and the standard item name setting unit 27 for each of the plurality of clusters.
In this way, standard item names can be set for the diverse item names found in forms, which reduces the labor of setting standard item names from forms.

[Other embodiments]
The present invention is not limited to the above embodiment.
The standard item name setting device 10 and the form processing device 30 may be configured as a single device.
Further, the standard item name setting device 10 is not limited to a single computer and may be composed of a plurality of computers.

Further, the setting of keywords for a type is not limited to the above example. For example, in the matching pattern storage unit 24, keyword_and may be defined for a type in addition to keyword. In this case, the type associated with keyword_and and keyword is determined to match when the item name contains both a keyword listed in keyword_and and a keyword listed in keyword.
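The keyword_and condition can be read as a logical AND over two keyword lists. The following minimal sketch assumes a dictionary layout for a type definition and a hypothetical function name; only the keys keyword and keyword_and come from the text.

```python
from typing import Dict, List


def type_matches(item_name: str, definition: Dict[str, List[str]]) -> bool:
    """Match when the item name contains a keyword from `keyword` and,
    if `keyword_and` is defined, also a keyword from `keyword_and`."""
    has_keyword = any(k in item_name for k in definition.get("keyword", []))
    if "keyword_and" not in definition:
        return has_keyword
    has_and = any(k in item_name for k in definition["keyword_and"])
    return has_keyword and has_and


# Hypothetical type definition: a child-related item that is also a name.
child_name_type = {"keyword": ["児童", "子ども"], "keyword_and": ["氏名"]}
```

Under this reading, "児童氏名" matches the hypothetical definition while "児童住所" does not, because the latter contains no keyword from keyword_and.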

Further, keyword_common may be used instead of keyword. This is a function in which several sets of keywords are defined in advance as information common to all procedures and are referenced from the definition for each procedure.
For example, when keyword_common: child is specified and the common definition contains child.keyword: [子ども, 子供, こども, 児童], keyword_common: child is synonymous with keyword: [子ども, 子供, こども, 児童] (four spellings of "child").
In this way, various methods can be used to specify keywords.
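Resolving a keyword_common reference against the common definition might look like the sketch below. The dictionary layout and the function name are assumptions for illustration; the key names keyword, keyword_common, and child follow the example in the text.

```python
from typing import Dict, List

# Common definition shared by all procedures (layout assumed for illustration).
COMMON_DEFINITIONS: Dict[str, Dict[str, List[str]]] = {
    "child": {"keyword": ["子ども", "子供", "こども", "児童"]},
}


def resolve_keywords(
    definition: Dict[str, object], common: Dict[str, Dict[str, List[str]]]
) -> List[str]:
    """Return the keyword list for a type definition, expanding
    `keyword_common` into the referenced common keyword set."""
    if "keyword_common" in definition:
        reference = str(definition["keyword_common"])
        return list(common[reference]["keyword"])
    return list(definition.get("keyword", []))  # plain `keyword` definition


by_reference = resolve_keywords({"keyword_common": "child"}, COMMON_DEFINITIONS)
by_value = resolve_keywords(
    {"keyword": ["子ども", "子供", "こども", "児童"]}, COMMON_DEFINITIONS
)
```

Both definitions resolve to the same keyword list, which mirrors the statement that keyword_common: child is synonymous with the expanded keyword list.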

1 Information processing system
10 Standard item name setting device
11 Processor
12 Storage device
13 Communication interface
20 Learning model storage unit
21 Item name acquisition unit
22 Feature vector generation unit
23 Clustering unit
24 Matching pattern storage unit
25 Type determination unit
26 Item name candidate generation unit
27 Standard item name setting unit
30 Form processing device
40 Scanner
50A First item name
50B Second item name
50C Third item name
60 Item name list
70 Neural network (learning model)
71 Input layer
72 Hidden layer
73 Output layer
A1 Item name candidate
A2 Item name candidate
AL Item name candidate
C1 Cluster
C2 Cluster
CM Cluster
P Form
PA First form
PB Second form
PC Third form
RI Standard item name
T1 Matching pattern table
T2 Item name candidate table

Claims

A standard item name setting device comprising:
an item name acquisition unit that acquires a plurality of item names described in a plurality of forms;
a clustering unit that classifies each of the plurality of item names into one of a plurality of clusters;
a type determination unit that determines the type of an attention cluster among the plurality of clusters, from among a plurality of types, based on the item names classified into the attention cluster;
an item name candidate generation unit that, based on a rule corresponding to the type determined by the type determination unit, subdivides the item names classified into the attention cluster into subclusters each consisting of item names having a common character string, and generates, for the item names belonging to a subcluster, a plurality of item name candidates based on character strings other than the common character string; and
a standard item name setting unit that sets an item name candidate, selected from the plurality of item name candidates of the subcluster based on a predetermined criterion, as the standard item name corresponding to the subcluster.

The standard item name setting device according to claim 1, further comprising a feature vector generation unit that generates a feature vector for each of the plurality of item names,
wherein the clustering unit classifies the plurality of item names into the plurality of clusters based on the similarity of the feature vectors of the plurality of item names.

The standard item name setting device according to claim 2, further comprising a learning model storage unit that stores a learning model obtained by machine learning in which words appearing in one or more forms are used as training data,
wherein the feature vector generation unit generates the feature vector of an item name by combining the vectors, based on the learning model, of the words obtained by decomposing the item name.

The standard item name setting device according to any one of claims 1 to 3, further comprising a matching pattern storage unit that stores, for each of the plurality of types, an associated matching pattern containing at least one of a keyword and a regular expression,
wherein the type determination unit determines the type of the attention cluster from among the plurality of types based on the matching patterns that match the item names classified into the attention cluster.

The standard item name setting device according to claim 4, wherein the item name candidate generation unit sets the common character string from the portions of the item names classified into the attention cluster other than the character strings that correspond to the matching pattern stored in association with the type of the attention cluster.

The standard item name setting device according to any one of claims 1 to 5, wherein the standard item name setting unit sets, as the standard item name, the item name candidate that appears most frequently in the subcluster among the plurality of item name candidates for the subcluster.

The standard item name setting device according to any one of claims 1 to 6, wherein the processing of the type determination unit, the item name candidate generation unit, and the standard item name setting unit is executed for each of the plurality of clusters to set a plurality of standard item names.

A standard item name setting method in which a computer:
acquires a plurality of item names described in a plurality of forms;
classifies each of the plurality of item names into one of a plurality of clusters;
determines the type of an attention cluster among the plurality of clusters, from among a plurality of types, based on the item names classified into the attention cluster;
based on a rule corresponding to the determined type, subdivides the item names classified into the attention cluster into subclusters each consisting of item names having a common character string, and generates, for the item names belonging to a subcluster, a plurality of item name candidates based on character strings other than the common character string; and
sets an item name candidate, selected from the plurality of item name candidates of the subcluster based on a predetermined criterion, as the standard item name corresponding to the subcluster.

A standard item name setting program for causing a computer to function as:
an item name acquisition unit that acquires a plurality of item names described in a plurality of forms;
a clustering unit that classifies each of the plurality of item names into one of a plurality of clusters;
a type determination unit that determines the type of an attention cluster among the plurality of clusters, from among a plurality of types, based on the item names classified into the attention cluster;
an item name candidate generation unit that, based on a rule corresponding to the type determined by the type determination unit, subdivides the item names classified into the attention cluster into subclusters each consisting of item names having a common character string, and generates, for the item names belonging to a subcluster, a plurality of item name candidates based on character strings other than the common character string; and
a standard item name setting unit that sets an item name candidate, selected from the plurality of item name candidates of the subcluster based on a predetermined criterion, as the standard item name corresponding to the subcluster.