JP2007080019A

JP2007080019A - Natural language processing system, natural language processing method and natural language processing program

Info

Publication number: JP2007080019A
Application number: JP2005268034A
Authority: JP
Inventors: Shinichi Ando; 真一安藤; Kunihiko Sadamasa; 邦彦定政; Shinichi Doi; 伸一土井
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2005-09-15
Filing date: 2005-09-15
Publication date: 2007-03-29
Anticipated expiration: 2025-09-15
Also published as: JP4792885B2

Abstract

PROBLEM TO BE SOLVED: To provide a natural language processing system or the like capable of appropriately extracting the dictionary data to be shared, by appropriately selecting, for each individual user, the partner with whom dictionary data is shared.
SOLUTION: In the natural language processing system 1, a similarity degree calculation means 21 calculates a similarity degree between a first user dictionary and a second user dictionary. A registration candidate extraction means 22 extracts dictionary data that is included in the first user dictionary and not included in the second user dictionary as a registration candidate for the second user dictionary when the similarity degree is equal to or higher than a predetermined threshold. A user dictionary registration means 23 registers the dictionary data included in the registration candidate in the second user dictionary.
COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to a technology for processing natural language using a user dictionary, and more particularly to a natural language processing system and the like that can appropriately extract the data to be shared in order to update a user dictionary.

Natural language processing systems such as kana-kanji conversion, machine translation, speech recognition, and speech synthesis basically operate on the words stored in a dictionary and the linguistic information attached to those words, so it is difficult for them to properly handle words that are not stored in the dictionary, that is, unknown words. Natural language, meanwhile, changes day by day: new words are coined and known words acquire new usages, so it is difficult to store all of them in a dictionary in advance.
Conventionally, this problem has been addressed by providing a user dictionary function with which each user can individually register the words he or she needs. However, even words that many users need in common must be registered separately by each user, which results in redundant work for the users as a whole. To address this, methods of sharing the dictionary data registered by individual users among a plurality of users have been proposed.

For example, Patent Document 1 describes a method in which dictionary data that appears a certain number of times or more across a plurality of user dictionaries is extracted as a candidate for dictionary data to be shared, and is then shared by all users. With this method, however, dictionary data for words related to various organizations and fields is shared uniformly, without distinction. Dictionary data for words related to organizations and fields with many users is therefore more likely to be shared, and users who belong to minority organizations or fields cannot obtain useful dictionary data. In addition, each user ends up sharing dictionary data for unnecessary words from organizations and fields unrelated to that user, which in some cases actually lowers the accuracy of natural language processing.

As conventional approaches to this problem, Patent Document 1 and Patent Document 3 describe methods in which individual user dictionaries and dictionary data are managed in association with predetermined fields, candidates for dictionary data to be shared are extracted for each field, and the data is shared on a per-field basis.
Patent Document 2 describes a method in which individual user dictionaries and dictionary data are managed in association with predetermined organizations, candidates for dictionary data to be shared are extracted for each organization, and the data is shared on a per-organization basis.
These methods try to prevent dictionary data from being shared inappropriately across organizations and fields by limiting sharing partners to those within the same organization or field.

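Patent Document 1: Japanese Patent No. 3464881
Patent Document 2: Japanese Patent No. 3556425
Patent Document 3: Japanese Unexamined Patent Application Publication No. 2003-157257 (JP2003-157257)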

However, although the definition of an organization or field differs depending on each user's viewpoint and changes over time, the conventional techniques described above presuppose that organizations and fields are handled with a predetermined, fixed structure.
As a result, they cannot flexibly accommodate differences in viewpoint among individual users or changes over time, and necessary dictionary data and inappropriate dictionary data are extracted together as dictionary data to be shared.

Accordingly, an object of the present invention is to provide a natural language processing system and the like that can appropriately extract the dictionary data to be shared by appropriately selecting, for each individual user, the partner with whom dictionary data should be shared.

In the natural language processing system of the present invention, similarity calculation means calculates the similarity between a first user dictionary (the user dictionary of the partner with whom dictionary data is shared) and a second user dictionary (the user dictionary in which dictionary data is to be registered). Registration candidate extraction means extracts, when the similarity is equal to or greater than a predetermined threshold, dictionary data that is included in the first user dictionary and not included in the second user dictionary as a registration candidate for the second user dictionary. User dictionary registration means registers the dictionary data included in the registration candidates in the second user dictionary (claims 1 to 6).

According to the above natural language processing system, the registration candidate extraction means uses the similarity between user dictionaries as the criterion for selecting the user dictionary from which registration candidates are extracted, that is, the user dictionary of the partner with whom dictionary data should be shared. If the similarity is low, no registration candidates are extracted from the first user dictionary even if the user of the first user dictionary and the user of the second user dictionary belong to the same organization or field.
Therefore, the partner user dictionary with which dictionary data is shared can be selected appropriately for each individual user.

In the above natural language processing system, the similarity may be calculated based on the ratio between the total number of dictionary data items registered in the first user dictionary and the second user dictionary (the number of items in the first user dictionary plus the number of items in the second user dictionary, minus the number of items registered in both dictionaries) and the number of dictionary data items registered in common in the first user dictionary and the second user dictionary (claim 2).
In this way, a user dictionary that shares many dictionary data items can be selected as a partner for sharing dictionary data.
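The ratio described for claim 2 can be illustrated with a short Python sketch. The text only says the similarity is "based on the ratio", so the direction used here (common items divided by total items, so that a larger value means more similar) is an assumption, as are the function name `dictionary_similarity` and the representation of a user dictionary as a set of (word, part of speech) tuples.

```python
def dictionary_similarity(first: set, second: set) -> float:
    """Ratio of shared items to the total number of distinct items (assumed direction)."""
    common = len(first & second)
    # Total as defined in the text: |first| + |second| - |common|
    total = len(first) + len(second) - common
    return common / total if total else 0.0


# Hypothetical dictionaries: each item is (word, part of speech)
dict_a = {("nlp", "noun"), ("tokenizer", "noun"), ("parse", "verb")}
dict_b = {("nlp", "noun"), ("tokenizer", "noun"), ("corpus", "noun")}

print(dictionary_similarity(dict_a, dict_b))  # 2 common items out of 4 distinct items
```

Two empty dictionaries are given a similarity of 0.0 here to avoid division by zero; the source does not say how that case should be handled.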

In the above natural language processing system, the similarity may be calculated based on a processing target similarity, which is the similarity between the targets of natural language processing performed in the past using the first user dictionary and the targets of natural language processing performed in the past using the second user dictionary (claim 3).
In this way, the user dictionary of a user who handles similar processing targets can be selected as a sharing partner.

In the above natural language processing system, the dictionary data may include classification information. The similarity may then be calculated based on a dictionary data set similarity, which is the similarity between a first dictionary data set consisting of the dictionary data in the first user dictionary that has the same classification information and a second dictionary data set consisting of the dictionary data in the second user dictionary that has the same classification information, and dictionary data that is included in the first dictionary data set and not included in the second dictionary data set may be extracted as registration candidates (claim 4).
In this way, the source of registration candidates can be selected with the type of dictionary data taken into account, so the sharing partner for dictionary data can be selected more appropriately than when the entire user dictionary is used as the source of registration candidates.

In the above natural language processing system, the dictionary data set similarity may be calculated based on the ratio between the total number of dictionary data items registered in the first dictionary data set and the second dictionary data set and the number of dictionary data items registered in common in the first dictionary data set and the second dictionary data set (claim 5).
In this way, a dictionary data set that shares many dictionary data items can be selected as a partner for sharing dictionary data.

In the above natural language processing system, the dictionary data set similarity may be calculated based on the similarity between the targets of natural language processing performed in the past using the first dictionary data set and the targets of natural language processing performed in the past using the second dictionary data set (claim 6).
In this way, with the dictionary data set as the unit, part of the user dictionary of a user who handles similar processing targets can be selected as a sharing partner.

In the natural language processing method of the present invention, a first user dictionary and a second user dictionary are read from a storage device and the similarity between the first user dictionary and the second user dictionary is calculated; when the similarity is equal to or greater than a predetermined threshold, dictionary data that is included in the first user dictionary and not included in the second user dictionary is extracted as a registration candidate for the second user dictionary and the extracted candidate is recorded in the storage device; and the registration candidate is read from the storage device and the dictionary data included in it is registered in the second user dictionary (claim 7).

According to the above natural language processing method, the similarity between user dictionaries is used as the criterion for selecting the user dictionary from which registration candidates are extracted, that is, the user dictionary of the partner with whom dictionary data should be shared. If the similarity is low, no registration candidates are extracted from the first user dictionary even if the user of the first user dictionary and the user of the second user dictionary belong to the same organization or field.
Therefore, the partner user dictionary with which dictionary data is shared can be selected appropriately for each individual user.

The natural language processing program of the present invention causes a computer to execute: a similarity calculation function that reads a first user dictionary and a second user dictionary from a storage device and calculates the similarity between the first user dictionary and the second user dictionary; a registration candidate extraction function that, when the similarity is equal to or greater than a predetermined threshold, extracts dictionary data that is included in the first user dictionary and not included in the second user dictionary as a registration candidate for the second user dictionary and records the extracted candidate in the storage device; and a user dictionary registration function that reads the registration candidate from the storage device and registers the dictionary data included in it in the second user dictionary (claims 8 to 13).

According to the above natural language processing program, the computer is made to select, using the similarity between user dictionaries as the criterion, the user dictionary from which registration candidates are extracted, that is, the user dictionary of the partner with whom dictionary data should be shared. If this similarity is low, the computer does not extract registration candidates from the first user dictionary even if the user of the first user dictionary and the user of the second user dictionary belong to the same organization or field.
Therefore, the computer can be operated as a natural language processing system that appropriately selects, for each individual user, the partner user dictionary with which dictionary data is shared.

In the above program, the similarity may be calculated based on the ratio between the total number of dictionary data items registered in the first user dictionary and the second user dictionary and the number of dictionary data items registered in common in the first user dictionary and the second user dictionary (claim 9).
In this way, a user dictionary that shares many dictionary data items can be selected as a partner for sharing dictionary data.

In the above program, the similarity may be calculated based on a processing target similarity, which is the similarity between the targets of natural language processing performed in the past using the first user dictionary and the targets of natural language processing performed in the past using the second user dictionary (claim 10).
In this way, the user dictionary of a user who handles similar processing targets can be selected as a sharing partner.

In the above program, the dictionary data may include classification information. The similarity may then be calculated based on a dictionary data set similarity, which is the similarity between a first dictionary data set consisting of the dictionary data in the first user dictionary that has the same classification information and a second dictionary data set consisting of the dictionary data in the second user dictionary that has the same classification information, and dictionary data that is included in the first dictionary data set and not included in the second dictionary data set may be extracted as registration candidates (claim 11).
In this way, the source of registration candidates can be selected with the type of dictionary data taken into account, so the sharing partner for dictionary data can be selected more appropriately than when the entire user dictionary is used as the source of registration candidates.

In the above program, the dictionary data set similarity may be calculated based on the ratio between the total number of dictionary data items registered in the first dictionary data set and the second dictionary data set and the number of dictionary data items registered in common in the first dictionary data set and the second dictionary data set (claim 12).
In this way, a dictionary data set that shares many dictionary data items can be selected as a partner for sharing dictionary data.

In the above program, the dictionary data set similarity may be calculated based on a processing target similarity, which is the similarity between the targets of natural language processing performed in the past using the first dictionary data set and the targets of natural language processing performed in the past using the second dictionary data set (claim 13).
In this way, with the dictionary data set as the unit, part of the user dictionary of a user who handles similar processing targets can be selected as a sharing partner.

According to the present invention, the similarity between user dictionaries is used as the criterion for selecting the user dictionary from which registration candidates are extracted, that is, the user dictionary of the partner with whom dictionary data should be shared. If the similarity is low, no registration candidates are extracted from the first user dictionary even if the user of the first user dictionary and the user of the second user dictionary belong to the same organization or field.
Therefore, the partner user dictionary with which dictionary data is shared can be selected appropriately for each individual user.

Next, the configuration and operation of a natural language processing system 10 according to a first embodiment of the present invention will be described in detail with reference to the drawings.
FIG. 1 is a functional block diagram showing the configuration of the natural language processing system 10.
Referring to FIG. 1, the natural language processing system 10 is, for example, a personal computer, and includes an input device 1 such as a keyboard or a microphone, a data processing device 2 that operates under program control, a storage device 3 that stores information, and an output device 4 such as a display device, a printing device, or a speaker.

The storage device 3 is configured by, for example, a hard disk device, and includes a user dictionary storage unit 31 and a registration candidate storage unit 32.
The user dictionary storage unit 31 stores the user dictionaries 31A, 31B, 31C, and so on of the individual users. Each user dictionary stores the words registered by its user and the linguistic information corresponding to those words. The linguistic information is the information that the natural language processing means 24, described later, refers to in its processing, and consists of, for example, kana notation, reading, translation, part of speech, and semantic information.
The registration candidate storage unit 32 stores registration candidates for each individual user dictionary, such as the registration candidates 32A corresponding to the user dictionary 31A. The registration candidates contain candidates for dictionary data to be registered in the corresponding user dictionary. Here, dictionary data is the smallest unit of information registered in a user dictionary, and consists of a word and its corresponding linguistic information.
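As a rough illustration of the storage layout just described, the following Python sketch models a dictionary data item and the two storage units. The class and field names (`DictionaryEntry`, `Storage`, `reading`, and so on) are hypothetical; the source only states that dictionary data pairs a word with linguistic information such as reading, translation, and part of speech.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DictionaryEntry:
    """Smallest unit of a user dictionary: a word plus its linguistic information."""
    word: str
    reading: str = ""
    part_of_speech: str = ""
    translation: str = ""


@dataclass
class Storage:
    """Stand-in for storage device 3 (both mappings are keyed by dictionary name)."""
    # Plays the role of the user dictionary storage unit 31
    user_dictionaries: dict[str, set[DictionaryEntry]] = field(default_factory=dict)
    # Plays the role of the registration candidate storage unit 32
    registration_candidates: dict[str, set[DictionaryEntry]] = field(default_factory=dict)


storage = Storage()
storage.user_dictionaries["31A"] = {
    DictionaryEntry("tokenizer", reading="tokunaiza", part_of_speech="noun"),
}
storage.registration_candidates["31A"] = set()
```

Making the entry type immutable and hashable lets user dictionaries be plain sets, so "included in one dictionary and not in the other" becomes a set difference.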

The data processing device 2 is, for example, a CPU (Central Processing Unit), and includes similarity calculation means 21, registration candidate extraction means 22, user dictionary registration means 23, and natural language processing means 24.
The similarity calculation means 21 calculates the similarity between two of the user dictionaries stored in the user dictionary storage unit 31 (a first user dictionary and a second user dictionary). As the similarity between user dictionaries, for example, the degree of coincidence of the dictionary data registered in each user dictionary can be used. This degree of coincidence can be defined as the ratio between the total number of dictionary data items registered in the two user dictionaries and the number of dictionary data items registered in common in both user dictionaries.
When the natural language processing performed by the natural language processing means 24 takes natural language character strings as its processing target, as in kana-kanji conversion, machine translation, and speech synthesis, the similarity between the targets of natural language processing performed in the past using the respective user dictionaries may be used as the similarity between the user dictionaries. The similarity between processing targets can be calculated based on, for example, the vector space model proposed by Salton et al. and widely used in the field of information retrieval (G. Salton and M. J. McGill, "Introduction to Modern Information Retrieval", McGraw-Hill, 1983). In the vector space model, for example, a processing target can be represented by a feature vector whose components are the independent words (content words) it contains, and the similarity between processing targets can be defined by, for example, the cosine of the angle formed by their feature vectors.
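The cosine measure mentioned above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: each processing target is assumed to be already reduced to a list of content words, raw term counts are used as vector components (the source does not specify a weighting scheme), and the function name `cosine_similarity` is hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(words_a: list[str], words_b: list[str]) -> float:
    """Cosine of the angle between two term-count feature vectors."""
    vec_a, vec_b = Counter(words_a), Counter(words_b)
    # Only words present in both vectors contribute to the dot product
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for an empty processing target
    return dot / (norm_a * norm_b)


# Hypothetical content words from text processed with two user dictionaries
history_a = ["patent", "dictionary", "translation", "dictionary"]
history_b = ["dictionary", "translation", "speech"]
print(cosine_similarity(history_a, history_b))
```

Extracting content words from Japanese text would require morphological analysis, which is outside the scope of this sketch.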

The registration candidate extraction means 22 takes a pair of user dictionaries (a first user dictionary and a second user dictionary) whose similarity, as calculated by the similarity calculation means 21, is equal to or greater than the threshold, and extracts dictionary data that is included in the first user dictionary and not included in the second user dictionary as candidates for dictionary data to be shared (registration candidates). When registration candidates have been extracted, the dictionary data is recorded in the registration candidate storage unit 32 in association with the second user dictionary. For example, if the user dictionary 31A is the second user dictionary and the user dictionary 31B is the first user dictionary, and there is dictionary data that is included in the user dictionary 31B but not in the user dictionary 31A, the registration candidate extraction means 22 records it in the registration candidate storage unit 32 as the registration candidates 32A corresponding to the user dictionary 31A.
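The extraction rule can be written as a threshold check followed by a set difference. In this hedged Python sketch the function name `extract_candidates`, the use of plain word strings as dictionary data, and the threshold value 0.5 are all assumptions made for illustration.

```python
def extract_candidates(first: set, second: set,
                       similarity: float, threshold: float) -> set:
    """Items in the first (partner) dictionary that the second (target) dictionary lacks.

    Returns an empty set when the dictionaries are not similar enough.
    """
    if similarity < threshold:
        return set()
    return first - second


first_dictionary = {"kana", "kanji", "corpus"}   # sharing partner
second_dictionary = {"kana", "kanji"}            # dictionary to be updated

# Similarity computed as common / total distinct items: 2 / 3
similarity = len(first_dictionary & second_dictionary) / len(first_dictionary | second_dictionary)
print(extract_candidates(first_dictionary, second_dictionary, similarity, threshold=0.5))
```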

The user dictionary registration means 23 reads the registration candidates recorded in the registration candidate storage unit 32 and registers the dictionary data they contain in the corresponding user dictionary. At this point, the dictionary data included in the registration candidates that were read may be displayed on the output device 4, and the user may be asked whether to register it. In that case, dictionary data that the user judges to be unnecessary may additionally be recorded in the registration candidate storage unit 32 together with information indicating that it is excluded from registration, so that even if the registration candidate extraction means 22 later extracts the same dictionary data as a candidate for dictionary data to be shared, it is excluded as not subject to registration.
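The optional confirmation step, including remembering what the user declined, might look like the following Python sketch. The function `register_candidates`, the `confirm` callback that stands in for the prompt shown on the output device 4, and the separate `rejected` set are hypothetical; the source only says that declined data is recorded together with an exclusion marker. Here the exclusion is applied when candidates are read for registration, which is one possible placement.

```python
from typing import Callable


def register_candidates(user_dictionary: set[str], candidates: set[str],
                        rejected: set[str], confirm: Callable[[str], bool]) -> None:
    """Register confirmed candidates; remember declined ones so they are not offered again."""
    for entry in sorted(candidates - rejected):  # skip previously declined items
        if confirm(entry):
            user_dictionary.add(entry)
        else:
            rejected.add(entry)
    candidates.clear()  # all pending candidates have now been handled


dictionary_31a = {"kana"}
rejected_31a: set[str] = set()

# The user accepts "kanji" and declines "slang"
register_candidates(dictionary_31a, {"kanji", "slang"}, rejected_31a,
                    confirm=lambda entry: entry == "kanji")
print(sorted(dictionary_31a), sorted(rejected_31a))
```

If "slang" is extracted again later, it is filtered out before the user is asked.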

The natural language processing means 24 receives input from a user, performs natural language processing using that user's user dictionary stored in the user dictionary storage unit 31, and outputs the result. The natural language processing performed by the natural language processing means 24 is, for example, kana-kanji conversion, machine translation, speech recognition, or speech synthesis. Kana-kanji conversion converts an input kana character string into a character string of mixed kanji and kana; machine translation converts an input character string in a first language into a character string in a second language; speech recognition converts an input speech signal into a character string; and speech synthesis converts an input character string into a speech signal.

Next, the operation by which the natural language processing system 10 extracts candidates for dictionary data to be shared as registration candidates will be described in detail with reference to FIG. 1 and the flowchart of FIG. 2.
FIG. 2 is a flowchart showing the operation by which the natural language processing system 10 extracts registration candidates. In outline, the similarity is calculated for every combination of two dictionaries chosen from the user dictionaries stored in the user dictionary storage unit 31, and for each combination whose similarity is equal to or greater than the threshold, the registration candidates for each of the two user dictionaries are extracted and recorded in the registration candidate storage unit 32.

First, the similarity calculation means 21 takes a pair of user dictionaries out of the user dictionary storage unit 31 (step S101). It then checks whether an unprocessed pair of user dictionaries was obtained (step S102); if not, the process ends.
If an unprocessed pair was obtained, the similarity calculation means 21 calculates the similarity between the two user dictionaries (step S103).
Next, the registration candidate extraction means 22 compares the similarity obtained by the similarity calculation means 21 with the threshold (step S104). If the similarity is smaller than the threshold, the process returns to step S101 and continues. If the similarity is equal to or greater than the threshold, the registration candidate extraction means 22 compares the dictionary data stored in the two user dictionaries and extracts, as registration candidates, the dictionary data contained in only one of them (step S105). If no registration candidate could be extracted, the process returns to step S101 and continues (step S106).
If a registration candidate was extracted, the registration candidate extraction means 22 stores it in the registration candidate storage unit 32 as a registration candidate for the user dictionary that does not contain it, and the process then returns to step S101 and continues (step S107).
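The loop of steps S101 to S107 can be sketched as follows. This is only an illustrative sketch, not the disclosed implementation: it assumes each user dictionary is a set of hashable entries, uses the shared-entry ratio from the first operation example as the similarity, and the names `similarity`, `extract_candidates`, and the user ids are hypothetical.

```python
from itertools import combinations


def similarity(a, b):
    """Shared entries divided by distinct entries across both dictionaries."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def extract_candidates(dictionaries, threshold):
    """Return {user: set of candidate entries} for every user dictionary.

    `dictionaries` maps a hypothetical user id to a set of hashable entries.
    """
    candidates = {user: set() for user in dictionaries}
    # S101/S102: visit every unordered pair once; the loop ends when none remain.
    for first, second in combinations(dictionaries, 2):
        # S103/S104: skip pairs whose similarity is below the threshold.
        if similarity(dictionaries[first], dictionaries[second]) < threshold:
            continue
        # S105-S107: entries held by only one side become candidates for the
        # other side; an empty difference (S106) simply adds nothing.
        candidates[first] |= dictionaries[second] - dictionaries[first]
        candidates[second] |= dictionaries[first] - dictionaries[second]
    return candidates


example = {"A": {1, 2, 3, 4}, "B": {2, 3, 4, 5}, "C": {9}}
result = extract_candidates(example, threshold=0.5)
```

In this toy input, only the pair A and B reaches the threshold (3 shared entries out of 5), so each receives the one entry it lacks and C receives nothing.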

Here, the similarity between two user dictionaries is calculated, and registration candidates are extracted, in units of whole user dictionaries. However, when classification information is set for the dictionary data in the user dictionaries, the similarity may instead be calculated in units of sets of dictionary data that share the same classification information within each user dictionary, and candidates for dictionary data to be shared may be extracted for each such unit.
For example, the similarity (dictionary data set similarity) is calculated between the set of dictionary data in user dictionary 31A having the same classification information (i) (the second dictionary data set) and the set of dictionary data in user dictionary 31B having the same classification information (ii), where classification information (i) and (ii) may be the same or different, and this similarity is used as the similarity of the dictionary data. Then, for example, dictionary data included only in the second dictionary data set is taken as registration candidate data for user dictionary 31A. The similarity of dictionary data sets can be calculated in the same way as the similarity of user dictionaries described above.
Furthermore, in this case, the similarity may be calculated in units of combinations, within the same user dictionary, of the resulting dictionary data sets, and candidates for dictionary data to be shared may be extracted for each such unit.

This registration candidate extraction process may, for example, be run at fixed intervals. Alternatively, it may be run whenever one user dictionary in the user dictionary storage unit 31 is updated and applied only to combinations of the updated user dictionary with each of the other user dictionaries, which makes the processing more efficient.
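The saving from the update-triggered form can be illustrated with a small sketch. With n user dictionaries, a full pass examines n(n-1)/2 pairs, whereas re-checking after a single update needs only n-1. The function name and user ids below are hypothetical.

```python
def pairs_after_update(users, updated):
    """Pairs to re-evaluate when only `updated`'s dictionary has changed."""
    return [(updated, other) for other in users if other != updated]


users = ["A", "B", "C", "D"]
pairs = pairs_after_update(users, "B")
```

For four users, this yields three pairs instead of the six that a full pass would examine.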

Next, the operation by which the natural language processing system 10 registers dictionary data included in the registration candidates in a user dictionary is described in detail with reference to FIG. 1 and the flowchart of FIG. 3.
FIG. 3 is a flowchart showing the operation by which the natural language processing system 10 registers extracted dictionary data in a user dictionary. In outline, candidates for dictionary data to be registered are taken out of the registration candidate storage unit 32 and presented to the individual user, and each item of dictionary data is registered in the corresponding user dictionary according to the user's instructions.

First, the user dictionary registration means 23 takes the registration candidates corresponding to a user dictionary out of the registration candidate storage unit 32 (step S111 in FIG. 3). Next, it checks whether registration candidates were obtained and, if so, whether the dictionary data they contain is not excluded from registration, thereby determining whether any candidate dictionary data to be registered in the user dictionary exists (step S112). If no such candidate exists, that is, if no registration candidates were obtained, or if they were obtained but every item of dictionary data in them carries information indicating that it is excluded from registration, the process ends.

If candidate dictionary data to be registered exists, the user dictionary registration means 23 displays that dictionary data on the output device 4 and asks the user, for each item, whether to register it in the user dictionary (step S113). The user dictionary registration means 23 then accepts input from the input device 1 and checks whether any dictionary data was marked as excluded from registration (step S114). If so, it records that dictionary data in the registration candidate storage unit 32 together with information indicating that it is excluded from registration (step S115). After that, or if no dictionary data was marked as excluded in response to the inquiry of step S113 (the determination in step S114 is NO), the user dictionary registration means 23 checks whether any dictionary data was marked for registration (step S116). If so, it registers that dictionary data in the user dictionary (step S117). After that, or if step S116 finds no dictionary data marked for registration, the process ends.
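The registration flow of steps S111 to S117 can be sketched as below. This is an illustration under stated assumptions rather than the disclosed implementation: candidates, exclusions, and the user dictionary are modeled as sets; `ask` is a hypothetical callback standing in for the inquiry on the output device 4 and the answer from the input device 1; and removing a registered entry from the candidates follows the first operation example.

```python
def register_candidates(candidates, excluded, user_dictionary, ask):
    """Apply the user's answers to the stored registration candidates.

    `ask(pending)` returns a mapping entry -> "register" or "exclude";
    entries that are missing from the mapping are left untouched.
    """
    # S111/S112: keep only candidates not already marked as excluded.
    pending = sorted(entry for entry in candidates if entry not in excluded)
    if not pending:
        return
    answers = ask(pending)  # S113: present the candidates and collect answers
    for entry in pending:
        answer = answers.get(entry)
        if answer == "exclude":      # S114/S115: remember the refusal
            excluded.add(entry)
        elif answer == "register":   # S116/S117: add to the user dictionary
            user_dictionary.add(entry)
            candidates.discard(entry)


candidates = {"ジーン", "遺伝子診断"}
excluded = {"遺伝子診断"}  # refused on an earlier occasion
user_dictionary = set()
shown = []


def ask(pending):
    shown.extend(pending)
    return {"ジーン": "register"}


register_candidates(candidates, excluded, user_dictionary, ask)
```

Because "遺伝子診断" is already marked as excluded, only "ジーン" is shown to the user, and after the call it has moved from the candidates into the user dictionary.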

The process of registering extracted dictionary data in a user dictionary may be run, for example, when the user explicitly invokes it, or when the user invokes the natural language processing means 24.

Next, a first specific operation example of the natural language processing system 10 is described.
In this example, the natural language processing means 24 is assumed to perform machine translation. The example describes how registration candidates are extracted from the two user dictionaries 31A and 31B of FIG. 1, and how the dictionary data in those candidates is registered in user dictionary 31A and user dictionary 31B.
FIG. 4(a) shows the contents of user dictionary 31A, and FIG. 4(b) shows the contents of user dictionary 31B. Both are shown in tabular form, with each row representing one item of dictionary data. For example, the first row of FIG. 4(a) represents dictionary data whose Japanese is "キメラ", whose English is "chimera", and whose part of speech is "noun". In this example, "キメラ" is the word, and "chimera" and "noun" are the language information.

In the registration candidate extraction process, the similarity calculation means 21 takes user dictionary 31A and user dictionary 31B out of the user dictionary storage unit 31 and calculates the similarity between them.
User dictionary 31A contains 10 items of dictionary data and user dictionary 31B contains 11. Nine of these, such as the "キメラ" row, are registered in both, so the total number of distinct items registered in the two dictionaries is 10 + 11 - 9 = 12.
From these values, the similarity calculation means 21 calculates the similarity between user dictionary 31A and user dictionary 31B as 9/12 = 0.75.

Next, the registration candidate extraction means 22 compares the similarity of 0.75 output by the similarity calculation means 21 with the threshold. If the threshold is 0.7, the similarity between user dictionary 31A and user dictionary 31B is at or above it, so processing proceeds to extracting registration candidates from both. For user dictionary 31A, the dictionary data for "ジーン" (gene) and "遺伝子診断" (genetic diagnosis), which are included in user dictionary 31B but not in user dictionary 31A, are extracted as registration candidates and recorded in the registration candidate storage unit 32A, the storage area for registration candidates corresponding to user dictionary 31A (in this case, user dictionary 31A corresponds to the second user dictionary and user dictionary 31B to the first user dictionary). For user dictionary 31B, the dictionary data for "トランスポゾン" (transposon), which is included in user dictionary 31A but not in user dictionary 31B, is extracted and stored in the registration candidate storage unit 32B, the storage area for registration candidates corresponding to user dictionary 31B (in this case, user dictionary 31A corresponds to the first user dictionary and user dictionary 31B to the second user dictionary).
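The figures in this example can be reproduced with a short sketch. Only the four words named in the text are real entries; the remaining eight shared rows of FIG. 4 are not listed in the description, so hypothetical placeholder strings stand in for them.

```python
# Placeholder entries stand in for the rows of FIG. 4 that the text does not list.
shared = {f"shared-{i}" for i in range(8)} | {"キメラ"}  # 9 entries in both
dict_31a = shared | {"トランスポゾン"}                    # 10 entries
dict_31b = shared | {"ジーン", "遺伝子診断"}               # 11 entries

common = len(dict_31a & dict_31b)
total = len(dict_31a) + len(dict_31b) - common  # distinct entries overall
similarity = common / total

THRESHOLD = 0.7
# Each dictionary's candidates are the entries that only the other one holds.
candidates_for_31a = dict_31b - dict_31a if similarity >= THRESHOLD else set()
candidates_for_31b = dict_31a - dict_31b if similarity >= THRESHOLD else set()
```

With these inputs, `similarity` is 9/12 = 0.75, which clears the 0.7 threshold, so 31A is offered "ジーン" and "遺伝子診断" while 31B is offered "トランスポゾン".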

FIG. 5(a) and FIG. 5(b) show examples of the registration candidate storage unit 32A and the registration candidate storage unit 32B obtained as a result of this processing. As FIG. 5 shows, in addition to the dictionary data that was stored in the user dictionary, the registration candidate storage unit 32 has a field for information indicating whether that dictionary data is excluded from registration. Here, "-" indicates that the check of whether the item is excluded from registration has not yet been completed, and "excluded" (対象外) indicates that the item was designated as excluded from registration in the past. That is, among the registration candidates for user dictionary 31A, "ジーン" has not yet been checked, whereas "遺伝子診断" was previously extracted as a candidate for dictionary data to be shared, and when user A, the owner of user dictionary 31A, was then asked whether to register it, user A answered that registration was unnecessary.

The natural language processing system 10 repeats the above registration candidate extraction process for every combination of two user dictionaries stored in the user dictionary storage unit 31, thereby extracting registration candidates for all user dictionaries and storing them in the area of the registration candidate storage unit 32 corresponding to each user dictionary.

In the process of registering dictionary data included in the registration candidates in a user dictionary, the user dictionary registration means 23 takes the registration candidates out of the registration candidate storage unit 32 and asks the user whether to register them in the user dictionary.
Consider, for example, the case where user A, the owner of user dictionary 31A, explicitly invokes this process. The user dictionary registration means 23 first takes the registration candidates out of the registration candidate storage unit 32A and inspects them. Suppose the contents of the registration candidate storage unit 32A are as shown in FIG. 5(a). The user dictionary registration means 23 takes out the two items "ジーン" and "遺伝子診断", checks whether each is excluded from registration, and selects "ジーン", which is not excluded, as the final registration candidate about which the user is asked.

The user dictionary registration means 23 then displays the registration information for "ジーン" on the output device 4 and asks user A whether to register it. FIG. 6 shows an example of the inquiry screen. At the top of the screen, the character string 61, "The following dictionary data has been detected as a candidate for registration in the user dictionary. Register it in the user dictionary?", is displayed to ask whether to register the candidates. The list box 63 shows the candidate dictionary data together with radio buttons 62a, 62b, and 62c with which the user enters an answer. At the bottom of the screen are buttons 64 and 65 for confirming or canceling the input.
If user A selects the "Register" radio button 62a for "ジーン" and presses the "Execute" button 64, the user dictionary registration means 23 registers the "ジーン" dictionary data in user dictionary 31A and deletes it from the registration candidate storage unit 32A.
If user A selects the "Do not register" radio button 62b and presses the "Execute" button 64, the user dictionary registration means 23 writes "excluded" in the excluded-from-registration field of the "ジーン" dictionary data in the registration candidate storage unit 32A.
If user A selects the "Hold" radio button 62c and presses the "Execute" button 64, or presses the "Cancel" button 65 regardless of which radio button is selected, the user dictionary registration means 23 does nothing.

FIG. 6 shows an example in which radio buttons are used in the user interface for specifying whether to register, but check boxes may be used instead, as in FIG. 7. In FIG. 7, if the check box 66 for the dictionary data "ジーン" is checked, the user dictionary registration means 23 registers the "ジーン" dictionary data in user dictionary 31A and deletes it from the registration candidate storage unit 32A.
If the check box 66 for the dictionary data "ジーン" is not checked, the user dictionary registration means 23 writes "excluded" in the excluded-from-registration field of the "ジーン" dictionary data in the registration candidate storage unit 32A.

Next, a second specific operation example of the natural language processing system 10 is described.
In this example, the natural language processing means 24 is assumed to perform machine translation. The example describes how registration candidates are extracted from the two user dictionaries 31A and 31B of FIG. 1, and how the dictionary data in those candidates is registered in user dictionary 31A and user dictionary 31B.
FIG. 8(a) shows the contents of user dictionary 31A, and FIG. 8(b) shows the contents of user dictionary 31B. As in FIG. 4(a) and FIG. 4(b), the contents of the two user dictionaries are shown in tabular form, but classification information is additionally recorded for each item of dictionary data. User dictionary 31A holds dictionary data with the classification information "gene-related" (遺伝子関連), and user dictionary 31B holds dictionary data with the classification information "gene basics" (遺伝子基礎) and dictionary data with the classification information "gene applications" (遺伝子応用).
The classification information may be, for example, information that each user has set in order to use different dictionary data within the user dictionary depending on the situation.

In the registration candidate extraction process, the similarity calculation means 21 calculates the similarity between the dictionary data sets contained in user dictionary 31A and user dictionary 31B. It divides each user dictionary into sets of dictionary data having the same classification information and takes out pairs of those dictionary data sets. From user dictionary 31A it obtains the "gene-related" (遺伝子関連) dictionary data set, consisting of the 10 items whose classification information is "gene-related". From user dictionary 31B it obtains the "gene basics" (遺伝子基礎) dictionary data set, consisting of the 9 items whose classification information is "gene basics", and the "gene applications" (遺伝子応用) dictionary data set, consisting of the 2 items whose classification information is "gene applications".

Next, the similarity calculation means 21 calculates the similarity for each pair of dictionary data sets. Using, for example, the similarity calculation method shown in the first specific operation example, the similarity between the "gene-related" (遺伝子関連) dictionary data set and the "gene basics" (遺伝子基礎) dictionary data set is 8/11 ≈ 0.73, and the similarity between the "gene-related" dictionary data set and the "gene applications" (遺伝子応用) dictionary data set is 1/11 ≈ 0.09.

Next, the registration candidate extraction means 22 compares the similarities output by the similarity calculation means 21 with the threshold. If the threshold is 0.7, the similarity between the "gene-related" (遺伝子関連) dictionary data set and the "gene basics" (遺伝子基礎) dictionary data set is at or above it, so dictionary data to be shared is extracted from this pair.
For user dictionary 31A, the dictionary data for "ジーン", which is included in the "gene basics" dictionary data set but not in the "gene-related" dictionary data set, is extracted and stored in the registration candidate storage unit 32A, the storage area for candidates to be registered in user dictionary 31A. For user dictionary 31B, the dictionary data for "トランスポゾン", which is included in the "gene-related" dictionary data set but not in the "gene basics" dictionary data set, is extracted and stored in the registration candidate storage unit 32B, the storage area for candidates to be registered in user dictionary 31B. FIG. 9(a) and FIG. 9(b) show examples of the registration candidate storage unit 32A and the registration candidate storage unit 32B obtained as a result of this processing. As in FIG. 5(a) and FIG. 5(b), the extracted dictionary data is recorded together with information indicating whether it is excluded from registration.
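The classification-based variant can be sketched as follows, again only as an illustration. Entries are modeled as hypothetical (word, classification) pairs, and placeholder words stand in for the rows of FIG. 8 that the text does not list. One further assumption is made so that the output matches the example: words the target dictionary already holds under another classification are dropped from its candidates.

```python
from collections import defaultdict


def group_by_classification(entries):
    """Split (word, classification) pairs into one word set per classification."""
    groups = defaultdict(set)
    for word, classification in entries:
        groups[classification].add(word)
    return dict(groups)


def similarity(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def category_candidates(target, source, threshold):
    """Words from `source` to propose for registration in `target`."""
    found = set()
    for target_set in group_by_classification(target).values():
        for source_set in group_by_classification(source).values():
            if similarity(source_set, target_set) >= threshold:
                found |= source_set - target_set
    # Assumption: skip words the target already holds under another classification.
    return found - {word for word, _ in target}


shared = [f"shared-{i}" for i in range(8)]  # placeholder rows
dict_31a = {(w, "gene-related") for w in shared + ["トランスポゾン", "applied-term"]}
dict_31b = {(w, "gene basics") for w in shared + ["ジーン"]} | {
    (w, "gene applications") for w in ["applied-term", "遺伝子診断"]
}

for_31a = category_candidates(dict_31a, dict_31b, 0.7)
for_31b = category_candidates(dict_31b, dict_31a, 0.7)
```

The set sizes mirror the example (10, 9, and 2 entries, with 8 and 1 shared), so the pairwise similarities are 8/11 and 1/11, and only the first pair passes the 0.7 threshold.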

In the registration candidate extraction process, this processing is repeated for every combination of dictionary data sets across the user dictionaries stored in the user dictionary storage unit 31, so that candidates for dictionary data to be registered in each user dictionary are extracted and stored in the corresponding area of the registration candidate storage unit 32.
The process of registering the extracted dictionary data is the same as in the first specific operation example, so its description is omitted.

As described above, in the natural language processing system 10 the similarity calculation means 21 calculates the similarity between individual user dictionaries, and the registration candidate extraction means 22 selects the partners with whom dictionary data should be shared based on that similarity. Appropriate sharing partners can therefore be found flexibly on a per-user-dictionary basis.
In addition, because the similarity calculation means 21 calculates the similarity between individual user dictionaries each time, the best sharing partner at that point in time can be found. Dictionary data to be shared can therefore be extracted appropriately.

The present invention can also be implemented as a program that causes a computer to execute each function of the natural language processing system 10.
Such an embodiment is shown in FIG. 10.
The computer 8 includes an input device 1, a storage device 3 such as a hard disk device, an output device 4, a main storage device 7 such as a RAM (Random Access Memory), and a CPU 5 that controls each of these devices and performs arithmetic operations.
The natural language processing program 6 stored in the main storage device 7 is read and executed by the CPU 5, controls the operation of the computer 8, and creates the user dictionary storage unit 31 and the registration candidate storage unit 32 in the storage device 3. The natural language processing program 6 also causes the CPU 5 to operate as the data processing device 2 of FIG. 1, and thereby causes the computer 8 to operate as the natural language processing system 10.

The present invention is applicable to programs for implementing on a computer a kana-kanji conversion device that converts an input kana character string into a mixed kanji-kana character string, a machine translation device that converts an input character string in a first language into a character string in a second language, a speech recognition device that converts an input speech signal into a character string, or a speech synthesis device that converts an input character string into a speech signal. It is also applicable to programs for implementing on a computer a dictionary creation support device that supports the creation of dictionaries used in natural language processing.

FIG. 1 is a functional block diagram of a natural language processing system according to a first embodiment of the present invention.
FIG. 2 is a flowchart showing the registration candidate extraction operation of the natural language processing system.
FIG. 3 is a flowchart showing the dictionary data registration operation of the natural language processing system.
FIG. 4 is a diagram showing an example of the data structure of a user dictionary.
FIG. 5 is a diagram showing an example of the data structure of the registration candidate storage unit.
FIG. 6 is a diagram showing an example of a dictionary data registration confirmation screen.
FIG. 7 is a diagram showing an example of a dictionary data registration confirmation screen.
FIG. 8 is a diagram showing an example of the data structure of a user dictionary.
FIG. 9 is a diagram showing an example of the data structure of the registration candidate storage unit.
FIG. 10 is a diagram showing an embodiment implemented as a computer program.

Explanation of symbols

1 Input device
2 Data processing device
3 Storage device
4 Output device
5 CPU
6 Natural language processing program
7 Main storage device
8 Computer
10 Natural language processing system
21 Similarity calculation means
22 Registration candidate extraction means
23 User dictionary registration means
24 Natural language processing means
31 User dictionary storage unit
32 Registration candidate storage unit

Claims

A natural language processing system that performs natural language processing using user dictionaries storing dictionary data, each item of which includes a word and language information for the word, the system comprising:
similarity calculation means for calculating a similarity between a first user dictionary, which is the user dictionary of a partner with whom dictionary data is to be shared, and a second user dictionary, which is the user dictionary in which dictionary data is to be registered;
registration candidate extraction means for extracting, when the similarity is equal to or greater than a predetermined threshold, dictionary data that is included in the first user dictionary and not included in the second user dictionary as registration candidates for the second user dictionary; and
user dictionary registration means for registering dictionary data included in the registration candidates in the second user dictionary.

The natural language processing system according to claim 1, wherein the similarity calculation means calculates the similarity based on the ratio between the total number of dictionary data items registered in the first user dictionary and the second user dictionary and the number of dictionary data items registered in common in the first user dictionary and the second user dictionary.

The natural language processing system according to claim 1, wherein the similarity calculation means calculates the similarity based on a processing target similarity, which is the similarity between the processing targets of natural language processing performed in the past using the first user dictionary and the processing targets of natural language processing performed in the past using the second user dictionary.

The dictionary data includes classification information;
The similarity calculation means calculates the similarity based on a dictionary data set similarity, which is the similarity between a first dictionary data set, consisting of dictionary data having the same classification information among the dictionary data stored in the first user dictionary, and a second dictionary data set, consisting of dictionary data having the same classification information among the dictionary data stored in the second user dictionary;
The registration candidate extraction means extracts, as the registration candidates, dictionary data that is included in the first dictionary data set and not included in the second dictionary data set, in the natural language processing system.
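
The classification-based variant can be sketched as follows. The example assumes that each entry is a `(word, classification)` pair, that the first and second dictionary data sets being compared carry the same classification label, and that the data set similarity is the shared-entry ratio over the union; none of these details is fixed by the claim, and all names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Entry = Tuple[str, str]  # (word, classification), a hypothetical layout


def group_by_classification(entries: Iterable[Entry]) -> Dict[str, Set[str]]:
    """Split a user dictionary into data sets sharing one classification."""
    groups: Dict[str, Set[str]] = defaultdict(set)
    for word, classification in entries:
        groups[classification].add(word)
    return dict(groups)


def data_set_similarity(first_set: Set[str], second_set: Set[str]) -> float:
    """Assumed measure: common entries divided by all distinct entries."""
    union = first_set | second_set
    return len(first_set & second_set) / len(union) if union else 0.0


def candidates_by_classification(first: Iterable[Entry],
                                 second: Iterable[Entry],
                                 threshold: float) -> Dict[str, Set[str]]:
    """For each classification, extract words missing from the second data set
    when the two data sets are similar enough."""
    first_groups = group_by_classification(first)
    second_groups = group_by_classification(second)
    result: Dict[str, Set[str]] = {}
    for classification, first_set in first_groups.items():
        second_set = second_groups.get(classification, set())
        if data_set_similarity(first_set, second_set) >= threshold:
            missing = first_set - second_set
            if missing:
                result[classification] = missing
    return result
```

Comparing per classification lets two users share entries in a field they have in common even when their dictionaries as a whole differ.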

The natural language processing system according to claim 4, wherein the similarity calculation means calculates the dictionary data set similarity based on the ratio between the total number of dictionary data entries registered in the first dictionary data set and the second dictionary data set and the number of dictionary data entries registered in common in the first dictionary data set and the second dictionary data set.

The natural language processing system according to claim 4, wherein the similarity calculation means calculates the dictionary data set similarity based on a processing target similarity, which is the similarity between the processing targets of natural language processing performed in the past using the first dictionary data set and the processing targets of natural language processing performed in the past using the second dictionary data set.

In a natural language processing method for performing natural language processing using a user dictionary storing dictionary data including a word and language information for the word,
A similarity calculation step of reading, from a storage device, a first user dictionary, which is the user dictionary of a partner with whom dictionary data is shared, and a second user dictionary, which is the user dictionary in which dictionary data is to be registered, and calculating the similarity between the first user dictionary and the second user dictionary;
A registration candidate extraction step that operates when the similarity is equal to or higher than a predetermined threshold, extracts dictionary data that is included in the first user dictionary and not included in the second user dictionary as registration candidates for the second user dictionary, and records the registration candidates in the storage device;
A natural language processing method comprising: a user dictionary registration step of reading the registration candidates from the storage device and registering dictionary data included in the registration candidates in the second user dictionary.

In a natural language processing program for performing natural language processing using a user dictionary storing dictionary data including a word and language information for the word,
The program causes a computer to execute:
A similarity calculation function of reading, from a storage device, a first user dictionary, which is the user dictionary of a partner with whom dictionary data is shared, and a second user dictionary, which is the user dictionary in which dictionary data is to be registered, and calculating the similarity between the first user dictionary and the second user dictionary;
A registration candidate extraction function of extracting, when the similarity is equal to or higher than a predetermined threshold, dictionary data that is included in the first user dictionary and not included in the second user dictionary as registration candidates for the second user dictionary, and recording the registration candidates in the storage device; and
A user dictionary registration function of reading the registration candidates from the storage device and registering the dictionary data included in the registration candidates in the second user dictionary.

The natural language processing program according to claim 8, wherein the similarity calculation function calculates the similarity based on the ratio between the total number of dictionary data entries registered in the first user dictionary and the second user dictionary and the number of dictionary data entries registered in common in the first user dictionary and the second user dictionary.

The natural language processing program according to claim 8, wherein the similarity calculation function calculates the similarity based on a processing target similarity, which is the degree of similarity between the processing targets of natural language processing performed in the past using the first user dictionary and the processing targets of natural language processing performed in the past using the second user dictionary.

The dictionary data includes classification information;
The similarity calculation function calculates the similarity based on a dictionary data set similarity, which is the similarity between a first dictionary data set, consisting of dictionary data having the same classification information among the dictionary data stored in the first user dictionary, and a second dictionary data set, consisting of dictionary data having the same classification information among the dictionary data stored in the second user dictionary;
The registration candidate extraction function extracts, as the registration candidates, dictionary data that is included in the first dictionary data set and not included in the second dictionary data set, in the natural language processing program according to claim 8.

The natural language processing program according to claim 11, wherein the similarity calculation function calculates the dictionary data set similarity based on the ratio between the total number of dictionary data entries registered in the first dictionary data set and the second dictionary data set and the number of dictionary data entries registered in common in the first dictionary data set and the second dictionary data set.

The natural language processing program according to claim 11, wherein the similarity calculation function calculates the dictionary data set similarity based on a processing target similarity, which is the similarity between the processing targets of natural language processing performed in the past using the first dictionary data set and the processing targets of natural language processing performed in the past using the second dictionary data set.