JP2019012457A - Information processing device, information processing method, and program

Info

Publication number: JP2019012457A
Application number: JP2017129309A
Authority: JP
Inventors: Isao Sonobe (勲園部); Takashi Mitsubuchi (喬三淵); Hideaki Tanaka (田中　秀明); Hiroaki Takasu (弘明鷹栖); Kazuhiro Yamada (一宏山田); Yasuhiro Kono (泰弘光野)
Original assignee: NS Solutions Corp
Current assignee: NS Solutions Corp
Priority date: 2017-06-30
Filing date: 2017-06-30
Publication date: 2019-01-24

Abstract

PROBLEM TO BE SOLVED: To provide an information processing device, an information processing method, and a program that support improving the accuracy of word category classification. SOLUTION: Words are extracted from a plurality of sentences. Based on a category dictionary including information on the categories corresponding to words, the categories of those extracted words that are registered in the category dictionary are determined. Distributed representations of the words whose categories have been determined are acquired based on the plurality of sentences. Based on the acquired distributed representations and the determined categories, a classification model is learned that classifies the category of a word represented by a distributed representation. SELECTED DRAWING: Figure 5

Description

The present invention relates to an information processing apparatus, an information processing method, and a program.

Chatbots and similar systems that automatically respond to user inquiries use natural language processing to learn from text data such as FAQs and inquiry-handling logs and to interpret the meaning of sentences. Such natural language processing requires techniques for analyzing both the structure of a sentence (morphemes, dependency structure, and so on) and its meaning (predicate-argument structure, semantic roles, semantic frames, and so on). For example, some semantic analysis systems first assign categories, by referring to a category dictionary, to the phrases obtained by morphological analysis and dependency parsing of an input sentence. They then perform predicate-argument structure analysis by referring to a semantic frame dictionary, using the predicate and the category information of the phrases that depend on it as a key. Finally, they identify the sense of the predicate (its semantic frame) and assign semantic roles to the arguments in the dependency relation. However, when a sentence contains a word that is not registered in the category dictionary, the category of the phrase that serves as the key cannot be determined, so the semantic frame dictionary cannot be consulted and semantic roles cannot be assigned. In other words, the range of sentences that can be semantically analyzed is limited by the range of words registered in the category dictionary. Correctly determining word categories is therefore an important step.
Increasing the number of words registered in the category dictionary allows word categories to be determined more accurately, but manual word registration requires expert knowledge of word meanings. In a domain where technical terms are numerous or increase daily, creating a category dictionary entirely by hand is laborious. One conceivable approach is therefore to extract synonyms of the words registered in the category dictionary and assign each synonym the same category as the registered word, thereby increasing the number of registered words.
As a synonym extraction technique, Patent Document 1, for example, discloses a technique for extracting synonyms of an input word based on word co-occurrence data. Because this technique relies on the co-occurrence frequency of the input word and the synonym candidates, it may judge words that co-occur but are semantically unrelated to be synonyms, and its accuracy is therefore low.
Patent Document 2 discloses a technique for determining whether the two words in an arbitrary word pair are synonyms. Labels are assigned, with reference to a synonym dictionary, to feature vectors generated for word pairs extracted from text, and learning is then performed on the labeled vectors. The synonym determination technique disclosed in Patent Document 2 is a kind of supervised learning and requires a large amount of labeled data to improve accuracy. Moreover, because a synonym dictionary stores pairs of words that are synonyms, a category dictionary cannot be used as the labeled data.

Patent Document 1: Japanese Patent No. 5611173
Patent Document 2: Japanese Patent No. 5356197

Because of the problems described above, word category classification may fail: the category of a word cannot be determined, or an incorrect category is determined. An object of the present invention is therefore to support improving the accuracy of word category classification.

To this end, the information processing apparatus of the present invention includes: extraction means for extracting words from a plurality of sentences; determination means for determining, based on a category dictionary including information on categories corresponding to words, the categories of those words extracted by the extraction means that are registered in the category dictionary; acquisition means for acquiring, based on the plurality of sentences, distributed representations of the words whose categories have been determined by the determination means; and learning means for learning, based on the distributed representations acquired by the acquisition means and the categories determined by the determination means, a classification model that classifies the category of a word represented by a distributed representation.

According to the present invention, it is possible to support improving the accuracy of word category classification.

FIG. 1 is a diagram illustrating an example of a hardware configuration of the information processing apparatus.
FIG. 2 is a diagram illustrating an example of a category dictionary.
FIG. 3 is a diagram illustrating an example of a semantic frame dictionary.
FIG. 4 is a diagram illustrating an example of a functional configuration of the information processing apparatus.
FIG. 5 is a flowchart illustrating an example of processing of the information processing apparatus.
FIG. 6 is a flowchart illustrating an example of processing of the information processing apparatus.
FIG. 7 is a flowchart illustrating an example of processing of the information processing apparatus.
FIG. 8 is a diagram illustrating an example of relationships between words in the semantic space.
FIG. 9 is a diagram illustrating an example of relationships between words.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.
<Embodiment 1>
(Overview of processing in this embodiment)
An overview of the processing in this embodiment is given below. In this embodiment, the information processing apparatus 100 is the entity that performs the processing. The information processing apparatus 100 is an apparatus such as a personal computer (PC), a server device, a tablet device, or a smartphone.
In this embodiment, the information processing apparatus 100 extracts the words in sentences from corpus data that includes a plurality of pieces of ticket data from an issue management system. The corpus data may also include text other than ticket data. The information processing apparatus 100 then refers to the category dictionary and determines the category of each word that was extracted from the corpus data and is registered in the category dictionary. A category is a class used to group words by their properties. Next, the information processing apparatus 100 acquires a distributed representation of each word whose category has been determined. Based on the distributed representation and the category of each such word, the information processing apparatus 100 learns a category classification model used for category classification. The category classification model is, for example, a classifier that accepts the distributed representation of a word as input and returns the category of that word as output.
The information processing apparatus 100 uses the learned classifier to determine the categories of words that were extracted from the corpus data but are not registered in the category dictionary. The information processing apparatus 100 can thereby determine a category even for extracted words that are not registered in the category dictionary.

The information processing apparatus 100 also detects, among the words whose categories have been determined, words to which an inappropriate category has been assigned. The information processing apparatus 100 identifies a word A and a word B that are synonymous with each other. It then identifies a word C that has a certain relationship with the word A and a word D that has the same kind of relationship with the word B. Based on the positional relationship, in the semantic space, of the distributed representations of the identified words A, B, C, and D, the information processing apparatus 100 detects that the categories of the words C and D may be inappropriate.
The information processing apparatus 100 corrects the categories of the words C and D and relearns the category classification model based on the words whose categories have been corrected.

(Hardware configuration of the information processing apparatus)
FIG. 1 is a diagram illustrating an example of a hardware configuration of the information processing apparatus 100.
The information processing apparatus 100 includes a CPU 101, a main storage device 102, an auxiliary storage device 103, a network I/F 104, and an input/output I/F 105. These elements are communicably connected to one another via a system bus 106.
The CPU 101 is a central processing unit that controls the information processing apparatus 100. The main storage device 102 is a storage device, such as a random access memory (RAM), that functions as a work area for the CPU 101 and as a temporary storage location for data.

The auxiliary storage device 103 is a storage device that stores various setting information, various programs, corpus data, training data, various dictionary data, various model information, information on various thresholds, and the like. In this embodiment, the auxiliary storage device 103 stores corpus data that includes a plurality of pieces of ticket data, which are text data, from the issue management system. The auxiliary storage device 103 also stores category dictionary data. A plurality of words are registered in the category dictionary data, together with information on the category of each registered word. The category dictionary data is, for example, table-format data indicating the correspondence between words and categories. FIG. 2 is a diagram illustrating an example of the category dictionary data. In the example of FIG. 2, the "word" item indicates each registered word, and the "category" item indicates the category of the corresponding word.
The auxiliary storage device 103 also stores semantic frame dictionary data. In the semantic frame dictionary data, a plurality of combinations of a predicate and the categories of the phrases that depend on that predicate are registered, together with the semantic frame (information such as semantic roles and meaning) corresponding to each registered combination. The semantic frame dictionary data is, for example, table-format data indicating the correspondence among predicates, the categories of the phrases that depend on them, and semantic frames. FIG. 3 is a diagram illustrating an example of the semantic frame dictionary data. In the example of FIG. 3, the "predicate" item indicates a registered predicate, and the "argument" item indicates information on a phrase that depends on the corresponding predicate. The "category" item indicates the category of the word in the corresponding phrase, the "case" item indicates the particle that follows the word in the corresponding phrase, and the "semantic role" item indicates the semantic role of the corresponding phrase. The "predicate meaning" item indicates the meaning of the corresponding predicate. The auxiliary storage device 103 is configured as a storage medium such as a read-only memory (ROM), a hard disk drive (HDD), a solid state drive (SSD), or a flash memory.
The network I/F 104 is an interface used for communication with external devices via a network such as the Internet or a LAN. The input/output I/F 105 is an interface used for inputting information from input devices such as a mouse, a keyboard, and the operation unit of a touch panel. The input/output I/F 105 is also used for outputting information to output devices such as a display, the display unit of a touch panel, and a speaker.
When the CPU 101 executes processing based on the programs stored in the auxiliary storage device 103, the functions of the information processing apparatus 100 described later with reference to FIG. 4 and the processing of the flowcharts described later with reference to FIGS. 5, 6, and 7 are realized.
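The two table-format dictionaries described above can be pictured with a small Python sketch. Everything concrete here is an assumption made for illustration: the contents of FIG. 2 and FIG. 3 are not reproduced in this text, so the entries, the romanized particles, and the names CATEGORY_DICT, FrameEntry, FRAME_DICT, lookup_category, and lookup_frames are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical category dictionary: word -> category (stand-in for FIG. 2).
CATEGORY_DICT = {
    "server": "equipment",
    "database": "software",
    "administrator": "person",
}


@dataclass(frozen=True)
class FrameEntry:
    """One hypothetical row of the semantic frame dictionary (stand-in for FIG. 3)."""
    predicate: str
    category: str           # category of the word in the dependent phrase
    case: str               # particle following the word (romanized here)
    semantic_role: str      # semantic role of the dependent phrase
    predicate_meaning: str  # meaning of the predicate in this frame


FRAME_DICT = [
    FrameEntry("restart", "equipment", "wo", "object", "reboot a device"),
    FrameEntry("restart", "person", "ga", "agent", "reboot a device"),
]


def lookup_category(word):
    """Return the registered category, or None if the word is not registered."""
    return CATEGORY_DICT.get(word)


def lookup_frames(predicate, category, case):
    """Return the frame rows matching the key (predicate, category, case)."""
    return [
        e for e in FRAME_DICT
        if e.predicate == predicate and e.category == category and e.case == case
    ]


# A registered word yields a category and therefore a usable lookup key.
print(lookup_frames("restart", lookup_category("server"), "wo"))
# An unregistered word yields no category, so the frame lookup finds nothing.
print(lookup_frames("restart", lookup_category("kubernetes"), "wo"))
```

The second call illustrates the limitation discussed in the background: without a category for the word, there is no key with which to consult the semantic frame dictionary.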

(Functional configuration of the information processing apparatus)
FIG. 4 is a diagram illustrating an example of a functional configuration of the information processing apparatus 100.
The information processing apparatus 100 includes an acquisition unit 401, a learning unit 402, an extraction unit 403, and an update unit 404.
The acquisition unit 401 extracts words from the sentences in the corpus data stored in the auxiliary storage device 103. The acquisition unit 401 then acquires distributed representations of the extracted words based on the corpus data.
A distributed representation is a technique for representing a word as a real-valued vector of multiple dimensions (for example, 100 to 300 dimensions). According to the distributional hypothesis, the meaning of a word in a sentence is determined by its surrounding words (context words). Under this hypothesis, a word can be represented as a vector in which each element indicates the occurrence probability of one context word. Because the number of possible context words is enormous (one trillion or more), the size of this vector is also enormous (one trillion or more dimensions), too large to be stored in the memory of a PC or the like. However, most of the elements of this vector are zero, so the vector can be compressed (for example, to a size of 100 to 300 dimensions). In a distributed representation, a word is represented as a vector compressed in this way, on the premise of the distributional hypothesis.
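The uncompressed vector described above, in which each element is the occurrence probability of one context word, can be sketched in plain Python as follows. This is only an illustration under assumptions: the toy sentences, the window size, and the function name cooccurrence_vectors are hypothetical, and the compression step to 100 to 300 dimensions is not shown.

```python
from collections import Counter


def cooccurrence_vectors(sentences, window=2):
    """Build, for each word, a vector of context-word occurrence probabilities."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    counts = {w: Counter() for w in vocab}
    for s in sentences:
        for i, w in enumerate(s):
            start, end = max(0, i - window), min(len(s), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[w][s[j]] += 1
    vectors = {}
    for w in vocab:
        total = sum(counts[w].values())
        vec = [0.0] * len(vocab)  # one dimension per possible context word
        for ctx, c in counts[w].items():
            vec[index[ctx]] = c / total
        vectors[w] = vec
    return vocab, vectors


# Toy corpus (hypothetical): two tokenized sentences.
sentences = [
    ["restart", "the", "server"],
    ["restart", "the", "database"],
]
vocab, vectors = cooccurrence_vectors(sentences)
for word in vocab:
    print(word, vectors[word])
```

In this toy corpus, "server" and "database" occur in identical contexts and therefore receive identical vectors, while "restart" does not. The vector length equals the vocabulary size, which is why a realistic vocabulary makes this uncompressed form impractical.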

The closer two words are in meaning, the closer their distributed representations are as vectors. Because of this property, among the vectors given by the distributed representations of the words extracted by the acquisition unit 401, words with closer meanings have closer vectors.
Words expressed as distributed representations also have the following property. The more similar the relationship between a word (1) and a word (2) is to the relationship between a word (3) and a word (4), the closer the vector indicating the difference between the word (1) and the word (2) is to the vector indicating the difference between the word (3) and the word (4). In the processing described later with reference to FIG. 7, the information processing apparatus 100 uses this property to detect words whose categories have not been correctly classified.
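Both properties can be checked numerically with a short sketch. The three-dimensional vectors below are hand-made toy values chosen for illustration, not learned distributed representations, and the use of cosine similarity as the measure of closeness is an assumption; the text does not specify a particular measure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def diff(u, v):
    """Difference vector u - v."""
    return [a - b for a, b in zip(u, v)]


# Hand-made toy vectors (hypothetical values).
VEC = {
    "Japan": [1.0, 0.0, 0.2],
    "Tokyo": [1.0, 1.0, 0.2],
    "France": [0.8, 0.0, 0.4],
    "Paris": [0.8, 1.0, 0.4],
    "server": [0.0, 0.2, 1.0],
}

# Property 1: words with close meanings have close vectors.
print(cosine(VEC["Japan"], VEC["France"]), cosine(VEC["Japan"], VEC["server"]))

# Property 2: similar relationships give similar difference vectors.
same_relation = cosine(diff(VEC["Japan"], VEC["Tokyo"]),
                       diff(VEC["France"], VEC["Paris"]))
other_relation = cosine(diff(VEC["Japan"], VEC["Tokyo"]),
                        diff(VEC["Japan"], VEC["server"]))
print(same_relation, other_relation)
```

With these toy values, the two country-to-capital difference vectors point in the same direction, while the difference between "Japan" and an unrelated word does not.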

Based on the category dictionary data stored in the auxiliary storage device 103, the learning unit 402 determines the categories of the words whose distributed representations have been acquired by the acquisition unit 401. At this point, the learning unit 402 cannot determine a category for a word that is not registered in the category dictionary data. The learning unit 402 then learns, based on the distributed representations of the words whose categories have been determined, a category classification model used to classify the category of a word expressed as a distributed representation.
The learning unit 402 also learns a semantic relation classification model used to classify the semantic relation between words. A semantic relation between words is a relation between the meaning of one word and the meaning of the other. Semantic relations between words include, for example, a superordinate/subordinate concept relation (for example, the relation between "country" and "Japan"), a synonymy relation (for example, the relation between "Japan" and "the United States", both of which are subordinate concepts of "country"), and an inclusion relation (a relation in which one is contained in the other, for example, "car" and "tire").
The extraction unit 403 extracts words that satisfy a specific condition from the corpus data. For example, the extraction unit 403 extracts two words that are synonymous with each other, or extracts a word that has a semantic relation with a given word.
The update unit 404 updates the categories of words.

(Learning process for the category classification model)
FIG. 5 is a flowchart illustrating an example of processing of the information processing apparatus 100. An example of the learning process for the category classification model, which is used to classify word categories, is described with reference to FIG. 5.
In S501, the acquisition unit 401 extracts words from the corpus data stored in the auxiliary storage device 103.
In S502, based on the category dictionary stored in the auxiliary storage device 103, the acquisition unit 401 determines the categories of those words extracted in S501 that are registered in the category dictionary.
In S503, based on the corpus data stored in the auxiliary storage device 103, the acquisition unit 401 acquires distributed representations of all the words whose categories were determined in S502. The acquisition unit 401 acquires the distributed representation of a word by unsupervised learning (word2vec or the like) based on, for example, each word in the text data of the corpus data and the words surrounding it.
In S504, the learning unit 402 learns the category classification model based on the distributed representations acquired in S503 and the categories determined in S502.
For example, the learning unit 402 uses the distributed representations acquired in S503 and the category corresponding to each of them as training data, and learns, as the category classification model, a classifier that takes the distributed representation of a word as input and outputs the category of that word. The learning unit 402 stores this training data in the auxiliary storage device 103. The learning unit 402 uses, for example, naive Bayes or a support vector machine (SVM) as the classifier.
In this embodiment, in the processing of S504, the learning unit 402 learns the category classification model using, as training data, the distributed representations of all the words whose categories were determined in S502 and the corresponding categories. However, the learning unit 402 may instead use, as training data, the distributed representations and corresponding categories of a predetermined number of words randomly sampled from the words whose categories were determined in S502.
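The flow of S501 to S504 can be sketched in a few lines of Python. Note the substitutions: this sketch uses a nearest-centroid classifier written in plain Python as a simple stand-in for the naive Bayes or SVM classifier named above, and a hand-made two-dimensional embedding table as a stand-in for word2vec output. The words, categories, vectors, and function names are all hypothetical.

```python
import math

# Stand-in for the words extracted in S501 (hypothetical).
corpus_words = ["server", "router", "alice", "bob", "switch"]

# Stand-in for the category dictionary; "switch" is not registered.
category_dict = {
    "server": "equipment",
    "router": "equipment",
    "alice": "person",
    "bob": "person",
}

# Stand-in for distributed representations learned from the corpus (toy values).
EMBEDDINGS = {
    "server": [0.9, 0.1],
    "router": [0.8, 0.2],
    "alice": [0.1, 0.9],
    "bob": [0.2, 0.8],
    "switch": [0.85, 0.2],
}


def train_nearest_centroid(vectors, labels):
    """Return one mean vector (centroid) per category."""
    sums, counts = {}, {}
    for vec, label in zip(vectors, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}


def predict(centroids, vec):
    """Return the category whose centroid is closest to vec."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vec))


# S502: determine categories for the registered words only.
labeled = {w: category_dict[w] for w in corpus_words if w in category_dict}
# S503: acquire the distributed representations of those words.
X = [EMBEDDINGS[w] for w in labeled]
y = list(labeled.values())
# S504: learn the category classification model from (vector, category) pairs.
model = train_nearest_centroid(X, y)

# The learned model can then classify a word that is not in the dictionary.
print(predict(model, EMBEDDINGS["switch"]))
```

A nearest-centroid model is used here only because it fits in a few lines without external libraries; in practice any classifier that maps a vector to a category label could take its place.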

Through the processing of FIG. 5 described above, the information processing apparatus 100 learns the category classification model used to classify the category of a word expressed as a distributed representation. This enables the information processing apparatus 100 to classify the categories of words expressed as distributed representations. For example, the information processing apparatus 100 selects, from the words extracted from the corpus data, a word whose category is undetermined, and acquires the distributed representation of the selected word in the same way as in S503. By inputting the acquired distributed representation into the classification model learned in the processing of FIG. 5, the information processing apparatus 100 can determine the category of that word.
The information processing apparatus 100 can also assign semantic roles to predicates and words in a sentence based on the words whose categories have been determined. The information processing apparatus 100 extracts a predicate and the phrases that depend on that predicate from the sentence. If the category of a word included in an extracted phrase is undetermined, the information processing apparatus 100 acquires the distributed representation of that word and inputs it into the learned classification model to determine the category of the word. Then, based on, for example, the extracted predicate, the extracted phrase, and the category of the word included in that phrase, the information processing apparatus 100 can determine, from the semantic frame dictionary of FIG. 3, the semantic roles and meanings of the extracted predicate and of the words in the extracted phrase.
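The fallback just described, dictionary first and learned classifier only when the category is undetermined, can be sketched as follows. This is a rough illustration under assumptions: the frame dictionary is reduced to a mapping keyed by (predicate, category, case), the classifier is a trivial placeholder function rather than a learned model, and all names and entries are hypothetical.

```python
def resolve_category(word, category_dict, embeddings, classify):
    """Use the category dictionary if possible, otherwise the classifier."""
    category = category_dict.get(word)
    if category is not None:
        return category
    vec = embeddings.get(word)
    return classify(vec) if vec is not None else None


def assign_role(predicate, word, case, frame_dict, category_dict, embeddings, classify):
    """Look up (semantic role, predicate meaning) for a word depending on a predicate."""
    category = resolve_category(word, category_dict, embeddings, classify)
    return frame_dict.get((predicate, category, case))


# Hypothetical data for illustration only.
category_dict = {"server": "equipment"}
embeddings = {"server": [0.9, 0.1], "switch": [0.85, 0.2]}
frame_dict = {
    ("restart", "equipment", "wo"): ("object", "reboot a device"),
}


def classify(vec):
    """Placeholder standing in for the learned category classification model."""
    return "equipment" if vec[0] > vec[1] else "person"


# Registered word: the category comes from the dictionary.
print(assign_role("restart", "server", "wo", frame_dict, category_dict, embeddings, classify))
# Unregistered word: the category comes from the classifier.
print(assign_role("restart", "switch", "wo", frame_dict, category_dict, embeddings, classify))
```

A word with neither a dictionary entry nor a distributed representation still yields no category here, so the frame lookup returns None for it.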

(Learning process for the semantic relation classification model)
FIG. 6 is a flowchart illustrating an example of processing of the information processing apparatus 100. An example of the learning process for the semantic relation classification model, which is used to classify the semantic relation between words, is described with reference to FIG. 6.
In S601, the learning unit 402 determines whether to perform learning of the semantic relation classification model. For example, the learning unit 402 makes this determination based on information, stored in advance in the auxiliary storage device 103, that indicates whether learning of the semantic relation classification model is to be performed. If this information indicates that learning is to be performed, the learning unit 402 determines that it will perform the learning. If the information indicates that learning is not to be performed, the learning unit 402 determines that it will not perform the learning.
The CPU 101 can update this information based on, for example, a user operation through an input device connected via the input/output I/F 105. For example, when a previously learned semantic relation classification model is already stored in the auxiliary storage device 103, the user may consider that the model does not need to be learned. In that case, the user instructs the information processing apparatus 100, through the input device, to update the information so that it indicates that learning is not to be performed.

In S602, the acquisition unit 401 acquires distributed representations of the words in the corpus data by unsupervised learning (word2vec or the like) based on, for example, each word in the sentence data in the corpus data and the words surrounding it.
In S603, the learning unit 402 extracts, from a semantic relationship dictionary (for example, WordNet), sets each consisting of two words and the semantic relationship between them.
In S604, the learning unit 402 acquires, from the distributed representations acquired in S602, the distributed representations of the two words included in each of the sets extracted in S603. The learning unit 402 then uses, as teacher data, the pairs of the acquired distributed representations of the two words and the semantic relationship included in the corresponding set extracted in S603, and learns, as the semantic relationship classification model, a classifier that takes two words as input and outputs their semantic relationship. The learning unit 402 uses, for example, a support vector machine (SVM) as the classifier.
Through the processing of FIG. 6 described above, the information processing apparatus 100 learns the semantic relationship classification model used to classify the semantic relationship between two words expressed as distributed representations. The information processing apparatus 100 can thereby classify the semantic relationship between two words expressed as distributed representations. For example, by inputting the distributed representations of two words extracted from the corpus data into the semantic relationship classification model learned in the processing of FIG. 6, the information processing apparatus 100 can acquire the semantic relationship between those words.
In the present embodiment, in S604 the learning unit 402 learns the semantic relationship classification model using, as teacher data, the distributed representations of the two words included in all the sets extracted in S603 and the corresponding relationships. However, in S604 the learning unit 402 may instead learn the semantic relationship classification model using, as teacher data, the distributed representations of the two words included in a predetermined number of sets randomly sampled from the sets extracted in S603, together with the corresponding relationships.
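The flow of S602 to S604 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the 3-dimensional vectors in `EMB` and the sets in `PAIRS` are hypothetical toy data standing in for word2vec output and a semantic relationship dictionary, and a nearest-centroid classifier is swapped in for the SVM so that the example needs only the standard library.

```python
import math

# Hypothetical distributed representations (stand-ins for word2vec output, S602).
EMB = {
    "country": (1.0, 0.0, 1.0), "japan": (0.9, 0.3, 0.0), "us": (0.9, -0.3, 0.0),
    "sport": (0.0, 1.0, 1.0), "baseball": (0.3, 0.9, 0.0), "hockey": (-0.3, 0.9, 0.0),
}

# Hypothetical (word1, word2, relationship) sets, as would be extracted in S603.
PAIRS = [
    ("country", "japan", "hypernym"), ("sport", "baseball", "hypernym"),
    ("japan", "us", "synonym"), ("baseball", "hockey", "synonym"),
]

def features(w1, w2):
    """Classifier input: the two word vectors concatenated."""
    return EMB[w1] + EMB[w2]

def train(pairs):
    """Nearest-centroid stand-in for the SVM: one mean feature vector per relationship."""
    sums, counts = {}, {}
    for w1, w2, rel in pairs:
        f = features(w1, w2)
        acc = sums.setdefault(rel, [0.0] * len(f))
        for i, v in enumerate(f):
            acc[i] += v
        counts[rel] = counts.get(rel, 0) + 1
    return {rel: [v / counts[rel] for v in acc] for rel, acc in sums.items()}

def predict(model, w1, w2):
    """Return the relationship whose centroid is closest to the pair's features."""
    f = features(w1, w2)
    return min(model, key=lambda rel: math.dist(f, model[rel]))

model = train(PAIRS)
```

Calling `predict(model, "country", "us")` queries the learned model for a pair that was not in the teacher data. A real system would replace `train` and `predict` with an SVM or another classifier trained on far more sets.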

(Processing for identifying words with inappropriate categories)
When the teacher data contains words whose corresponding categories are inappropriate, or when the teacher data does not contain a sufficient number of words, the information processing apparatus 100 cannot appropriately classify the categories of words expressed as distributed representations using the category classification model learned in the processing of FIG. 5.
In such a case, the classification accuracy of the category classification model can be improved by updating the categories of those words, among the words classified using the category classification model, whose classification results are inappropriate, and learning the category classification model again based on the words whose categories have been updated. For example, when the category of a word included in the teacher data has been updated, learning the category classification model again based on the teacher data can improve the classification accuracy. Likewise, when the category of a word not included in the teacher data has been updated, adding that word to the teacher data to form new teacher data and learning the category classification model based on it can improve the classification accuracy.
Also, when the teacher data contains words whose category classification results are inappropriate, removing those words from the teacher data and learning the category classification model again can be expected to improve the classification accuracy.
To perform such processing and improve the accuracy of word category classification, it is necessary to identify words whose categories may be inappropriate.
Accordingly, there has been a demand for identifying words whose categories may be inappropriate in order to improve the accuracy of word category classification.

Distributed representations of words have the property that the more similar the relationship between two words is to the relationship between another two words, the more similar the vectors indicating those two relationships are in the semantic space of the distributed representation. For example, if the relationship between the word “Japan” and the word “baseball” is similar to the relationship between the word “US” and the word “hockey”, then the vector indicating the relationship between “Japan” and “baseball” in the semantic space of the distributed representation is similar to the vector indicating the relationship between “US” and “hockey” in that space.
The semantic space of the distributed representation is a space in which each coordinate represents a vector indicated by the distributed representation of a word. The vector indicated by a coordinate starts at the origin of the semantic space and ends at that coordinate. The number of dimensions of the semantic space equals the number of dimensions of the vectors indicated by the distributed representations of words. For example, if the semantic space has two dimensions, the coordinate (x, y) in the semantic space indicates the vector (x, y). A vector indicating the relationship between two words in the semantic space is a vector indicating how the words are related, for example, the difference between the vector indicated by one word and the vector indicated by the other.
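As a small worked example of this property, the sketch below computes two relationship vectors as differences and compares them by cosine similarity. The 2-dimensional vectors in `EMB` are hypothetical values chosen only for illustration; real distributed representations typically have hundreds of dimensions.

```python
import math

# Hypothetical 2-D distributed representations, chosen only for illustration.
EMB = {"japan": (1.0, 1.0), "baseball": (2.0, 3.0),
       "us": (4.0, 1.0), "hockey": (5.0, 3.2)}

def relation(src, dst):
    """Relationship vector: the vector of dst minus the vector of src."""
    return tuple(d - s for s, d in zip(EMB[src], EMB[dst]))

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

r1 = relation("japan", "baseball")   # (1.0, 2.0)
r2 = relation("us", "hockey")        # roughly (1.0, 2.2)
sim = cosine(r1, r2)                 # close to 1 when the relationships are similar
```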

From this property, the following can be assumed. Suppose there are a word (X) and a word (Y) that are synonymous with each other. Suppose further that there are a word (A) that has a relationship with the word (X), and a word (B) that has, with the word (Y), a relationship similar to the one between the word (X) and the word (A). Suppose also that the vector indicating the relationship between the word (B) and the word (Y) in the semantic space of the distributed representation is similar to the vector indicating the relationship between the word (A) and the word (X) in that space. In this case, the word (A) and the word (B) are words that have similar relationships with words that are synonymous with each other, and can be assumed to belong to the same category.
In the present embodiment, the information processing apparatus 100 uses the category classification model learned in the processing of FIG. 5 to classify the categories of the words included in the corpus data stored in the auxiliary storage device 103. Then, in the processing described later with reference to FIG. 7, the information processing apparatus 100 uses the assumption based on this property of distributed representations to identify, among the words in the corpus data whose categories have been classified, words whose categories may be inappropriate. The information processing apparatus 100 then updates the categories of the words so determined, and learns the category classification model again using the words with updated categories as teacher data.

FIG. 7 is a flowchart illustrating an example of processing of the information processing apparatus 100. With reference to FIG. 7, an example of processing will be described in which words whose categories may be inappropriate are determined, the categories of the determined words are updated, and the category classification model is learned again. In the processing of FIG. 7, the word (X) is an example of the first word, the word (Y) is an example of the third word, the word (A) is an example of the second word, and the word (B) is an example of the fourth word.
In S701, the extraction unit 403 extracts two words that are synonymous with each other from the corpus data stored in the auxiliary storage device 103. For example, the extraction unit 403 extracts two mutually synonymous words by extracting two words that have the same semantic relationship with a predetermined word. The extraction unit 403 extracts words having a semantic relationship with the predetermined word using, for example, the semantic relationship classification model learned in S604. For example, the extraction unit 403 inputs two words, the predetermined word and another word, into the semantic relationship classification model learned in S604 and acquires the semantic relationship between them. When the acquired semantic relationship indicates a synonymous relationship, the extraction unit 403 extracts the two words as two mutually synonymous words. The extraction unit 403 may also extract words having a semantic relationship with the predetermined word using, for example, a semantic relationship classification model learned in advance. In the present embodiment, the extraction unit 403 extracts two words, the word (X) (“Japan”) and the word (Y) (“US”), that have the same semantic relationship (a superordinate/subordinate concept relationship) with the predetermined word (“country”).
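The extraction in S701 of two words that share the same semantic relationship with a predetermined word might look like the sketch below. Everything here is an assumption for illustration: `RELATIONS` is a hypothetical lookup table standing in for the semantic relationship classification model, and the relationship label `"hyponym"` is an invented name for the superordinate/subordinate concept relationship.

```python
# Hypothetical stand-in for the semantic relationship classification model:
# a lookup returning the relationship of word2 to word1, or None if unknown.
RELATIONS = {
    ("country", "japan"): "hyponym",
    ("country", "us"): "hyponym",
    ("country", "border"): "related",
}

def classify_relation(word1, word2):
    """Return the relationship label for the pair, or None when unknown."""
    return RELATIONS.get((word1, word2))

def words_sharing_relation(anchor, vocabulary, relation):
    """Return the vocabulary words whose relationship to `anchor` is `relation`."""
    return [w for w in vocabulary if classify_relation(anchor, w) == relation]

vocab = ["japan", "border", "us", "baseball"]
siblings = words_sharing_relation("country", vocab, "hyponym")
word_x, word_y = siblings[:2]   # treated as the mutually synonymous pair (X), (Y)
```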

In S702, the extraction unit 403 extracts, from the corpus data, a word (A) that has a relationship with the word (X) extracted in S701. In the present embodiment, the extraction unit 403 extracts from the corpus data the word (A) (“baseball”), which has a co-occurrence relationship with the word (X) (“Japan”). For example, the extraction unit 403 identifies which words appear around the word (X), and with what probability, in the sentences included in the corpus data. Specifically, for each word that appears around the word (X) in those sentences, the extraction unit 403 counts the number of appearances and divides the count by the number of appearances of the word (X) in the corpus data to obtain the appearance probability of that word. The extraction unit 403 then extracts a word whose probability is equal to or greater than a predetermined threshold as the word (A) having a co-occurrence relationship with the word (X).
The extraction unit 403 may also extract a word having a predetermined semantic relationship with the word (X) using, for example, the semantic relationship classification model learned in S604. For example, the extraction unit 403 inputs the word (X) and another word into the semantic relationship classification model learned in S604, and when the output semantic relationship is the predetermined semantic relationship, extracts the word that was input together with the word (X) as a word having the predetermined semantic relationship with the word (X). Examples of semantic relationships include inclusion relationships and synonymous relationships. The extraction unit 403 may also extract a word having a semantic relationship with the word (X) using, for example, a semantic relationship classification model learned in advance and stored in the auxiliary storage device 103. The extraction unit 403 may also extract a word having a semantic relationship with the word (X) using, for example, a semantic relationship dictionary stored in advance in the auxiliary storage device 103.
The extraction unit 403 may also extract a word whose distributed representation is a vector similar to the distributed representation (vector) of the word (X) as the word (A) having a synonymous relationship with the word (X).
Hereinafter, the relationship between the word (X) extracted in S701 and the word (A) extracted in S702 is referred to as an extended co-occurrence relationship. The extended co-occurrence relationship is, for example, a co-occurrence relationship, a semantic relationship, or a similarity relationship between distributed representations (vectors).
The processing of S702 is an example of first specifying processing that specifies the second word.
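The co-occurrence calculation described for S702 can be sketched as follows. The description does not fix what counts as "around" the word (X), so the window of two tokens, the rule of counting each distinct neighbor once per occurrence of the target, and the threshold of 0.5 are all assumptions made for this illustration, as is the toy corpus.

```python
from collections import Counter

def cooccurring_words(sentences, target, window=2, threshold=0.5):
    """Return {word: probability} for words that appear within `window` tokens
    of `target`, where probability = neighbor count / target count, keeping
    only words whose probability is at least `threshold`."""
    neighbor_counts = Counter()
    target_count = 0
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            target_count += 1
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            # Count each distinct neighbor once per occurrence of the target.
            neighbor_counts.update(set(tokens[lo:i] + tokens[i + 1:hi]))
    if target_count == 0:
        return {}
    return {w: c / target_count for w, c in neighbor_counts.items()
            if c / target_count >= threshold}

# Hypothetical tokenized corpus.
corpus = [
    ["japan", "loves", "baseball"],
    ["baseball", "in", "japan"],
    ["japan", "exports", "cars"],
]
result = cooccurring_words(corpus, "japan")
```

Here "baseball" appears near two of the three occurrences of "japan", so it is the only word that reaches the threshold and would be extracted as the word (A).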

In S703, the extraction unit 403 extracts, from the corpus data, a word (B) that has, with the word (Y) extracted in S701, a relationship similar to the extended co-occurrence relationship between the word (X) and the word (A), and whose relationship with the word (Y) in the semantic space of the distributed representation is similar to the relationship between the word (X) and the word (A) in that space. In the present embodiment, the extraction unit 403 regards the relationship between the word (X) and the word (A) and the relationship between the word (Y) and the word (B) as similar in the semantic space when the vector indicating the relationship between the word (X) and the word (A) (for example, the vector given by word (A) - word (X)) is similar to the vector indicating the relationship between the word (Y) and the word (B) (for example, the vector given by word (B) - word (Y)).
The extraction unit 403 therefore extracts the word (B) such that the vector indicating the relationship between the word (X) and the word (A) in the semantic space (for example, the vector given by word (A) - word (X)) is similar to the vector indicating the relationship between the word (Y) and the word (B) (for example, the vector given by word (B) - word (Y)). The vector word (A) - word (X) is an example of a second relationship vector indicating the relationship between the first word and the second word in the semantic space of the distributed representation.

The extraction unit 403 extracts the word (B), for example, as follows. The extraction unit 403 extracts, from the corpus data, words (B) having a co-occurrence relationship with the word (Y). For example, the extraction unit 403 identifies which words appear around the word (Y), and with what probability, in the sentences included in the corpus data. Specifically, for each word that appears around the word (Y) in those sentences, the extraction unit 403 counts the number of appearances and divides the count by the number of appearances of the word (Y) in the corpus data to obtain the appearance probability of that word. The extraction unit 403 then extracts a word whose probability is equal to or greater than a predetermined threshold as a word (B) having a co-occurrence relationship with the word (Y). The words extracted in this way serve as candidates for the word (B).
The extraction unit 403 may also use the semantic relationship classification model learned in S604 to extract, as candidates for the word (B), words that have with the word (Y) a relationship similar to the one between the word (X) and the word (A). The extraction unit 403 may also extract a word having a semantic relationship with the word (Y) using, for example, a semantic relationship classification model learned in advance and stored in the auxiliary storage device 103. The extraction unit 403 may also extract a word having a semantic relationship with the word (Y) using, for example, a semantic relationship dictionary stored in advance in the auxiliary storage device 103. The extraction unit 403 may also extract a word whose distributed representation is a vector similar to the distributed representation (vector) of the word (Y) as a word (B) having a synonymous relationship with the word (Y).
Then, from the candidates for the word (B), the extraction unit 403 specifies as the word (B) a word whose relationship with the word (Y) in the semantic space of the distributed representation is similar to the relationship between the word (X) and the word (A) in that space. For example, the extraction unit 403 acquires the vector indicating the difference between the vector indicated by the word (A) and the vector indicated by the word (X) as the vector indicating the relationship between the word (X) and the word (A) in the semantic space. The extraction unit 403 then acquires the vector obtained by adding the acquired vector to the vector indicated by the word (Y). This vector indicates a word whose relationship with the word (Y) in the semantic space is identical to the relationship between the word (X) and the word (A).

The extraction unit 403 acquires the similarity between this vector and the vector indicated by each candidate for the word (B). The similarity between vectors is an index of how similar the vectors are; examples include the cosine similarity, the inner product, and the absolute value of the vector indicating the difference between the vectors. In the present embodiment, the extraction unit 403 acquires the cosine similarity between this vector and the vector indicated by each candidate for the word (B).
The extraction unit 403 then identifies, among the acquired cosine similarities, those equal to or greater than a predetermined threshold. When more than one is equal to or greater than the threshold, the extraction unit 403 identifies the largest. The extraction unit 403 specifies, as the word (B), the candidate corresponding to the identified similarity. In this way, the extraction unit 403 can specify the word (B) such that the vector indicating the relationship between the word (X) and the word (A) is similar to the vector indicating the relationship between the word (Y) and the word (B) (that is, such that the relationship between the word (X) and the word (A) is similar to the relationship between the word (Y) and the word (B)).
In the present embodiment, the word specified as the word (B) is assumed to be “hockey”.
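The selection of the word (B) from its candidates can be sketched as follows: form the target vector word (A) - word (X) + word (Y), score each candidate by cosine similarity, discard those below the threshold, and take the maximum. The vectors in `EMB`, the candidate list, and the threshold of 0.9 are hypothetical values for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def select_word_b(emb, x, a, y, candidates, threshold=0.9):
    """Return the candidate most similar to emb[a] - emb[x] + emb[y],
    or None if no candidate reaches the cosine-similarity threshold."""
    target = tuple(va - vx + vy for vx, va, vy in zip(emb[x], emb[a], emb[y]))
    scored = [(cosine(target, emb[c]), c) for c in candidates]
    passing = [sc for sc in scored if sc[0] >= threshold]
    return max(passing)[1] if passing else None

# Hypothetical 2-D distributed representations.
EMB = {"japan": (1.0, 1.0), "baseball": (2.0, 3.0), "us": (4.0, 1.0),
       "hockey": (5.0, 3.2), "football": (5.0, 2.0), "tax": (1.0, -2.0)}
word_b = select_word_b(EMB, "japan", "baseball", "us",
                       ["hockey", "football", "tax"])
```

With these toy vectors both "hockey" and "football" exceed the threshold, and "hockey" is chosen because its similarity to the target vector is the largest.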

FIG. 8 is a diagram illustrating an example of the relationship among the word (X), the word (Y), the word (A), and the word (B) in the semantic space of the distributed representation. It can be seen that the vector indicating the relationship between the word (X) and the word (A) (word (A) - word (X)) is similar to the vector indicating the relationship between the word (Y) and the word (B) (word (B) - word (Y)). As FIG. 8 shows, the word (X), the word (Y), the word (A), and the word (B) form an approximate parallelogram in the semantic space of the distributed representation. However, four words that form an approximate parallelogram in the semantic space do not always satisfy the relationships that hold among the word (X), the word (Y), the word (A), and the word (B) (namely, that the word (X) and the word (Y) are synonymous, that a relationship (extended co-occurrence relationship) exists between the word (X) and the word (A), and that a similar relationship exists between the word (Y) and the word (B)).
FIG. 9 is a diagram illustrating an example of the relationships among the word (X), the word (Y), the word (A), and the word (B). The word (X) and the word (Y) are both subordinate concepts of the word “country” and are shown to be synonymous with each other. The word (A) and the word (B) are shown to have co-occurrence relationships with the word (X) and the word (Y), respectively.

The processing of S703 is an example of second specifying processing that specifies the fourth word.
In S704, the extraction unit 403 determines whether the category of the word (A) extracted in S702 differs from the category of the word (B) extracted in S703.
When the extraction unit 403 determines that the category of the word (A) extracted in S702 differs from the category of the word (B) extracted in S703, it determines that the categories of the word (A) and the word (B) may be inappropriate, and proceeds to S705. When the extraction unit 403 determines that the two categories are the same, it regards the categories of the word (A) and the word (B) as appropriate and ends the processing of FIG. 7. The processing of S704 is an example of determination processing that determines, based on whether the category of the second word differs from the category of the fourth word, whether the categories of the second word and the fourth word may be inappropriate.

In S705, the update unit 404 updates the category of one or both of the word (A) and the word (B), whose categories were determined in S704 to be possibly inappropriate.
For example, when the category of one of the word (A) and the word (B) is confirmed, the update unit 404 updates the category of the other to that same category. Suppose, for example, that one of the word (A) and the word (B) is included in dictionary data, stored in advance in the auxiliary storage device 103, that indicates the categories of words, and that its category matches the category indicated by the dictionary data. In that case, the update unit 404 determines that word to be a word whose category is confirmed, and updates the category of the other word to the same category.
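The check of S704 and the dictionary-based update of S705 can be sketched together as below. The function name, the dictionaries, and the category labels are hypothetical. The handling of the case where neither word or both words are confirmed by the dictionary is not specified in the description; returning without a change (for example, to fall back to user designation) is an assumption of this sketch.

```python
def update_categories(assigned, dictionary, word_a, word_b):
    """If the categories of A and B differ and exactly one of the words is
    confirmed by the dictionary, copy the confirmed category to the other
    word. Returns the list of words whose category changed."""
    if assigned[word_a] == assigned[word_b]:
        return []  # S704: the categories agree, nothing to update

    def confirmed(word):
        # Confirmed: the word is in the dictionary with the same category.
        return word in dictionary and dictionary[word] == assigned[word]

    if confirmed(word_a) and not confirmed(word_b):
        assigned[word_b] = assigned[word_a]
        return [word_b]
    if confirmed(word_b) and not confirmed(word_a):
        assigned[word_a] = assigned[word_b]
        return [word_a]
    return []  # cannot decide here; assumed to be left to another method

# Hypothetical classification results and category dictionary.
assigned = {"baseball": "sport", "hockey": "food"}
dictionary = {"baseball": "sport"}
changed = update_categories(assigned, dictionary, "baseball", "hockey")
```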

The update unit 404 may also perform the following processing. The update unit 404 outputs, to an output device connected via the input/output I/F 105, information indicating that the categories of the word (A) and the word (B) may be inappropriate. For example, the update unit 404 outputs this information by displaying it on a display, which is an output device connected via the input/output I/F 105. The update unit 404 can thereby present to the user that the categories of the word (A) and the word (B) may be inappropriate. The update unit 404 may further display on the display, for example, a designation screen used to designate categories for the word (A) and the word (B). The update unit 404 may then accept designation of a category for one or both of the word (A) and the word (B) based on a user operation on the designation screen via an input device connected via the input/output I/F 105. The update unit 404 updates the category of one or both of the word (A) and the word (B) to the category indicated by the accepted designation.
The designation screen includes, for example, category input fields corresponding to the word (A) and the word (B). In this case, the update unit 404 can accept the designation of a category by acquiring the information entered in the category input fields of the designation screen.

In S706, the learning unit 402 learns the category classification model again based on the words whose categories were updated in S705. When a word whose category was updated in S705 is not included in the teacher data stored in the auxiliary storage device 103, the learning unit 402 learns the category classification model based on the teacher data and the word whose category was updated in S705.
When a word whose category was updated in S705 is included in the teacher data stored in the auxiliary storage device 103, the learning unit 402 performs the following processing. For that word, the learning unit 402 updates the corresponding category in the teacher data, and learns the category classification model based on the updated teacher data.
The extraction unit 403 can also determine whether the categories of the word (X) and the word (Y) differ. When the extraction unit 403 determines that they differ, the update unit 404 updates, by the same method as in S705, the category of one or both of the word (X) and the word (Y), whose categories may be inappropriate. The learning unit 402 then learns the category classification model based on the words whose categories have been updated.
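The two cases of S706 (an updated word that is already in the teacher data, and one that is not) reduce to the same merge when the teacher data is viewed as a mapping from word to category. The sketch below assumes that representation; the function name and the sample data are hypothetical, and the retraining step itself is omitted.

```python
def merge_teacher_data(teacher_data, updated):
    """Return new teacher data in which words already present have their
    category overwritten and words not yet present are added."""
    merged = dict(teacher_data)  # copy so the stored teacher data is untouched
    merged.update(updated)
    return merged

# Hypothetical teacher data and S705 update results.
teacher = {"baseball": "sport", "sushi": "food"}
updated = {"hockey": "sport", "sushi": "dish"}
new_teacher = merge_teacher_data(teacher, updated)
```

The category classification model would then be learned again from `new_teacher` in the same way as in the processing of FIG. 5.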

(Effects)
As described above, in the present embodiment, the information processing apparatus 100 learns, based on teacher data composed of words whose categories are known, a category classification model used to classify the categories of words expressed as distributed representations. Because words expressed as distributed representations can thereby be classified into categories, the information processing apparatus 100 can support improving the accuracy of word category classification.
The information processing apparatus 100 also determines, from among the words that have been classified into categories, words whose categories may be inappropriate. This likewise supports improving the accuracy of word category classification. For comparison, if one assumes that synonymous words belong to the same category, a synonym dictionary alone makes the following processing possible: extract two synonymous words (for example, the word (X) and the word (Y) in the present embodiment) and, if their categories do not match, determine them to be words whose categories may be inappropriate. With that method, however, it cannot be determined whether the categories of two words may be inappropriate unless the two words are directly synonymous. With the processing of the present embodiment, by contrast, the information processing apparatus 100 can make this determination for two words that respectively have the same relationship (an extended co-occurrence relationship) with each of two directly synonymous words, even if these two words are not themselves directly synonymous.

The information processing apparatus 100 also updates the category of a word determined to have a possibly inappropriate category and learns the category classification model again based on the word whose category was updated. The information processing apparatus 100 can thereby improve the accuracy of classification by the category classification model.
For example, in the technical terms or in-house terms of a specific field (industry or line of business), special words may be used, or words may carry special meanings. In such cases, a word may not be listed in a dictionary, or its meaning may differ from the meaning given in the dictionary. For example, the word “host” means “a master (who entertains guests)” and belongs to the category “person”. In the IT field, however, the word “host” means “a host computer (that provides services to other machines)” and belongs to the category “thing”.
In such cases, a category classification model learned from teacher data containing general-vocabulary words and their categories may fail to classify word categories appropriately. Even then, the information processing apparatus 100 determines words whose categories may be inappropriate, updates their categories, and learns the category classification model again based on the words whose categories were updated, thereby improving the accuracy with which the category classification model classifies words of the specific field.
In this way, the information processing apparatus 100 does not need new dictionary data or the like to improve the classification accuracy of the category classification model, and the cost of preparing such data can be reduced. Moreover, even when the meaning of words changes over time, so that new technical terms emerge or existing words come to be used with new meanings, the information processing apparatus 100 can additionally train the category classification model while adapting to such changes.

(Modifications)
In the present embodiment, the information processing apparatus 100 is a single information processing apparatus. However, the information processing apparatus 100 may be configured as a system including a plurality of information processing apparatuses communicably connected to one another via a network (a LAN or the Internet). In that case, the CPUs of the plurality of information processing apparatuses included in the information processing apparatus 100 cooperatively execute processing based on programs stored in the auxiliary storage devices of the respective apparatuses, thereby realizing the functions of FIG. 4 and the processing of the flowcharts of FIGS. 5, 6, and 7.
In the present embodiment, the information processing apparatus 100 identifies words whose categories may be inappropriate from among the words in the corpus data classified using the category classification model learned in the processing of FIG. 5. However, the information processing apparatus 100 may instead identify words whose categories may be inappropriate from among words whose categories have been set in advance.

In the present embodiment, in S703 the extraction unit 403 specifies, as the word (B), a word corresponding to a vector similar to the vector obtained by adding the vector of the word (A) to the vector of the word (Y) in the semantic space of the distributed representation and subtracting the vector of the word (X). However, in S703 the extraction unit 403 may instead specify the word (B) based on the similarity between the vector obtained by subtracting the vector of the word (X) from the vector of the word (A) in the semantic space and the vector obtained by subtracting the vector of the word (Y) from the vector of each candidate for the word (B). The vector obtained by subtracting the vector of the word (Y) from the vector of a candidate for the word (B) is an example of a first relation vector indicating the relationship between the third word and a candidate word in the semantic space of the distributed representation.
For example, the extraction unit 403 acquires, for each candidate for the word (B), the similarity between the vector obtained by subtracting the vector of the word (X) from the vector of the word (A) in the semantic space and the vector obtained by subtracting the vector of the word (Y) from the vector of that candidate. The extraction unit 403 may then specify, as the word (B), a candidate whose acquired similarity is equal to or greater than a predetermined threshold. When a plurality of candidates have an acquired similarity equal to or greater than the predetermined threshold, the extraction unit 403 may specify the candidate with the highest similarity as the word (B).
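The relation-vector variant of S703 can be sketched as follows. This is only an illustration under stated assumptions: cosine similarity is used as the similarity measure (the text says only "similarity"), the threshold value 0.8 is arbitrary, and the function names and toy vectors are hypothetical.

```python
import math

def subtract(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero length."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def find_word_b(vectors, x, a, y, candidates, threshold=0.8):
    """Pick the candidate whose (candidate - Y) best matches (A - X).

    Returns None when no candidate reaches the threshold.
    """
    second_relation = subtract(vectors[a], vectors[x])  # relation of X to A
    scored = []
    for cand in candidates:
        first_relation = subtract(vectors[cand], vectors[y])  # relation of Y to candidate
        sim = cosine(first_relation, second_relation)
        if sim >= threshold:
            scored.append((sim, cand))
    if not scored:
        return None
    # Several candidates may pass the threshold; keep the most similar one.
    return max(scored)[1]

# Toy two-dimensional vectors (invented for this sketch).
vectors = {"x": [1.0, 0.0], "a": [1.0, 1.0], "y": [2.0, 0.0],
           "b1": [2.0, 1.0], "b2": [3.0, 0.0], "b3": [2.2, 1.0]}
word_b = find_word_b(vectors, "x", "a", "y", ["b1", "b2", "b3"])
```

Comparing difference vectors rather than searching near Y + A - X makes the threshold apply to the relationship itself, independent of where Y lies in the space.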

In the present embodiment, the information processing apparatus 100 determines words whose categories may be inappropriate by performing the processing of S701 to S704. Words in a hypernym-hyponym relationship (a superordinate concept and a subordinate concept) can be assumed to belong to the same category. The information processing apparatus 100 may therefore determine words whose categories may be inappropriate by performing the following processing in addition to S701 to S704.
That is, for a given word, the information processing apparatus 100 specifies a word that is a superordinate or subordinate concept of that word based on a semantic classification model. The information processing apparatus 100 then compares the category of the specified word with that of the original word and, if they differ, may determine these words to be words whose categories may be inappropriate. For the words so determined, the information processing apparatus 100 updates the categories by the same processing as in S705 and learns the category classification model based on the words whose categories were updated. The information processing apparatus 100 can thereby further improve the classification accuracy of the category classification model.
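The hypernym and hyponym check described above might look like the following sketch. The semantic classification model is replaced here by a hypothetical lookup table `hypernym_of` (word to superordinate word), and `flag_hierarchy_mismatches` and the sample entries are likewise assumptions made purely for illustration.

```python
def flag_hierarchy_mismatches(categories, hypernym_of):
    """Return (word, hypernym) pairs whose categories differ.

    categories:  dict mapping word -> category
    hypernym_of: dict mapping word -> its superordinate word
                 (stand-in for the semantic classification model)
    Pairs in which either word has no category yet are skipped.
    """
    flagged = []
    for word, parent in hypernym_of.items():
        word_cat = categories.get(word)
        parent_cat = categories.get(parent)
        if word_cat is None or parent_cat is None:
            continue
        if word_cat != parent_cat:
            flagged.append((word, parent))
    return flagged

# Toy data (invented for this sketch).
categories = {"host": "person", "server": "thing", "computer": "thing"}
hypernym_of = {"host": "computer", "server": "computer"}
suspects = flag_hierarchy_mismatches(categories, hypernym_of)
```

Each flagged pair would then go through the same category update as in S705 before the model is learned again.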

Although preferred embodiments of the present invention have been described in detail above, the present invention is not limited to these specific embodiments.
For example, part or all of the functional configuration of the information processing apparatus 100 described above may be implemented in the information processing apparatus 100 as hardware.

100 Information processing apparatus
101 CPU
402 Learning unit

Claims

An information processing apparatus comprising:
extraction means for extracting words from a plurality of sentences;
determining means for determining, based on a category dictionary including information on categories corresponding to words, the category of each word registered in the category dictionary among the words extracted by the extraction means;
acquiring means for acquiring, based on the plurality of sentences, a distributed representation of each word whose category has been determined by the determining means; and
learning means for learning, based on the distributed representation acquired by the acquiring means and the category determined by the determining means, a classification model for classifying the categories of words expressed as distributed representations.

The information processing apparatus according to claim 1, further comprising:
classification means for classifying the category of a word included in a phrase that depends on a predicate of an input sentence; and
assigning means for assigning, using a semantic frame dictionary, a word sense to the predicate and a semantic role to the phrase, based on the predicate, the phrase, and the category of the word classified by the classification means.

The information processing apparatus according to claim 1, further comprising:
first specifying means for specifying a second word having a relationship with a first word;
second specifying means for specifying a fourth word that has the relationship with a third word similar to the first word, and whose relationship with the third word in a semantic space of the distributed representation is similar to the relationship, in the semantic space, between the first word and the second word specified by the first specifying means;
determining means for determining, based on whether the category of the second word specified by the first specifying means differs from the category of the fourth word specified by the second specifying means, whether the categories of the second word and the fourth word may be inappropriate; and
updating means for updating the category of one or both of the second word and the fourth word when the determining means determines that the categories of the second word and the fourth word may be inappropriate,
wherein the learning means learns the classification model again based on the word whose category has been updated by the updating means.

The semantic space is a space in which each coordinate indicates a vector represented by the distributed representation of a word,
The second specifying means takes a word having the relationship with the third word as a candidate word and specifies the candidate word as the fourth word when a first relation vector indicating the relationship between the third word and the candidate word in the semantic space is similar to a second relation vector indicating the relationship between the first word and the second word in the semantic space. Information processing device.

The information processing apparatus according to claim 4, wherein the second specifying means specifies the candidate word as the fourth word when the first relation vector, which is a vector indicating the difference between the vector of the third word and the vector of the candidate word in the semantic space, is similar to the second relation vector, which is a vector indicating the difference between the vector of the first word and the vector of the second word in the semantic space.

The information processing apparatus according to claim 3, wherein the semantic space is a space in which each coordinate indicates a vector represented by the distributed representation of a word, and
the second specifying means takes a word having the relationship with the third word as a candidate word and specifies the candidate word as the fourth word when the similarity, in the semantic space, between the vector of the candidate word and the vector obtained by subtracting the vector of the first word from the vector of the third word and adding the vector of the second word specified by the first specifying means is equal to or greater than a predetermined threshold.

The information processing apparatus according to claim 3, further comprising output means for outputting information indicating whether the category of the second word specified by the first specifying means differs from the category of the fourth word specified by the second specifying means.

The information processing apparatus according to claim 3, further comprising accepting means for accepting, when the category of the second word specified by the first specifying means differs from the category of the fourth word specified by the second specifying means, a designation of the category of one or both of the second word and the fourth word,
wherein, when the category of the second word differs from the category of the fourth word, the updating means updates the category of one or both of the second word and the fourth word based on the designation accepted by the accepting means.

The updating means, when the category of the second word specified by the first specifying means differs from the category of the fourth word specified by the second specifying means and a category has been established for one of the second word and the fourth word, updates the category of the other of the second word and the fourth word to the category of the one. 8. The information processing apparatus according to any one of 7 to 7.

An information processing apparatus comprising:
first specifying means for specifying a second word having a relationship with a first word;
second specifying means for specifying a fourth word that has the relationship with a third word similar to the first word, and whose relationship with the third word in a semantic space of distributed representations is similar to the relationship, in the semantic space, between the first word and the second word specified by the first specifying means; and
determining means for determining, based on whether the category of the second word specified by the first specifying means differs from the category of the fourth word specified by the second specifying means, whether the categories of the second word and the fourth word may be inappropriate.

An information processing method executed by an information processing apparatus, the method comprising:
an extraction step of extracting words from a plurality of sentences;
a determination step of determining, based on a category dictionary including information on categories corresponding to words, the category of each word registered in the category dictionary among the words extracted in the extraction step;
an acquisition step of acquiring, based on the plurality of sentences, a distributed representation of each word whose category has been determined in the determination step; and
a learning step of learning, based on the distributed representation acquired in the acquisition step and the category determined in the determination step, a classification model for classifying the categories of words expressed as distributed representations.

An information processing method executed by an information processing apparatus, the method comprising:
a first specifying step of specifying a second word having a relationship with a first word;
a second specifying step of specifying a fourth word that has the relationship with a third word similar to the first word, and whose relationship with the third word in a semantic space of distributed representations is similar to the relationship, in the semantic space, between the first word and the second word specified in the first specifying step; and
a determining step of determining, based on whether the category of the second word specified in the first specifying step differs from the category of the fourth word specified in the second specifying step, whether the categories of the second word and the fourth word may be inappropriate.

A program for causing a computer to function as each unit of the information processing apparatus according to any one of claims 1 to 9.

A program for causing a computer to function as each means of the information processing apparatus according to claim 10.