JP2020042330A - Information processing apparatus, data classification method and program

Info

Publication number: JP2020042330A
Application number: JP2018166803A
Authority: JP
Inventors: Shintaro Kawamura; Masahiko Shinomiya; Kai Yu; Katsumi Kanezaki; Shoichi Naito
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2018-09-06
Filing date: 2018-09-06
Publication date: 2020-03-19
Anticipated expiration: 2038-09-06
Also published as: JP7087851B2

Abstract

PROBLEM TO BE SOLVED: To reduce the labeling load for data to be classified and to improve the classification accuracy of data used for natural language processing. SOLUTION: For a specific category, a database server 30 extracts feature amounts of sample data 210, which is the portion of text data 200 labeled with attribute information indicating whether each item is a positive example or a negative example of that category, and generates a first learning model by unsupervised learning using the extracted feature amounts. The database server 30 then sets constraints for classifying the text data 200 based on the attributes of the clusters included in the first learning model and the attribute information labeled on the sample data 210 belonging to those clusters, and generates a second learning model by semi-supervised learning using the set constraints. SELECTED DRAWING: FIG. 16

Description

The present invention relates to an information processing apparatus, a data classification method, and a program.

Research that applies machine-learning-based natural language processing to tasks such as machine translation, information retrieval, and question answering is being actively pursued. In fields that use machine learning, a sufficiently large data set for use as learning data is needed to raise learning accuracy.

For such data sets, there are machine learning algorithms that classify learning data by attribute. Patent Document 1 discloses a method of training a classifier that uses natural language processing. Known machine learning approaches include pattern matching, unsupervised learning, semi-supervised learning, and supervised learning.

With conventional methods, however, the attribute of each item in a large volume of data to be classified must be labeled by hand, which makes the labeling load heavy. On the other hand, approaches that need no labeling, such as unsupervised learning, reduce classification accuracy. The problem to be addressed is therefore to reduce the labeling load for the data to be classified while improving the classification accuracy of data used for natural language processing.

The information processing apparatus according to claim 1 classifies text data used for natural language processing with respect to a specific category, and comprises: feature amount extraction means for extracting feature amounts of sample data, which is text data labeled with a positive/negative label indicating whether it is a positive example or a negative example of the category; first generation means for generating a first learning model based on unsupervised learning using the extracted feature amounts; cluster attribute specifying means for specifying, based on the positive/negative labels on the sample data, whether each cluster included in the generated first learning model is a set having the positive-example attribute or the negative-example attribute; constraint setting means for setting constraints for the classification based on the specified cluster attributes and the positive/negative labels on the sample data belonging to each cluster; second generation means for generating a second learning model based on semi-supervised learning using the set constraints; and classification means for classifying unknown data, which is text data without a positive/negative label, into the clusters included in the generated second learning model.

According to the present invention, the labeling load for data to be classified can be reduced and the classification accuracy of data used for natural language processing can be improved.

FIG. 1 is a diagram illustrating an example of the system configuration of a conference system according to the embodiment.
FIG. 2 is a diagram illustrating an example of the hardware configuration of a computer according to the embodiment.
FIG. 3 is a diagram illustrating an example of the functional configuration of the database server according to the embodiment.
FIG. 4 is a diagram illustrating an example of the category management table according to the embodiment.
FIG. 5 is a diagram illustrating an example of text data according to the embodiment.
FIG. 6 is a diagram illustrating an example of the functional configuration of the feature amount extraction unit according to the embodiment.
FIG. 7 is a diagram illustrating an example of the functional configuration of the constraint setting unit according to the embodiment.
FIG. 8 is a flowchart illustrating an example of the data classification process in the database server according to the embodiment.
FIG. 9 is a flowchart illustrating an example of the feature amount extraction process in the database server according to the embodiment.
FIG. 10 is a flowchart illustrating an example of the category classification process in the database server according to the embodiment.
FIG. 11 is a conceptual diagram for explaining an example of the first learning model generated by unsupervised learning.
FIG. 12 is a conceptual diagram for explaining an example of clusters in the first learning model whose attributes have been specified.
FIG. 13 is a conceptual diagram for explaining the incorrect-answer vector N_mis.
FIG. 14 is a conceptual diagram for explaining an example of data links generated for sample data belonging to the region indicated by the incorrect-answer vector N_mis.
FIG. 15 is a conceptual diagram for explaining an example of the second learning model generated by semi-supervised learning according to the embodiment.
FIG. 16 is a conceptual diagram for explaining an example of unknown data classified into the clusters included in the second learning model according to the embodiment.

Hereinafter, embodiments for carrying out the invention are described with reference to the drawings. In the description of the drawings, the same elements are denoted by the same reference numerals, and redundant description is omitted.

● System configuration ●
FIG. 1 is a diagram illustrating an example of a system to which the database server according to the embodiment is applied. The conference system 1 illustrated in FIG. 1 is an example in which a learning model generated through machine learning by the database server 30 according to the present embodiment is used for a conference held with the communication terminal 70. The conference system 1 is, for example, a system that can perform natural language processing on audio data collected by the communication terminal 70, using the learning model generated by the database server 30.

The conference system 1 includes a management server 10, a database server 30, a WEB server 50, and a communication terminal 70. The devices constituting the conference system 1 are connected to one another via a communication network 5. The communication network 5 is built from, for example, a LAN (Local Area Network), a dedicated line, and the Internet. The communication network 5 is not limited to wired connections and may include sections where communication is wireless, such as Wi-Fi (Wireless Fidelity) or Bluetooth (registered trademark).

The management server 10, the database server 30, and the WEB server 50 constitute a management system 2. The management system 2 is a system that performs natural language processing on utterance record data, such as audio data collected by the communication terminal 70. The management server 10 is a server computer that provides the communication terminal 70, via the communication network 5, with applications and the like for implementing various functions.

The database server 30 is a server computer that stores a plurality of items of text data (a data set) used for natural language processing. The database server 30 also functions as a classifier that uses machine learning to classify the data set by attribute of a specific category. In the present embodiment, the conference system 1 performs natural language processing on, for example, data generated by the communication terminal 70, using a learning model in which the database server 30 has classified data by the presence or absence of conversation elements.

The WEB server 50 is a server device that provides the database server 30 or the communication terminal 70 with a relay function through a WEB service (HTTP: Hypertext Transfer Protocol communication). The WEB server 50 transmits the text data 200 used for natural language processing to the database server 30 via the WEB service. The functions of the WEB server 50 may instead be provided in the database server 30 and the communication terminal 70.

The communication terminal 70 is a terminal device, such as a notebook PC (Personal Computer), used by a user of the conference system 1. The user holds a conference using a specific application, such as a conference application, installed on the communication terminal 70. The communication terminal 70 transmits to the management system 2 audio data that captures the user's remarks and the like made during the conference. By receiving converted data, such as minutes produced through natural language processing by the management system 2, the communication terminal 70 can create the minutes of the conference automatically. The communication terminal 70 is not limited to a notebook PC as long as it has a communication function that can connect to the communication network 5. It may be a desktop PC, a tablet terminal, a smartphone, an electronic blackboard, a car navigation device, or a sound collection device such as a microphone. Although FIG. 1 shows a single communication terminal 70, the number of communication terminals 70 is not limited to one, and the conference system 1 may include a plurality of communication terminals 70.

Although FIG. 1 illustrates a conference system, the use of the system shown in FIG. 1 is not limited to conferences, and it may serve any given event that requires natural language processing. For example, the conference system 1 may be applied to events that use speech-to-text conversion of audio data, such as meetings, gatherings, get-togethers, consultations, and briefings. The conference system 1 may also be applied to events such as information retrieval using the communication terminal 70. Furthermore, the functions of the management server 10 and the database server 30 may be implemented by a single server, and the functions of the database server 30 may be implemented by a plurality of servers.

● Hardware configuration ●
Next, the hardware configuration of each device according to the embodiment is described. Each device constituting the conference system 1 shown in FIG. 1 has the configuration of a general-purpose computer. An example hardware configuration of such a computer is described here.

FIG. 2 is a diagram illustrating an example of the hardware configuration of a computer according to the embodiment. The computer 100 includes a CPU (Central Processing Unit) 101, a ROM (Read Only Memory) 102, a RAM (Random Access Memory) 103, a storage 104, an input/output interface (I/F) 105, a network interface (I/F) 106, and a bus line 107.

The CPU 101 is an arithmetic unit that implements each function of the computer 100 by reading programs and data according to the present invention, stored in the ROM 102, the storage 104, or the like, onto the RAM 103 and executing processing. For example, the database server 30 implements the data classification method according to the present invention by executing the program according to the present invention.

The ROM 102 is a nonvolatile memory that retains programs and data even when the power is turned off. The ROM 102 is, for example, a flash ROM. SDKs (Software Development Kits) for a variety of uses are installed in the ROM 102, and the functions of the computer 100, network connections, and the like can be implemented using SDK applications.

The RAM 103 is a volatile memory used, for example, as a work area for the CPU 101. The storage 104 is a storage device such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive). The storage 104 stores, for example, an OS (Operating System), application programs, and various data.

The input/output I/F 105 is an interface for connecting external devices to the computer 100. External devices include, for example, a recording medium 105a such as a USB (Universal Serial Bus) memory, a memory card, or an optical disc, as well as various electronic devices.

The network I/F 106 is an interface for data communication via the communication network 5. The network I/F 106 is, for example, a wireless LAN communication interface. The network I/F 106 may also include a communication interface for wired LAN, 3G (3rd Generation), LTE (Long Term Evolution), 4G (4th Generation), 5G (5th Generation), or millimeter-wave wireless communication.

The bus line 107 is connected in common to each of the above components and carries address signals, data signals, various control signals, and the like. The CPU 101, the ROM 102, the RAM 103, the storage 104, the input/output I/F 105, and the network I/F 106 are connected to one another via the bus line 107.

Components may be added to or removed from the hardware configuration of each device according to the embodiment as needed. In addition to the configuration shown in FIG. 2, the communication terminal 70 has a sound collection device, such as a microphone, for audio input. The communication terminal 70 may also have, for example, input devices such as a keyboard, a mouse, and a touch panel, a speaker, an imaging device such as a camera, and a display device such as an LCD (Liquid Crystal Display).

● Functional configuration ●
Next, the functional configuration of the database server 30 according to the embodiment is described. FIG. 3 is a diagram illustrating an example of the functional configuration of the database server according to the embodiment. The functions implemented by the database server 30 include a transmission/reception unit 31, a sample data acquisition unit 32, a target category information generation unit 33, a feature amount extraction unit 34, a data digitization unit 35, a first learning unit 36, a cluster attribute specifying unit 37, a constraint setting unit 38, a second learning unit 39, an unknown data classification unit 41, a storage/readout unit 42, and a storage unit 3000.

The transmission/reception unit 31 exchanges various data with external devices via the communication network 5. For example, the transmission/reception unit 31 receives the text data 200 to be classified via the WEB service provided by the WEB server 50. The transmission/reception unit 31 is implemented by the network I/F 106 shown in FIG. 2 and a program or the like executed by the CPU 101.

The sample data acquisition unit 32 acquires, as sample data 210, those items of the text data 200 stored in the storage unit 3000 that are labeled with attribute information for specifying the attribute of a specific category. The attribute information is, for example, a positive/negative label indicating whether the item is a positive example or a negative example of the specific category. When the category type is "conversation", for example, the attribute information specifies whether conversation elements are present. Here, conversation includes remarks, questions, responses, dialogue, presentations, and the like. The attribute information is labeled by a user of the database server 30 or by a provider that supplies the data set of text data 200 through the WEB service. The sample data acquisition unit 32 is implemented by the network I/F 106 shown in FIG. 2 and a program or the like executed by the CPU 101. The sample data acquisition unit 32 is an example of acquisition means.

The target category information generation unit 33 generates the target category setting information 310 included in the category management table 300 described later. The target category setting information 310 is information for specifying the characteristics of a category that is the target of the data classification process. The target category information generation unit 33 is implemented by a program or the like executed by the CPU 101 shown in FIG. 2.

The feature amount extraction unit 34 extracts the feature amounts of the sample data 210 acquired by the sample data acquisition unit 32. A feature amount is, for example, the importance of a word contained in the text data 200 with respect to the category to be classified. In this case, the feature amount extraction unit 34 extracts words from the text information contained in the sample data 210. Details of the processing by the feature amount extraction unit 34 are described later (see FIG. 6). The feature amount extraction unit 34 is implemented by a program or the like executed by the CPU 101 shown in FIG. 2. The feature amount extraction unit 34 is an example of feature amount extraction means.

The data digitization unit 35 converts the feature amounts of the text data 200 into numerical form. Specifically, it converts the feature amounts of the sample data 210 extracted by the feature amount extraction unit 34 into feature vectors. The feature amount extraction process is also executed on text data 200 that is not sample data 210, in the same way as on the sample data 210. The data digitization unit 35 is implemented by a program or the like executed by the CPU 101 shown in FIG. 2.
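As a minimal sketch of this conversion, the following assumes that word importance is measured by TF-IDF and that each text has already been split into words; the text above says only that the feature amount is "for example" the importance of a word, so the weighting scheme and the names `build_vocabulary` and `tfidf_vectors` are illustrative assumptions.

```python
import math
from collections import Counter

def build_vocabulary(docs):
    """Return a sorted list of every distinct word across the tokenized documents."""
    return sorted({word for doc in docs for word in doc})

def tfidf_vectors(docs):
    """Convert tokenized documents into TF-IDF feature vectors (one list of floats each)."""
    vocab = build_vocabulary(docs)
    n_docs = len(docs)
    # Document frequency: the number of documents in which each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for an empty document
        vectors.append(
            [(counts[w] / total) * math.log(n_docs / df[w]) for w in vocab]
        )
    return vocab, vectors

# Two already-tokenized sample texts (hypothetical data).
docs = [["q", "price", "a", "yen"], ["report", "price", "summary"]]
vocab, vectors = tfidf_vectors(docs)
```

A word that occurs in every document, such as "price" here, receives a weight of zero, so it does not influence the later clustering.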

The first learning unit 36 generates a first learning model based on unsupervised learning using the feature amounts of the sample data 210 extracted by the feature amount extraction unit 34. Unsupervised learning is a technique that classifies given data without external criteria such as labels. One example of unsupervised learning is K-means clustering. The first learning unit 36 is implemented by a program or the like executed by the CPU 101 shown in FIG. 2. The first learning unit 36 is an example of first generation means.
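K-means clustering, named above as one example of unsupervised learning, can be sketched as follows. This is a plain textbook version under stated assumptions, not the embodiment's implementation: initial centroids are simply the first k points so that the run is deterministic, and the function names are illustrative.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def kmeans(points, k, iterations=20):
    """Plain K-means; initial centroids are the first k points (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean(members)
    return centroids, assignments

# Four hypothetical 2-D feature vectors forming two obvious groups.
points = [[0.0, 0.0], [5.0, 5.0], [0.0, 1.0], [5.0, 6.0]]
centroids, assignments = kmeans(points, k=2)
```

In this sketch the resulting centroids and assignments play the role of the first learning model; no labels are consulted at this stage.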

The cluster attribute specifying unit 37 specifies the attribute of each cluster included in the first learning model learned by the first learning unit 36. For example, the cluster attribute specifying unit 37 specifies whether a cluster included in the first learning model is a set having the positive-example attribute or the negative-example attribute of the category to be classified. The cluster attribute specifying unit 37 is implemented by a program or the like executed by the CPU 101 shown in FIG. 2. The cluster attribute specifying unit 37 is an example of cluster attribute specifying means.
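One simple way to attach an attribute to each cluster is a majority vote over the positive/negative labels of its members. The text above states only that the attribute is specified from the labels, so the majority rule and the name `cluster_attributes` are assumptions made for this sketch.

```python
from collections import Counter, defaultdict

def cluster_attributes(assignments, labels):
    """Map each cluster id to the majority label of the samples assigned to it."""
    votes = defaultdict(Counter)
    for cluster_id, label in zip(assignments, labels):
        votes[cluster_id][label] += 1
    # most_common(1) returns the label with the highest count in the cluster;
    # on a tie it returns the label that was encountered first.
    return {cid: counter.most_common(1)[0][0] for cid, counter in votes.items()}

# Hypothetical first-stage cluster ids and the labels on the same samples.
assignments = [0, 0, 0, 1, 1]
labels = ["positive", "positive", "negative", "negative", "negative"]
attributes = cluster_attributes(assignments, labels)
```

Here cluster 0 is treated as a positive-example cluster even though one of its members carries a negative label; such disagreeing members are what a later constraint step can act on.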

The constraint setting unit 38 sets the constraints used in the data classification process. Based on the first learning model generated by the first learning unit 36 and the attribute information labeled on the sample data 210 belonging to the clusters included in the first learning model, the constraint setting unit 38 sets constraints for classifying the text data 200 used for natural language processing. The specific processing of the constraint setting unit 38 is described later (see FIG. 7). The constraint setting unit 38 is implemented by a program or the like executed by the CPU 101 shown in FIG. 2. The constraint setting unit 38 is an example of constraint setting means.
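The detailed constraint procedure is deferred to FIG. 7 and is not reproduced here, so the following is only a hedged illustration of how constraints could be derived from cluster attributes and sample labels. The rule used (pairwise must-link and cannot-link constraints within each first-stage cluster) and the name `build_constraints` are assumptions, not the embodiment's method.

```python
def build_constraints(assignments, labels, attributes):
    """Return (must_link, cannot_link) lists of sample-index pairs.

    Assumed rule: within a first-stage cluster, a sample whose label disagrees
    with the cluster attribute must be separated from members that agree with
    it, and must stay together with other disagreeing members.
    """
    must_link, cannot_link = [], []
    n = len(assignments)
    for i in range(n):
        for j in range(i + 1, n):
            if assignments[i] != assignments[j]:
                continue  # only pairs inside the same first-stage cluster
            attr = attributes[assignments[i]]
            i_wrong = labels[i] != attr
            j_wrong = labels[j] != attr
            if i_wrong and j_wrong:
                must_link.append((i, j))
            elif i_wrong != j_wrong:
                cannot_link.append((i, j))
    return must_link, cannot_link

# Hypothetical inputs: cluster 0 was judged "positive" but holds two negatives.
assignments = [0, 0, 0, 0, 1]
labels = ["positive", "positive", "negative", "negative", "negative"]
attributes = {0: "positive", 1: "negative"}
must_link, cannot_link = build_constraints(assignments, labels, attributes)
```

Restricting the pairs to samples that the first model placed in a cluster of the opposite attribute keeps the number of constraints small, which suits a setting where only a little labeled data is available.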

The second learning unit 39 generates a second learning model based on semi-supervised learning using the constraints set by the constraint setting unit 38. Semi-supervised learning is a technique that classifies data using both labeled and unlabeled data. One example of semi-supervised learning is COP K-means clustering. The second learning unit 39 is implemented by a program or the like executed by the CPU 101 shown in FIG. 2. The second learning unit 39 is an example of second generation means.
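COP K-means, named above as one example, extends K-means so that each point is assigned to the nearest cluster that does not break a must-link or cannot-link constraint. The sketch below is a simplified version under stated assumptions: it does not compute the transitive closure of the constraints, it initializes centroids from the first k points, and its function names are illustrative.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def violates(i, cluster, assigned, must_link, cannot_link):
    """True if placing point i in `cluster` conflicts with an already assigned point."""
    for a, b in must_link:
        other = b if a == i else a if b == i else None
        if other is not None and other in assigned and assigned[other] != cluster:
            return True
    for a, b in cannot_link:
        other = b if a == i else a if b == i else None
        if other is not None and other in assigned and assigned[other] == cluster:
            return True
    return False

def cop_kmeans(points, k, must_link, cannot_link, iterations=20):
    """Simplified COP K-means; raises ValueError if a point has no feasible cluster."""
    centroids = [list(p) for p in points[:k]]  # assumed deterministic initialization
    assigned = {}
    for _ in range(iterations):
        assigned = {}
        for i, p in enumerate(points):
            # Try clusters from nearest to farthest; take the first feasible one.
            order = sorted(range(k), key=lambda c: squared_distance(p, centroids[c]))
            choice = next(
                (c for c in order
                 if not violates(i, c, assigned, must_link, cannot_link)),
                None,
            )
            if choice is None:
                raise ValueError(f"no feasible cluster for point {i}")
            assigned[i] = choice
        for c in range(k):
            members = [points[i] for i, a in assigned.items() if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, [assigned[i] for i in range(len(points))]

# Hypothetical data: point 3 lies near point 0 but is constrained away from it.
points = [[0.0, 0.0], [5.0, 5.0], [0.0, 1.0], [1.0, 0.0]]
centroids, assignments = cop_kmeans(
    points, k=2, must_link=[(1, 3)], cannot_link=[(0, 3)]
)
```

Without the constraints, point 3 would join the cluster around the origin; with them it is pulled into the same cluster as point 1, which is how a small amount of labeled data can reshape the clusters.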

The unknown data classification unit 41 classifies unknown data into the clusters included in the second learning model generated by the second learning unit 39. Unknown data is text data 200 that has not been labeled with attribute information. The unknown data classification unit 41 is implemented by a program or the like executed by the CPU 101 shown in FIG. 2. The unknown data classification unit 41 is an example of classification means.
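A minimal sketch of this step, assuming the second learning model is represented by cluster centroids and that an unlabeled feature vector is assigned to the nearest one; the nearest-centroid rule and the name `classify_unknown` are assumptions for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify_unknown(vector, centroids, attributes):
    """Return (cluster_id, attribute) for the centroid nearest to an unlabeled vector."""
    cluster_id = min(
        range(len(centroids)),
        key=lambda c: squared_distance(vector, centroids[c]),
    )
    return cluster_id, attributes[cluster_id]

# Hypothetical second-stage centroids and the attribute attached to each cluster.
centroids = [[0.0, 0.5], [5.0, 5.5]]
attributes = {0: "negative", 1: "positive"}
result = classify_unknown([4.0, 6.0], centroids, attributes)
```

The unlabeled vector inherits the attribute of the cluster it falls into, so it is classified without ever being labeled by hand.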

The storage/readout unit 42 stores various data in the storage unit 3000 and reads various data from the storage unit 3000. The storage/readout unit 42 is implemented by a program or the like executed by the CPU 101 shown in FIG. 2. The storage unit 3000 is implemented by the ROM 102 or the storage 104 shown in FIG. 2. The storage unit 3000 stores the category management table 300 and a plurality of items of text data 200.

● Category management table
The data stored in the storage unit 3000 is now described in detail. FIG. 4 is a diagram illustrating an example of the category management table according to the embodiment. The category management table 300 shown in FIG. 4 manages, for each category to be classified through natural language processing, the setting information for specifying that category.

The category management table 300 stores and manages, in association with one another, a category identification number that identifies a category to be classified, a category name, and target category setting information 310, which is information that characterizes the category to be classified.

In the category management table 300 shown in FIG. 4, the target category setting information 310 associated with the category identification number "1" and the category name "conversation" includes "Q:", "A:", "C:", "?", "⇒", "→", "<person name>", "[person name]", "(person name)", and the like. For example, the target category setting information 310 contains information that characterizes conversation in the form of patterns such as "?", "⇒", "→", or ":" immediately after a full-width or half-width character, and "?", "Q", "A", "C", "<person name>", "[person name]", or "(person name)" immediately before a full-width or half-width character. The target category setting information 310 can be added to or changed as appropriate through processing by the target category information generation unit 33.
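The table row for "conversation" can be pictured as a small data structure of patterns. The sketch below is an assumption-laden illustration: the dictionary layout, the regular expressions, and the name `has_category_marker` are hypothetical, and a speaker tag is simplified to any short bracketed text at the start of a line rather than a verified person name.

```python
import re

# Hypothetical in-memory form of one row of the category management table.
CATEGORY_TABLE = {
    1: {
        "name": "conversation",
        "patterns": [
            r"^\s*[QAC]\s*[:：]",  # "Q:", "A:", "C:" at the start of a line
            r"[?？]",              # question mark, half-width or full-width
            r"[⇒→]",              # arrows used to mark a reply
            # Bracketed speaker tag at the start of a line (simplified).
            r"^\s*[<＜\[［(（][^>＞\]］)）]{1,20}[>＞\]］)）]",
        ],
    }
}

def has_category_marker(text, category_id, table=CATEGORY_TABLE):
    """True if any pattern registered for the category matches somewhere in the text."""
    patterns = table[category_id]["patterns"]
    return any(re.search(p, text, flags=re.MULTILINE) for p in patterns)

# Hypothetical texts: the first two contain conversation markers, the third does not.
qa_text = "Q: What is the price\nA: 100 yen"
tagged_text = "[Tanaka] I agree with the proposal."
plain_text = "Quarterly sales rose by ten percent."
```

Keeping the patterns in a table keyed by category identification number mirrors the description above, in which setting information can be added or changed per category without touching the matching logic.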

●Text data
Next, the contents of the text data 200 stored in the storage unit 3000 are described. FIG. 5 is a diagram illustrating an example of text data according to the embodiment. The text data 200 illustrated in FIG. 5 is data that includes text information and is the data to be classified by the data classification method according to the present embodiment. The text data 200 is acquired, for example, through a WEB service provided by the WEB server 50, in the HTML (HyperText Markup Language) format that makes up a WEB page.

The text data 200 illustrated in FIG. 5 includes text information written in a question-and-answer format. Because it contains the target category setting information 310 associated with the category “conversation” (see FIG. 4), the text data 200 illustrated in FIG. 5 is data having the attribute “conversation present” (a positive example). In contrast, text data 200 that does not contain the target category setting information 310 associated with the category “conversation” is data having the attribute “no conversation” (a negative example).

The text data 200 is, for example, data collected as a data set that may contain conversation elements. Specifically, data such as the minutes of meetings of companies and public institutions, SNS (Social Networking Service) posts, product reviews, television subtitles, and novels are likely to contain conversation elements. In particular, data published on the WEB can be added to the data set automatically by crawling, web scraping, or the like, so that enough data can be collected to raise the classification accuracy.

●Feature amount extraction unit
Next, the detailed functional configuration of the feature amount extraction unit 34 is described with reference to FIG. 6. FIG. 6 is a diagram illustrating an example of the functional configuration of the feature amount extraction unit according to the embodiment. The feature amount extraction unit 34 illustrated in FIG. 6 includes a target category information extraction unit 341, a morphological analysis unit 342, and a feature amount determination unit 343.

The target category information extraction unit 341 extracts target category information from the text information included in the sample data 210 acquired by the sample data acquisition unit 32. Specifically, it extracts, as target category information, the text in the sample data 210 that matches the target category setting information 310 included in the category management table 300. The target category information extraction unit 341 is an example of category information extraction means.

The morphological analysis unit 342 performs morphological analysis on the text information included in the sample data 210 acquired by the sample data acquisition unit 32. From that text information, it acquires items having a part of speech such as noun, verb, or adjective as word feature amounts. The morphological analysis unit 342 is an example of morphological analysis means.

The feature amount determination unit 343 determines the feature amounts of the sample data 210 based on the target category information extracted by the target category information extraction unit 341 and the analysis result of the morphological analysis unit 342. The feature amount determination unit 343 is an example of feature amount determination means.

●Constraint setting unit
Next, the detailed functional configuration of the constraint setting unit 38 is described with reference to FIG. 7. FIG. 7 is a diagram illustrating an example of the functional configuration of the constraint setting unit according to the embodiment. The constraint setting unit 38 illustrated in FIG. 7 includes an incorrect answer vector generation unit 381 and a data link generation unit 382.

For a cluster included in the first learning model generated by the first learning unit 36, the incorrect answer vector generation unit 381 generates an incorrect answer vector N_mis that indicates the region that may contain sample data 210 having an attribute different from the attribute of that cluster.

The data link generation unit 382 generates data links for the sample data 210 belonging to the region indicated by the incorrect answer vector N_mis generated by the incorrect answer vector generation unit 381. A data link is a constraint for moving sample data 210 that belongs to a cluster whose attribute differs from the attribute indicated by its labeled attribute information into the cluster having the correct attribute. In other words, among the sample data 210 belonging to the clusters included in the first learning model, the data link generation unit 382 generates constraints for the sample data 210 labeled with attribute information indicating an attribute different from that of its cluster.

●Data classification processing●
Next, the data classification processing for the text data 200 stored in the database server 30 is described. FIG. 8 is a flowchart illustrating an example of the data classification processing in the database server according to the embodiment. The following describes processing for classifying the text data 200 according to whether it contains a conversation element.

In step S11, the sample data acquisition unit 32 extracts, from the text data 200 stored in the storage unit 3000, the sample data 210 labeled with attribute information. Specifically, the storage/readout unit 42 reads predetermined data from the text data 200 stored in the storage unit 3000. Next, the sample data acquisition unit 32 labels the read text data 200 with attribute information based on the text information it contains. Here, the attribute information is information (a positive/negative label) indicating the attribute (positive example or negative example) of the text data 200 in the category to be classified; for example, it is information identifying whether a conversation element is present. The sample data acquisition unit 32 then acquires the text data 200 labeled with attribute information as the sample data 210.

In step S12, the feature amount extraction unit 34 extracts the feature amounts of the sample data 210 extracted in step S11, based on the content of the text information included in the sample data 210.

The feature amount extraction processing by the database server 30 is described here with reference to FIG. 9. FIG. 9 is a flowchart for describing the feature amount extraction processing in the database server according to the embodiment. The feature amount extraction processing illustrated in FIG. 9 is an example of pattern matching that uses target category information indicating conversation elements in the text information included in the text data 200.

In step S121, the storage/readout unit 42 reads the target category setting information 310 stored in the storage unit 3000. Specifically, it reads, from the category management table 300 stored in the storage unit 3000, the target category setting information 310 (see FIG. 4) associated with the category name “conversation”.

In step S122, the target category information extraction unit 341 extracts, as target category information, the text information in the sample data 210 that corresponds to the target category setting information 310 read in step S121.

In step S123, the morphological analysis unit 342 performs morphological analysis on the text information included in the sample data 210 acquired in step S11 of FIG. 8. Specifically, the morphological analysis unit 342 first extracts the text information included in the sample data 210. It then acquires, from the extracted text information, items having a part of speech such as noun, verb, or adjective as feature amount candidates.

In step S124, the feature amount determination unit 343 determines the feature amounts of the sample data 210 based on the target category information extracted in step S122 and the analysis result of step S123. As described above, the target category information extracted in step S122 and the analysis result of step S123 are candidates for the feature amounts of the sample data 210 being processed. Specifically, the feature amount determination unit 343 first calculates TF-IDF (Term Frequency-Inverse Document Frequency) values. TF-IDF is one method of evaluating the importance of a word in a document; the value is calculated from two indices, the frequency of appearance of the word (TF) and its rarity (IDF). The TF-IDF values calculated here are assumed to contain information related to whether conversation is present. However, the dimensionality of the calculated TF-IDF values is highly redundant, which may lower the classification accuracy. The feature amount determination unit 343 therefore selects among the feature amounts indicated by the target category information and the TF-IDF values by calculating the chi-square value CHI(t, c) using the chi-square test of (Equation 1).
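The body of (Equation 1) is not reproduced in this text. A form consistent with the symbols defined for it, given here as a reconstruction under that assumption and not as the original equation, is the standard chi-square statistic for a 2×2 contingency table:

```latex
\mathrm{CHI}(t, c) =
  \frac{N \left[ P(t, c)\,P(t', c') - P(t, c')\,P(t', c) \right]^{2}}
       {P(t)\,P(t')\,P(c)\,P(c')}
```

The numerator vanishes when the two products are equal, which matches the independence case described for the test.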

Here, the chi-square test tests how independent two things are; in this case it measures the relationship between “presence or absence of a conversation element (t, t')” and “presence or absence (c, c') of a feature amount candidate indicated by the target category information extracted in step S122 and the analysis result of step S123”. N is the number of variations of “presence or absence of a conversation element” and “presence or absence of a feature amount candidate” (the variations here are t, t', c, and c', so N = 4). Over all the sample data 210, P(t, c) is the probability that the attribute is “conversation element present” and the feature amount candidate is included; P(t', c) is the probability that the attribute is “no conversation element” and the feature amount candidate is included; P(t, c') is the probability that the attribute is “conversation element present” and the feature amount candidate is not included; and P(t', c') is the probability that the attribute is “no conversation element” and the feature amount candidate is not included. Likewise, P(t) is the probability that the attribute is “conversation element present”, P(t') is the probability that the attribute is “no conversation element”, P(c) is the probability that the feature amount candidate is included, and P(c') is the probability that the feature amount candidate is not included. “All the sample data 210” means all the sample data 210 labeled with attribute information that were acquired by the sample data acquisition unit 32.

The chi-square test serves to exclude irrelevant feature amounts so as to avoid misjudgment. For example, when “presence or absence of a feature amount candidate (c, c')” has no relation at all to “presence or absence of a conversation element (t, t')”, P(t, c) × P(t', c') = P(t, c') × P(t', c), and the chi-square value CHI(t, c) is 0. Conversely, when “presence or absence of a feature amount candidate (c, c')” depends strongly on “presence or absence of a conversation element (t, t')”, the chi-square value CHI(t, c) is large. The feature amount determination unit 343 therefore selects and determines, as the feature amounts of the sample data 210, the feature amounts with larger chi-square values CHI(t, c) calculated by the chi-square test. This allows it to narrow the feature amounts down to those relevant to identifying the attribute of the sample data 210 (for example, the presence or absence of a conversation element).
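The behavior described here can be checked numerically. The sketch below assumes the standard 2×2 form CHI(t, c) = N (P(t, c) P(t', c') − P(t, c') P(t', c))² / (P(t) P(t') P(c) P(c')); this form is an assumption, since the body of (Equation 1) is not available in this text. The function name and the boolean-list inputs are likewise illustrative. The default N = 4 follows the description; because N is a constant factor, it does not change the ranking of feature candidates.

```python
def chi_square(labels: list, has_feature: list, n: float = 4.0) -> float:
    """Chi-square score of one feature candidate against the conversation label.

    labels[i]      -- True if sample i is labeled "conversation element present" (t)
    has_feature[i] -- True if sample i contains the feature candidate (c)
    n              -- the multiplier N (assumed default of 4, as in the description)
    """
    total = len(labels)

    def p(t: bool, c: bool) -> float:
        # Joint probability estimated as a relative frequency over all samples.
        hits = sum(1 for lab, feat in zip(labels, has_feature) if lab == t and feat == c)
        return hits / total

    p_tc, p_tnc = p(True, True), p(True, False)
    p_ntc, p_ntnc = p(False, True), p(False, False)
    p_t, p_nt = p_tc + p_tnc, p_ntc + p_ntnc
    p_c, p_nc = p_tc + p_ntc, p_tnc + p_ntnc

    denom = p_t * p_nt * p_c * p_nc
    if denom == 0:
        return 0.0  # degenerate case: the label or the feature never varies
    return n * (p_tc * p_ntnc - p_tnc * p_ntc) ** 2 / denom


labels = [True, True, False, False]
dependent = chi_square(labels, [True, True, False, False])    # feature tracks the label
independent = chi_square(labels, [True, False, True, False])  # feature unrelated to label
print(dependent, independent)
```

A feature that always co-occurs with the label gets the maximum score, and a feature that is evenly spread over both labels gets 0, so keeping the highest-scoring candidates discards the unrelated ones.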

In this way, the feature amount determination unit 343 determines the feature amounts of all the sample data 210 acquired by the sample data acquisition unit 32. By performing the above feature amount extraction processing on the plurality of sample data 210, it generates the data set that serves as the sample for identifying the classification type in the data classification processing.
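For reference, the TF-IDF values computed at the start of step S124 can be sketched as follows. The description does not state which TF-IDF variant is used, so the formula here (term count divided by document length, multiplied by log(N / df)) is an assumption, as are the function name and the pre-tokenized toy documents.

```python
import math
from collections import Counter


def tf_idf(docs: list) -> list:
    """Compute TF-IDF per document for pre-tokenized documents.

    Assumed variant: tf = term count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    result = []
    for doc in docs:
        counts = Counter(doc)
        result.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return result


docs = [
    ["Q:", "price", "?", "A:", "price", "list"],
    ["meeting", "minutes", "price"],
    ["meeting", "agenda"],
]
scores = tf_idf(docs)
print(scores[0])
```

In this toy data the marker "Q:" appears in only one document, so it scores higher in the first document than "price", which is more frequent there but also appears elsewhere.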

Returning to FIG. 8, the description of the data classification processing of the database server 30 continues. In step S13, the storage/readout unit 42 reads the text data 200 stored in the storage unit 3000. Here, the text data 200 includes both the data acquired as the sample data 210 in step S11 and the data not extracted as the sample data 210. That is, the text data 200 includes both data labeled with attribute information (the sample data 210) and data not labeled with attribute information (unknown data).

In step S14, the data digitizing unit 35 performs vectorization (digitization) on the text data 200 read in step S13. Then, in step S15, the database server 30 performs category classification processing using the text data 200 digitized in step S14.

The category classification processing by the database server 30 is described here with reference to FIGS. 10 to 16. FIG. 10 is a flowchart illustrating an example of the category classification processing in the database server according to the embodiment. The processing described below is classification processing that combines unsupervised learning with semi-supervised learning that uses constraints based on the first learning model generated by the unsupervised learning.

First, in step S151, the first learning unit 36 generates the first learning model by unsupervised learning using the feature amounts of the sample data 210 digitized in step S14 of FIG. 8. Specifically, the first learning unit 36 uses K-means clustering, a representative unsupervised learning method, to generate a first learning model that includes two clusters obtained by binary classification (a first cluster and a second cluster). FIG. 11 is a conceptual diagram for describing an example of the first learning model generated by unsupervised learning. As illustrated in FIG. 11, because unsupervised learning does not distinguish the attributes of the sample data 210, it is unknown at this stage whether each cluster in the first learning model is conversation-present (positive example) or no-conversation (negative example).

Next, in step S152, the cluster attribute specifying unit 37 specifies the attribute of each cluster included in the first learning model generated in step S151. As noted above, unsupervised learning alone cannot tell which of the generated clusters contains conversation elements. The cluster attribute specifying unit 37 therefore uses the attribute information labeled on the sample data 210 to specify whether each cluster in the first learning model is a set having the positive-example attribute or the negative-example attribute. For example, it specifies the attribute of each cluster by majority vote over the attributes (positive or negative) of the sample data 210 belonging to that cluster. FIG. 12 is a conceptual diagram for describing an example of clusters in the first learning model whose attributes have been specified. As illustrated in FIG. 12, the sample data 210 belonging to the left cluster contains more conversation-present (positive example) data than no-conversation (negative example) data, so the cluster attribute specifying unit 37 specifies the left cluster as a conversation-present (positive example) cluster. Likewise, the sample data 210 belonging to the right cluster contains more no-conversation (negative example) data than conversation-present (positive example) data, so the cluster attribute specifying unit 37 specifies the right cluster as a no-conversation (negative example) cluster.
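The majority vote in step S152 can be sketched in a few lines. The function name and the label strings are illustrative assumptions; the description does not say how ties are broken, and this sketch simply keeps whichever label `Counter.most_common` returns first.

```python
from collections import Counter


def cluster_attributes(assignments: list, labels: list) -> dict:
    """Map each cluster id to the most common label among its labeled samples."""
    votes = {}
    for cluster_id, label in zip(assignments, labels):
        votes.setdefault(cluster_id, Counter())[label] += 1
    return {cid: counter.most_common(1)[0][0] for cid, counter in votes.items()}


# Cluster id of each labeled sample in the first model, and its labeled attribute.
assignments = [0, 0, 0, 1, 1, 1, 1]
labels = ["positive", "positive", "negative",
          "negative", "negative", "positive", "negative"]
attrs = cluster_attributes(assignments, labels)
print(attrs)
```

Here cluster 0 holds two positive samples and one negative sample, so it becomes the conversation-present cluster, and cluster 1 becomes the no-conversation cluster.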

Next, in step S153, the incorrect answer vector generation unit 381 calculates the incorrect answer vector N_mis using the clusters in the first learning model whose attributes were specified in step S152. For the data belonging to the clusters in the first learning model, the incorrect answer vector N_mis indicates the region that should have been judged correct but could not be predicted. Here, “correct” means that sample data 210 is classified into a cluster having the same attribute as its own. Correct data is sample data 210 classified into a cluster having the same attribute as its own, and incorrect data is sample data 210 classified into a cluster having a different attribute from its own. Because the first learning model is generated by unsupervised learning, its classification accuracy is low. The incorrect answer vector generation unit 381 therefore makes use of the region indicated by the incorrect answer vector N_mis to improve the classification accuracy.

FIG. 13 is a conceptual diagram for describing the incorrect answer vector N_mis. A_corr denotes the correct set of category A, and A_pred denotes the predicted set of category A. Here, a correct set is the set region into which correct data is classified, whereas a predicted set is the set region into which correct data is predicted to be classified. Similarly, B_corr denotes the correct set of category B, and B_pred denotes the predicted set of category B. For example, category A is the conversation-present cluster and category B is the no-conversation cluster. The incorrect answer vector N_mis indicates the region of the correct sets that is not included in the predicted sets. That is, for a cluster included in the first learning model generated by the first learning unit 36, the incorrect answer vector N_mis indicates the region that may contain sample data 210 having an attribute different from the attribute of the cluster. The incorrect answer vector N_mis is calculated using (Equation 2).
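The body of (Equation 2) is not available in this text, so the sketch below only illustrates the verbal definition, "the region of the correct sets that is not included in the predicted sets", using sets of sample identifiers. Reading that definition as a union of two set differences is an assumption, and the function name and sample ids are hypothetical.

```python
def incorrect_region(a_corr: set, a_pred: set, b_corr: set, b_pred: set) -> set:
    """Samples that lie in a correct set but outside the matching predicted set."""
    return (a_corr - a_pred) | (b_corr - b_pred)


a_corr = {1, 2, 3, 4}  # samples labeled "conversation present"
b_corr = {5, 6, 7}     # samples labeled "no conversation"
a_pred = {1, 2, 3, 5}  # samples placed in the conversation-present cluster
b_pred = {4, 6, 7}     # samples placed in the no-conversation cluster

n_mis = incorrect_region(a_corr, a_pred, b_corr, b_pred)
print(sorted(n_mis))
```

In this toy data, samples 4 and 5 are the ones that ended up in a cluster whose attribute differs from their label.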

In step S154, the data link generation unit 382 generates data links for the data belonging to the region indicated by the incorrect answer vector N_mis generated in step S153. The first learning model contains data whose attribute differs from that of the cluster into which it should have been classified, such as no-conversation (negative example) data belonging to the conversation-present (positive example) cluster, or conversation-present (positive example) data belonging to the no-conversation (negative example) cluster. For such data, the data link generation unit 382 calculates Euclidean distances and detects the data at the shortest distance. It then sets, for the detected data, the “must-link” and “cannot-link” data link constraints used in semi-supervised learning.

The processing for generating data links for the data belonging to the region indicated by the incorrect answer vector N_mis is described here. FIG. 14 is a conceptual diagram for describing an example of data links generated for sample data belonging to the region indicated by the incorrect answer vector N_mis. Sample data 210 classified into a cluster with the same attribute as its labeled attribute information is called correct data T, and sample data 210 classified into a cluster with a different attribute from its labeled attribute information is called incorrect data F.

Because correct data T belongs to the cluster it should belong to, the data link generation unit 382 generates a “must-link” between the correct data T and, among the incorrect data F belonging to a different cluster from the correct data T, the data located at the shortest distance. Because incorrect data F belongs to a cluster it should not belong to, the data link generation unit 382 generates a “cannot-link” between the incorrect data F and, among the correct data T belonging to the same cluster, the data located at the shortest distance. Here, “must-link” is the constraint that two sample data 210 belong to the same cluster; it is set between sample data 210 that are labeled with the same attribute information but belong to different clusters in the first learning model. “Cannot-link” is the constraint that two sample data 210 belong to different clusters; it is set between sample data 210 that are labeled with different attribute information but belong to the same cluster in the first learning model.
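One way to read the link generation described above is sketched below. It iterates over the incorrect data F only and pairs each with its nearest correct data T by Euclidean distance; that iteration order, the function name, and the one-dimensional toy points are assumptions made for illustration.

```python
import math


def make_links(points, assignments, labels, cluster_attr):
    """Return (must_link, cannot_link) as lists of index pairs.

    points       -- feature vectors (tuples of numbers)
    assignments  -- cluster id of each point in the first learning model
    labels       -- labeled attribute of each point
    cluster_attr -- attribute specified for each cluster id
    """
    idx = range(len(points))
    correct = [i for i in idx if labels[i] == cluster_attr[assignments[i]]]
    incorrect = [i for i in idx if labels[i] != cluster_attr[assignments[i]]]

    def nearest(i, candidates):
        # Index of the candidate closest to point i, or None if there is none.
        return min(candidates, key=lambda j: math.dist(points[i], points[j]), default=None)

    must_link, cannot_link = [], []
    for f in incorrect:
        # Correct data with the same label as f that sits in a different cluster.
        same_label = [t for t in correct
                      if labels[t] == labels[f] and assignments[t] != assignments[f]]
        # Correct data sharing f's cluster (and therefore labeled differently).
        same_cluster = [t for t in correct if assignments[t] == assignments[f]]
        t = nearest(f, same_label)
        if t is not None:
            must_link.append((t, f))
        t = nearest(f, same_cluster)
        if t is not None:
            cannot_link.append((f, t))
    return must_link, cannot_link


points = [(0, 0), (1, 0), (4, 0), (9, 0), (10, 0), (6, 0)]
assignments = [0, 0, 0, 1, 1, 1]
labels = ["pos", "pos", "neg", "neg", "neg", "pos"]
must, cannot = make_links(points, assignments, labels, {0: "pos", 1: "neg"})
print(must, cannot)
```

Points 2 and 5 are the incorrect data in this example; each receives one must-link to the nearest correctly placed point with the same label and one cannot-link to the nearest correctly placed point in its current cluster.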

In step S155, the second learning unit 39 generates the second learning model by COP K-means clustering, a representative semi-supervised learning method, based on the data links generated in step S154. FIG. 15 is a conceptual diagram for describing an example of the second learning model generated by semi-supervised learning according to the embodiment. As illustrated in FIG. 15, in the second learning model the boundaries of the clusters (the first cluster and the second cluster) in the first learning model illustrated in FIG. 11 have been corrected and have become more complex. The second learning model includes a third cluster, which is the first cluster with its boundary corrected, and a fourth cluster, which is the second cluster with its boundary corrected. In this way, applying the constraint settings generated from unsupervised learning to semi-supervised learning yields a more precise learning model as the second learning model.
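A minimal COP K-means sketch is shown below to make the role of the two link types concrete. It follows the usual constrained K-means procedure (assign each point to the nearest centroid that does not violate a constraint, then recompute centroids); the initial centroids, the toy points, and the link lists are assumed values, and the embodiment's actual parameters are not given in the description.

```python
import math


def violates(i, cluster, assigned, must_link, cannot_link):
    """True if placing point i in cluster breaks a link with an already assigned point."""
    for a, b in must_link:
        other = b if a == i else a if b == i else None
        if other is not None and assigned.get(other, cluster) != cluster:
            return True
    for a, b in cannot_link:
        other = b if a == i else a if b == i else None
        if other is not None and assigned.get(other) == cluster:
            return True
    return False


def cop_kmeans(points, centroids, must_link, cannot_link, max_iter=100):
    """Return (assignments, centroids); raise ValueError if no feasible cluster exists."""
    centroids = [tuple(c) for c in centroids]
    assignments = None
    for _ in range(max_iter):
        assigned = {}
        for i, p in enumerate(points):
            # Try clusters from the nearest centroid to the farthest.
            order = sorted(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for k in order:
                if not violates(i, k, assigned, must_link, cannot_link):
                    assigned[i] = k
                    break
            else:
                raise ValueError(f"no feasible cluster for point {i}")
        new_assignments = [assigned[i] for i in range(len(points))]
        if new_assignments == assignments:
            break  # assignments stopped changing
        assignments = new_assignments
        # Recompute each centroid as the mean of its members (keep it if empty).
        for k in range(len(centroids)):
            members = [points[i] for i, a in enumerate(assignments) if a == k]
            if members:
                centroids[k] = tuple(sum(x) / len(members) for x in zip(*members))
    return assignments, centroids


points = [(0, 0), (1, 0), (4, 0), (9, 0), (10, 0), (6, 0)]
must_link = [(3, 2), (1, 5)]     # pairs that must share a cluster
cannot_link = [(2, 1), (5, 3)]   # pairs that must be separated
result, final_centroids = cop_kmeans(points, [(1.0, 0.0), (9.0, 0.0)], must_link, cannot_link)
print(result)
```

With the constraints applied, points 2 and 5 are each placed in the farther cluster, which is the kind of boundary correction the second learning model is described as having.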

Then, in step S156, the unknown data classification unit 41 classifies the unknown data, which is not labeled with attribute information, into the clusters included in the second learning model generated in step S155. Here, unknown data not labeled with attribute information is the part of the text data 200 that was not acquired as the sample data 210. FIG. 16 is a conceptual diagram for describing an example of unknown data classified into the clusters included in the second learning model according to the embodiment. As illustrated in FIG. 16, even unknown data that was incorrect data in the clusters of the first learning model is classified as correct data in the second learning model. The feature amounts of the unknown data are generated so that the number of dimensions, or the composition and order of the TF-IDF values, match those of the data obtained by the feature amount extraction processing illustrated in FIG. 9.
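The description does not state how an unknown vector is matched to a cluster of the second learning model. A common choice, assumed here purely for illustration, is to give it the attribute of the nearest centroid; the function name, the centroid values, and the attribute labels are hypothetical.

```python
import math


def classify_unknown(vectors, centroids, cluster_attr):
    """Assign each unlabeled vector the attribute of its nearest centroid."""
    result = []
    for v in vectors:
        k = min(range(len(centroids)), key=lambda c: math.dist(v, centroids[c]))
        result.append(cluster_attr[k])
    return result


centroids = [(2.0, 0.0), (8.0, 0.0)]  # assumed centroids of the second model
cluster_attr = {0: "conversation present", 1: "no conversation"}
predicted = classify_unknown([(1, 0), (7, 0)], centroids, cluster_attr)
print(predicted)
```

For this to work, the unknown vectors must use the same dimensions in the same order as the sample data, which is the requirement stated at the end of the paragraph above.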

In this way, the database server 30 uses the first learning model, generated by unsupervised learning on the plurality of sample data 210, to set constraints for classifying the text data 200. The database server 30 then performs semi-supervised learning using the constraints it has set. This allows the database server 30 to improve the classification accuracy for unknown data not labeled with attribute information. In addition, because the data classification processing by the database server 30 does not require labeling all the text data 200 with attribute information, it reduces the labeling load and also prevents the drop in classification accuracy that erroneous labeling would cause.

Pattern matching with regular expressions and unsupervised learning, which have conventionally been used as machine-learning-based data classification methods, require no labeling of the data to be classified but have low classification accuracy. On the other hand, methods using supervised learning, such as a support vector machine (SVM), have high classification accuracy but impose a heavy labeling load when the amount of data is large. The present embodiment therefore uses semi-supervised learning, which combines the characteristics of these methods. Semi-supervised learning can also be performed by applying the correct data T directly, without first performing unsupervised learning. With that approach, however, the number of constraints and the amount of computation grow as the sample data increases. In the present embodiment, "must-link" and "cannot-link" constraints are set only for the elements that became incorrect data F, based on the first learning model generated by unsupervised learning and on the attributes of the sample data 210, so that semi-supervised learning can be performed efficiently and accurately.
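The difference in the number of constraints can be illustrated with a rough count. This sketch assumes that constraints are pairwise and that, in the restricted case, every constraint involves at least one incorrect-data element; both are assumptions for illustration, and the result for the restricted case is an upper bound rather than a figure from the embodiment.

```python
def all_pair_constraints(n):
    """Pairwise constraints when every labeled sample is paired with every other."""
    return n * (n - 1) // 2

def restricted_constraints(n, f):
    """Upper bound on pairs that involve at least one of the f incorrect elements."""
    return all_pair_constraints(n) - all_pair_constraints(n - f)

# Hypothetical sizes: 1000 labeled samples, 50 of them incorrect data F.
full = all_pair_constraints(1000)
restricted = restricted_constraints(1000, 50)
```

Under these assumptions the full pairing grows quadratically with the number of samples, whereas the restricted count grows with the number of incorrect elements.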

For convenience of explanation, the category classification process shown in FIG. 10 has been described as a configuration in which the unknown data is classified after the second learning model has been generated (step S156). However, the unknown data may instead be classified together with the sample data 210 from the stage at which the first learning model is generated in step S151, for example.

● Summary ●
As described above, the database server according to an embodiment of the present invention is a database server 30 (an example of an information processing apparatus) that classifies text data 200 used for natural language processing with respect to a specific category. The database server 30 extracts the feature amounts of sample data 210, which is the part of the text data 200 labeled with attribute information (an example of a positive/negative label) indicating whether the data has the attribute of a positive example or a negative example of the specific category, and generates a first learning model based on unsupervised learning using the extracted feature amounts. Based on the attribute information labeled on the sample data 210, it then specifies whether each cluster included in the first learning model is a set having the attribute of a positive example or of a negative example of the specific category. Further, the database server 30 sets constraints for classifying the text data 200 based on the attributes of the clusters included in the generated first learning model and the attribute information labeled on the sample data 210 belonging to those clusters, and generates a second learning model based on semi-supervised learning using the set constraints. The database server 30 then classifies text data 200 to which no attribute information has been labeled (an example of unknown data) into the clusters included in the generated second learning model. By performing semi-supervised learning with constraints based on the first learning model in this way, the database server 30 can improve the classification accuracy for unknown data to which no attribute information has been labeled.
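As an illustration of specifying a cluster's attribute from the labels of its sample data, the sketch below uses a majority vote. The majority rule, the function names, and the example data are assumptions; the text only states that the attribute is specified based on the labeled attribute information.

```python
from collections import Counter

def cluster_attributes(assignments, labels):
    """Map each cluster id to the most frequent label among its samples."""
    votes = {}
    for cluster, label in zip(assignments, labels):
        votes.setdefault(cluster, Counter())[label] += 1
    # On a tie, most_common keeps the label that was counted first.
    return {c: counter.most_common(1)[0][0] for c, counter in votes.items()}

def incorrect_elements(assignments, labels, attributes):
    """Indices of samples whose label differs from their cluster's attribute."""
    return [i for i, (c, lab) in enumerate(zip(assignments, labels))
            if attributes[c] != lab]

# Hypothetical first-model result: cluster id and label per sample.
assignments = [0, 0, 0, 1, 1, 1]
labels = ["pos", "pos", "neg", "neg", "neg", "pos"]

attributes = cluster_attributes(assignments, labels)
incorrect = incorrect_elements(assignments, labels, attributes)
```

Here cluster 0 is treated as the positive-example set and cluster 1 as the negative-example set, so samples 2 and 5 are the elements that would count as incorrect data.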

Further, for the clusters included in the first learning model generated based on unsupervised learning, the database server according to an embodiment of the present invention sets a constraint (for example, cannot-link) between sample data 210 that belong to the same cluster but are labeled with different attribute information (an example of a positive/negative label), and a constraint (for example, must-link) between sample data 210 that belong to different clusters but are labeled with the same attribute information. In this way, the database server 30 (an example of an information processing apparatus) sets "must-link" and "cannot-link" constraints only for the elements that became incorrect data F, based on the first learning model generated by unsupervised learning and on the attributes of the sample data 210, and can therefore perform semi-supervised learning efficiently and accurately.
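The constraint rule described above can be sketched as follows. Pairing each incorrect-data element with every correctly clustered sample is one possible reading, adopted here as an assumption; the function name and the example data are hypothetical.

```python
def build_constraints(assignments, labels, attributes):
    """Create must-link / cannot-link pairs for incorrect-data elements only."""
    n = len(labels)
    incorrect = {i for i in range(n) if attributes[assignments[i]] != labels[i]}
    must_link, cannot_link = [], []
    for i in sorted(incorrect):
        for j in range(n):
            if j in incorrect:
                continue  # pair only with correctly clustered samples
            same_cluster = assignments[i] == assignments[j]
            same_label = labels[i] == labels[j]
            if same_cluster and not same_label:
                cannot_link.append((i, j))  # same cluster, different label
            elif not same_cluster and same_label:
                must_link.append((i, j))    # different cluster, same label
    return must_link, cannot_link

# Hypothetical first-model result and cluster attributes.
assignments = [0, 0, 0, 1, 1, 1]
labels = ["pos", "pos", "neg", "neg", "neg", "pos"]
attributes = {0: "pos", 1: "neg"}

must_link, cannot_link = build_constraints(assignments, labels, attributes)
```

Samples that already agree with their cluster's attribute generate no constraints of their own, which keeps the constraint set small.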

Furthermore, the database server according to an embodiment of the present invention acquires, from the text data 200, a plurality of sample data 210 labeled with attribute information (an example of a positive/negative label), extracts the feature amounts of the acquired sample data 210, and generates the first learning model based on unsupervised learning using the extracted feature amounts. Because the database server 30 (an example of an information processing apparatus) therefore does not need attribute information to be labeled on all of the text data 200, it reduces the labeling load and also prevents the data classification accuracy from dropping as a result of erroneous labeling.

● Supplement ●
The functions of each embodiment can be realized by a computer-executable program written in a legacy programming language or an object-oriented programming language, such as assembler, C, C++, C#, or Java (registered trademark), and the program for executing the functions of each embodiment can be distributed through a telecommunication line.

The program for executing the functions of each embodiment can also be stored in and distributed on a device-readable recording medium such as a ROM, EEPROM (Electrically Erasable Programmable Read-Only Memory), EPROM (Erasable Programmable Read-Only Memory), flash memory, flexible disk, CD (Compact Disc)-ROM, CD-RW (Re-Writable), DVD-ROM, DVD-RAM, DVD-RW, Blu-ray disc, SD card, or MO (Magneto-Optical disc).

Furthermore, some or all of the functions of each embodiment can be implemented on a programmable device (PD) such as an FPGA (Field Programmable Gate Array), or as an ASIC. To realize the functions of each embodiment on a PD, they can be distributed on a recording medium as circuit configuration data (bitstream data) to be downloaded to the PD, or as data written in HDL (Hardware Description Language), VHDL (Very High Speed Integrated Circuits Hardware Description Language), Verilog-HDL, or the like for generating the circuit configuration data.

The information processing apparatus, data classification method, and program according to an embodiment of the present invention have been described above. However, the present invention is not limited to the embodiment described above and can be modified within a range that a person skilled in the art could conceive of, for example by adding other embodiments or by making changes or deletions. Any such aspect falls within the scope of the present invention as long as it exhibits the functions and effects of the present invention.

DESCRIPTION OF SYMBOLS
1 Conference system
2 Management system
5 Communication network
10 Management server
30 Database server (an example of an information processing apparatus)
32 Sample data acquisition unit (an example of acquisition means)
34 Feature amount extraction unit (an example of feature amount extracting means)
36 First learning unit (an example of first generation means)
37 Cluster attribute specifying unit (an example of cluster attribute specifying means)
38 Constraint setting unit (an example of constraint setting means)
39 Second learning unit (an example of second generation means)
41 Unknown data classification unit (an example of classification means)
50 Web server
70 Communication terminal
200 Text data
210 Sample data
300 Category management table
341 Target category information extraction unit (an example of category information extracting means)
342 Morphological analysis unit (an example of morphological analysis means)
343 Feature amount determination unit (an example of feature amount determining means)
381 Incorrect answer vector generation unit
382 Data link generation unit

Japanese Translation of PCT International Application Publication No. 2017-535007

Claims

An information processing apparatus for classifying text data used for natural language processing with respect to a specific category, the apparatus comprising:
feature amount extracting means for extracting a feature amount of sample data that, among the text data, is labeled with a positive/negative label indicating whether the sample data has the attribute of a positive example or a negative example of the category;
first generation means for generating a first learning model based on unsupervised learning using the extracted feature amount;
cluster attribute specifying means for specifying, based on the positive/negative label labeled on the sample data, whether a cluster included in the generated first learning model is a set having the attribute of the positive example or of the negative example;
constraint setting means for setting a constraint for performing the classification, based on the attribute of the specified cluster and the positive/negative label labeled on the sample data belonging to the cluster;
second generation means for generating a second learning model based on semi-supervised learning using the set constraint; and
classification means for classifying unknown data, which is the text data not labeled with the positive/negative label, into the clusters included in the generated second learning model.

The information processing apparatus according to claim 1, wherein the constraint setting means sets a constraint on sample data that, among the sample data belonging to a cluster included in the first learning model, is labeled with a positive/negative label whose attribute differs from that of the cluster.

The information processing apparatus according to claim 1, wherein, for the clusters included in the first learning model, the constraint setting means sets a constraint between sample data that belong to the same cluster and are labeled with different positive/negative labels, and a constraint between sample data that belong to different clusters and are labeled with the same positive/negative label.

The information processing apparatus according to claim 1 or 2, further comprising
acquisition means for acquiring, from the text data, a plurality of sample data labeled with the positive/negative label, wherein
the feature amount extracting means extracts the feature amounts of the acquired plurality of sample data, and
the first generation means generates the first learning model based on the unsupervised learning using the extracted feature amounts of the plurality of sample data.

The information processing apparatus according to claim 4, wherein
the feature amount extracting means further comprises:
category information extracting means for extracting category information, included in the acquired sample data, for specifying the category;
morphological analysis means for performing morphological analysis on text information included in the acquired sample data; and
feature amount determining means for determining the feature amount of the sample data based on the extracted category information and the result of the analysis by the morphological analysis means.

The information processing apparatus according to claim 1, wherein the clusters included in the first learning model include a first cluster and a second cluster generated based on the unsupervised learning.

The information processing apparatus according to claim 6, wherein the clusters included in the second learning model include a third cluster corresponding to the first cluster and a fourth cluster corresponding to the second cluster, both generated based on the semi-supervised learning.

The information processing apparatus according to claim 1, wherein the unsupervised learning is machine learning based on K-means clustering.

The information processing apparatus according to claim 1, wherein the semi-supervised learning is machine learning based on COP K-means clustering.

The information processing apparatus according to claim 1, wherein the specific category is a category for identifying the presence or absence of a conversation element.

A data classification method performed by an information processing apparatus that classifies text data used for natural language processing with respect to a specific category, the method comprising:
a feature amount extracting step of extracting a feature amount of sample data that, among the text data, is labeled with a positive/negative label indicating whether the sample data has the attribute of a positive example or a negative example of the category;
a first generation step of generating a first learning model based on unsupervised learning using the extracted feature amount;
a cluster attribute specifying step of specifying, based on the positive/negative label labeled on the sample data, whether a cluster included in the generated first learning model is a set having the attribute of the positive example or of the negative example;
a constraint setting step of setting a constraint for performing the classification, based on the attribute of the specified cluster and the positive/negative label labeled on the sample data belonging to the cluster;
a second generation step of generating a second learning model based on semi-supervised learning using the set constraint; and
a classifying step of classifying unknown data, which is the text data not labeled with the positive/negative label, into the clusters included in the generated second learning model.

The data classification method according to claim 11, wherein, in the constraint setting step, a constraint is set on sample data that, among the sample data belonging to a cluster included in the first learning model, is labeled with a positive/negative label whose attribute differs from that of the cluster.

The data classification method according to claim 11, wherein, in the constraint setting step, for the clusters included in the first learning model, a constraint is set between sample data that belong to the same cluster and are labeled with different positive/negative labels, and a constraint is set between sample data that belong to different clusters and are labeled with the same positive/negative label.

A program for causing a computer to execute the method according to any one of claims 11 to 13.