JP2016162257A

JP2016162257A - Information processing device and information processing program

Info

Publication number: JP2016162257A
Application number: JP2015040921A
Authority: JP
Inventors: Shigeyuki Sakaki; Yasuhide Miura; Tomoko Okuma
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2015-03-03
Filing date: 2015-03-03
Publication date: 2016-09-05
Anticipated expiration: 2035-03-03
Also published as: JP6511865B2

Abstract

PROBLEM TO BE SOLVED: To provide an information processing device that creates teacher data with less noise than when teacher data used in machine learning is created from unclassified posting information. SOLUTION: A collection means of the information processing device collects posting information, a classification means classifies the posting information collected by the collection means, and a creation means creates teacher data to be used in machine learning, using the posting information within a set classified by the classification means. SELECTED DRAWING: Figure 1

Description

The present invention relates to an information processing apparatus and an information processing program.

Patent Document 1 addresses the problem of enabling learning that appropriately extracts, from a set of images, local features useful for detecting and recognizing objects of different categories. It discloses that an image input unit inputs a set of object image data of different kinds; a local feature detection unit detects local features having a predetermined geometric structure from each input object image data; the detected local features are grouped by clustering; a feature selection unit selects a plurality of representative local features based on the clustering result; and a learning control unit performs learning for recognizing or detecting an object based on the object image data, using a learning data set that includes the selected representative local features as teacher data.

Patent Document 2 addresses the problem of avoiding overfitting when classifying multi-class images or other feature data that include closely resembling species. It discloses a two-layer classifier consisting of a first-layer classifier that discriminates between groups and a second-layer classifier that discriminates within a group. To determine the groups, a first learning means generates in advance a kind classifier that discriminates by machine learning without any grouping; an identification error counting means then runs an identification test with this kind classifier and counts the number of identification errors between categories; and a grouping processing means groups the categories that machine learning tends to confuse. Closely resembling species are thus sorted automatically from the error counts gathered in advance, two-layer identification (between groups and within groups) is executed, and overfitting is suppressed in the multi-class classifier.

Patent Document 3 addresses the problem of collecting image data that has little bias in content and is effective for learning. It discloses a configuration having a plurality of detectors that detect a target image region from an image; an integration means that integrates the detection results of the detectors and outputs pairs of an image region that is a candidate learning image and a score of the object-likeness of that region; a setting means that sets an adoption rate for learning data; a selection means that selects learning data from the region-score pairs based on the score and the adoption rate set by the setting means; and a storage means that stores the learning data selected by the selection means.

Patent Document 4 addresses the problem of providing a mobile phone usage state identification device that can determine whether a person captured by a camera is using a mobile phone. It discloses detecting a face region from a learning image input from the camera, extracting feature values from the left and right partial regions adjacent to the face region, creating from the extracted feature data a discriminant function that identifies whether a mobile phone is in use, and storing the parameter values of the discriminant function in a discriminant function storage unit. A face region is then detected from an identification target image input from the camera, feature values are extracted from the left and right partial regions adjacent to that face region, an identification unit inputs the extracted feature values to the stored discriminant function to identify whether a mobile phone is in use, and an output unit outputs the identification result.

Non-Patent Document 1 addresses the difficulty of preparing a large amount of low-noise teacher data for machine learning. It discloses collecting teacher data by using emoticons contained in tweets (short text posts) as clues, so that a large amount of teacher data can be collected efficiently without manual effort. For example, the emoticon “:-)” is taken as a clue indicating positive sentiment and the emoticon “:-(” as a clue indicating negative sentiment.
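The emoticon-based collection described for Non-Patent Document 1 can be pictured with a minimal sketch. The function names, the handling of posts that contain both emoticons, and the removal of the emoticon from the stored text are assumptions made for illustration; they are not taken from the cited report or from this application.

```python
from typing import List, Optional, Tuple

# Emoticon clues from the example in the text.
POSITIVE = (":-)",)
NEGATIVE = (":-(",)


def label_by_emoticon(text: str) -> Optional[str]:
    """Return 'positive' or 'negative', or None when the clue is absent or conflicting."""
    has_pos = any(e in text for e in POSITIVE)
    has_neg = any(e in text for e in NEGATIVE)
    if has_pos == has_neg:  # neither clue, or both clues at once
        return None
    return "positive" if has_pos else "negative"


def build_teacher_data(tweets: List[str]) -> List[Tuple[str, str]]:
    """Keep tweets with an unambiguous clue and pair the cleaned text with its label."""
    data = []
    for tweet in tweets:
        label = label_by_emoticon(tweet)
        if label is None:
            continue
        cleaned = tweet
        for emoticon in POSITIVE + NEGATIVE:
            # Assumption: the clue itself is removed so a learner cannot simply memorize it.
            cleaned = cleaned.replace(emoticon, "")
        data.append((cleaned.strip(), label))
    return data
```

Posts without an emoticon are simply skipped, which is why this style of collection needs no manual annotation but also inherits whatever noise the clue carries.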

Patent Document 1: JP 2005-215988 A
Patent Document 2: JP 2013-250809 A
Patent Document 3: JP 2012-190159 A
Patent Document 4: JP 2010-122838 A

Non-Patent Document 1: Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter sentiment classification using distant supervision. Technical report, Stanford University.

An object of the present invention is to provide an information processing apparatus and an information processing program that create teacher data with less noise than when teacher data used for machine learning is created from unclassified post information.

The gist of the present invention for achieving this object lies in the inventions of the following items.
The invention of claim 1 is an information processing apparatus comprising: collection means for collecting post information; classification means for classifying the post information collected by the collection means; and creation means for creating teacher data used for machine learning by using the post information within a set classified by the classification means.

The invention of claim 2 is the information processing apparatus according to claim 1, wherein the classification means classifies the post information collected by the collection means by unsupervised clustering or by using information included in the post information.

The invention of claim 3 is the information processing apparatus according to claim 1, wherein the classification means classifies the post information collected by the collection means by supervised clustering or machine learning, using teacher data created in advance.

The invention of claim 4 is the information processing apparatus according to claim 1, wherein the creation means combines the post information within a set classified by the classification means to create one piece of teacher data used for machine learning.

The invention of claim 5 is an information processing program for causing a computer to function as: collection means for collecting post information; classification means for classifying the post information collected by the collection means; and creation means for creating teacher data used for machine learning by using the post information within a set classified by the classification means.

According to the information processing apparatus of claim 1, teacher data with less noise can be created than when teacher data used for machine learning is created from unclassified post information.

According to the information processing apparatus of claim 2, the collected post information can be classified by unsupervised clustering or by using information included in the post information.

According to the information processing apparatus of claim 3, the collected post information can be classified by supervised clustering or machine learning, using teacher data created in advance.

According to the information processing apparatus of claim 4, the post information within a classified set can be combined to create one piece of teacher data used for machine learning.

According to the information processing program of claim 5, teacher data with less noise can be created than when teacher data used for machine learning is created from unclassified post information.

FIG. 1 is a conceptual module configuration diagram of a configuration example of the first embodiment.
FIG. 2 is an explanatory diagram showing an example of a system configuration using the first embodiment.
FIG. 3 is a flowchart showing a processing example according to the first embodiment.
FIG. 4 is an explanatory diagram showing a processing example according to the first embodiment.
FIG. 5 is an explanatory diagram showing a processing example according to the first embodiment.
FIG. 6 is an explanatory diagram showing a processing example according to the first embodiment.
FIG. 7 is an explanatory diagram showing a processing example according to the first embodiment.
FIG. 8 is an explanatory diagram showing a processing example according to the first embodiment.
FIG. 9 is a conceptual module configuration diagram of a configuration example of the second embodiment.
FIG. 10 is a flowchart showing a processing example according to the second embodiment.
FIG. 11 is a flowchart showing a processing example according to the second embodiment.
FIG. 12 is a block diagram showing a hardware configuration example of a computer that realizes the present embodiment.

Hereinafter, examples of various preferred embodiments for realizing the present invention will be described with reference to the drawings.
FIG. 1 is a conceptual module configuration diagram of a configuration example of the first embodiment.
A module generally refers to a logically separable component such as software (a computer program) or hardware. Accordingly, a module in the present embodiment refers not only to a module in a computer program but also to a module in a hardware configuration. The present embodiment therefore also serves as a description of a computer program for functioning as those modules (a program for causing a computer to execute each procedure, a program for causing a computer to function as each means, or a program for causing a computer to realize each function), a system, and a method. For convenience of description, “store”, “cause to store”, and equivalent wording are used; when the embodiment is a computer program, these expressions mean storing in a storage device or performing control so as to store in a storage device. Modules may correspond to functions one-to-one, but in implementation one module may be configured as one program, a plurality of modules may be configured as one program, or conversely one module may be configured as a plurality of programs. A plurality of modules may be executed by one computer, or one module may be executed by a plurality of computers in a distributed or parallel environment. One module may include another module. Hereinafter, “connection” is used for logical connections (exchange of data, instructions, reference relationships between data, and so on) as well as physical connections. “Predetermined” means determined before the target processing; it covers not only determination before the processing according to the present embodiment starts but also determination after that processing has started, as long as it precedes the target processing, in accordance with the situation or state at that time or up to that time. When there are a plurality of “predetermined values”, they may differ from one another, or two or more of them (including, of course, all of them) may be the same. A statement meaning “if A, do B” is used to mean “determine whether A holds, and do B if it is determined that A holds”, except where the determination of whether A holds is unnecessary.
A system or apparatus includes a configuration in which a plurality of computers, pieces of hardware, apparatuses, and the like are connected by communication means such as a network (including one-to-one communication connections), as well as a configuration realized by a single computer, piece of hardware, apparatus, or the like. “Apparatus” and “system” are used as synonyms. Of course, “system” does not include a mere social “mechanism” (social system) that is an artificial arrangement.
For each process performed by a module, or for each process when a module performs a plurality of processes, the target information is read from a storage device, and after the process is performed the result is written to the storage device. Description of reading from the storage device before processing and writing to the storage device after processing may therefore be omitted. The storage device here may include a hard disk, RAM (Random Access Memory), an external storage medium, a storage device accessed via a communication line, a register in a CPU (Central Processing Unit), and the like.

The information processing apparatus 100 according to the present embodiment creates teacher data used for machine learning. As shown in the example of FIG. 1, it has a preliminary teacher data collection module 110, a preliminary teacher data analysis module 120, a teacher data creation module 130, a collection target data storage module 140, a preliminary teacher data storage module 150, and a teacher data storage module 160.
For example, teacher data used for machine learning is created from post information on an SNS (social networking service) or the like, for purposes such as determining the hobbies of the user who made the posts.
In machine learning whose learning and determination targets are data in which the unit of each piece of teacher data (hereinafter also called the teacher data unit) is a set of texts, collecting teacher data by clue information may fail to yield the number of teacher data items required for learning, because few data items satisfy the teacher data unit in the first place or because part of the data is restricted for security reasons. Moreover, when teacher data consisting of text sets is collected and created by clue information, a large amount of information unrelated to the clue information (noise) may be mixed into the data. In hobby determination for SNS users, for example, noise includes greetings such as “Good morning”. Anyone may write such a post, so it gives no clue for determining hobbies.
The teacher data unit is the amount of data in one piece of teacher data. The divided data unit is the size of the data extracted when data is collected in the present embodiment, and is the size of the preliminary teacher data described later. Preliminary teacher data is data collected in divided data units by clue information.

When collecting data by clue information, the information processing apparatus 100 collects data in units obtained by dividing the teacher data unit (hereinafter also called divided data units), each consisting of a text subset. It analyzes the tendencies of the collected data and combines data with similar tendencies into teacher data units, thereby creating teacher data that resembles the actual data (the data to be processed by a model machine-learned from the teacher data generated by the present embodiment) and contains little noise.

The collection target data storage module 140 is connected to the preliminary teacher data collection module 110. The collection target data storage module 140 stores the post information from which teacher data is derived. Post information here includes posts on an SNS (tweets, blog articles, bulletin board entries, and the like). For example, the post information may be collected from users by the SNS itself, or may be a copy of post information that an SNS has collected.
The preliminary teacher data collection module 110 is connected to the collection target data storage module 140 and the preliminary teacher data storage module 150. The preliminary teacher data collection module 110 collects post information and stores the collected post information in the preliminary teacher data storage module 150.
For example, the preliminary teacher data collection module 110 may collect post information (hereinafter also called preliminary teacher data) based on clue information. Clue information here includes search keywords, attributes of the user (writer) such as a user profile, and forums or communities that provide a place for people with common interests to gather. For example, to collect the post information of people whose hobby is music, the module may search for and collect post information containing the name of a certain singer, “xxxxx”, as a search keyword, or may collect the post information of users whose hobby field in the user (writer) attributes contains “xxxxx”.
Specifically, the preliminary teacher data collection module 110 collects preliminary teacher data in divided data units, each consisting of a text subset, based on the clue information. This makes data that falls short of the teacher data unit usable, so more data can be collected. In addition, because the range extracted by the clue information becomes narrower, information unrelated to the clue information (noise) decreases.
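As a rough illustration of clue-based collection, the sketch below filters posts by a keyword that appears either in the post text or in the author's hobby field. The `Post` record, its field names, and `collect_preliminary` are hypothetical names introduced here; the description does not specify a data model or a matching rule.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Post:
    """Hypothetical record for one piece of post information."""
    user: str
    text: str
    hobbies: List[str] = field(default_factory=list)  # assumed profile attribute


def collect_preliminary(posts: List[Post], keyword: str) -> List[Post]:
    """Return posts whose text, or whose author's hobby field, contains the keyword."""
    return [
        p for p in posts
        if keyword in p.text or any(keyword in h for h in p.hobbies)
    ]
```

Each returned post is one divided data unit here, so a user with only a handful of matching posts still contributes preliminary teacher data.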

The preliminary teacher data storage module 150 is connected to the preliminary teacher data collection module 110 and the preliminary teacher data analysis module 120. The preliminary teacher data storage module 150 stores the preliminary teacher data collected by the preliminary teacher data collection module 110 and passes that preliminary teacher data to the preliminary teacher data analysis module 120.
The preliminary teacher data analysis module 120 is connected to the teacher data creation module 130 and the preliminary teacher data storage module 150. The preliminary teacher data analysis module 120 classifies the post information collected by the preliminary teacher data collection module 110 (the preliminary teacher data stored in the preliminary teacher data storage module 150).
For example, the preliminary teacher data analysis module 120 may classify the post information collected by the preliminary teacher data collection module 110 by unsupervised clustering or by using information included in the post information. In other words, the preliminary teacher data analysis module 120 classifies (analyzes) the collected preliminary teacher data.
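The description does not name a particular clustering algorithm. As one possible illustration, the sketch below uses a greedy single-pass clustering over word sets with Jaccard similarity, chosen only because it fits in a few lines of standard-library Python. Whitespace tokenization and the `threshold` value are assumptions; Japanese posts would need a morphological analyzer instead of `split()`.

```python
from typing import Dict, List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased whitespace tokens (an assumption; real posts need proper tokenization)."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_posts(posts: List[str], threshold: float = 0.2) -> List[List[str]]:
    """Assign each post to its most similar cluster, or start a new cluster."""
    clusters: List[Dict] = []  # each entry: {"vocab": set of tokens, "posts": list of texts}
    for text in posts:
        toks = tokens(text)
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = jaccard(toks, cluster["vocab"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["posts"].append(text)
            best["vocab"] |= toks
        else:
            clusters.append({"vocab": set(toks), "posts": [text]})
    return [c["posts"] for c in clusters]
```

The result depends on input order and on the threshold, so a production system would more likely use an established method such as k-means or hierarchical clustering over a proper text representation.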

The teacher data creation module 130 is connected to the preliminary teacher data analysis module 120 and the teacher data storage module 160. The teacher data creation module 130 creates teacher data used for machine learning by using the post information within a set classified by the preliminary teacher data analysis module 120, and stores the created teacher data in the teacher data storage module 160.
The teacher data creation module 130 may also combine the post information within a set classified by the preliminary teacher data analysis module 120 to create one piece of teacher data used for machine learning. In other words, the teacher data creation module 130 gathers (combines) data with the same tendency and thereby aggregates it into data of the teacher data unit. Data with the same tendency is thus integrated, producing teacher data that resembles actual data. Moreover, because the data is combined after having been collected by clue information, the resulting teacher data is rich in the features characteristic of data with that tendency.
The teacher data storage module 160 is connected to the teacher data creation module 130. The teacher data storage module 160 stores the teacher data created by the teacher data creation module 130.
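The combining step can be sketched as chunking each classified set into groups of a fixed number of posts, each group becoming one piece of teacher data. The name `build_teacher_units`, the dictionary layout, and the choice to drop leftover posts that do not fill a whole unit are assumptions for illustration; the description only says that posts within a set are combined up to the teacher data unit.

```python
from typing import Dict, List


def build_teacher_units(
    clusters: List[List[str]], label: str, unit_size: int
) -> List[Dict]:
    """Combine posts within each cluster into teacher data of `unit_size` posts each."""
    units = []
    for posts in clusters:
        # Step through the cluster in whole units only.
        # Assumption: a trailing remainder smaller than unit_size is discarded.
        for start in range(0, len(posts) - unit_size + 1, unit_size):
            units.append({"label": label, "posts": posts[start:start + unit_size]})
    return units
```

Because posts are never mixed across clusters, each unit reflects a single tendency, which is the property the description relies on to keep the teacher data close to what one real user would write.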

FIG. 2 is an explanatory diagram showing an example of a system configuration using the first embodiment.
The information processing apparatus 100, an SNS providing apparatus 210A, an SNS providing apparatus 210B, and a user terminal 220 are connected to one another via a communication line 290. The communication line 290 may be wireless, wired, or a combination of these, and may be, for example, the Internet or an intranet serving as communication infrastructure. Each SNS providing apparatus 210 provides an SNS service and collects post information from the user terminal 220 and the like. The information processing apparatus 100 collects that post information from the SNS providing apparatuses 210A and 210B and generates teacher data. The functions of the information processing apparatus 100 may also be realized as a cloud service.
Machine learning is then performed using the teacher data stored in the teacher data storage module 160 of the information processing apparatus 100. In the example above, the model generated by this machine learning is used to identify users whose hobby is music from the post information in the SNS providing apparatuses 210A and 210B. Advertisements for products and services aimed at individuals whose hobby is music may then be provided to those users.

FIG. 3 is a flowchart showing a processing example according to the first embodiment.
In step S302, the preliminary teacher data collection module 110 extracts preliminary teacher data from the collection target data storage module 140 using clue information.
In step S304, the preliminary teacher data analysis module 120 performs clustering on the preliminary teacher data. Specifically, it clusters the many preliminary teacher data items collected in step S302 and classifies them into several clusters, each consisting of data with similar tendencies.
In step S306, the teacher data creation module 130 aggregates (combines) the clustered preliminary teacher data into teacher data units. Specifically, it uses the preliminary teacher data contained in each cluster created in step S304 and aggregates it to the amount of data required for teacher data.
In step S308, the teacher data creation module 130 stores the teacher data created in step S306 in the teacher data storage module 160.
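The order of steps S302 to S308 can be summarized as a small driver that takes each stage as a callable. The function name `run_first_embodiment`, its parameters, and the use of a plain list as a stand-in for the teacher data storage module 160 are assumptions for illustration; only the step order comes from the flowchart description.

```python
from typing import Callable, List, Sequence


def run_first_embodiment(
    source: Sequence[str],
    collect: Callable[[Sequence[str]], List[str]],
    cluster: Callable[[List[str]], List[List[str]]],
    aggregate: Callable[[List[List[str]]], List[str]],
    store: List[str],
) -> List[str]:
    """Run the four steps in order and return the teacher data that was stored."""
    preliminary = collect(source)       # S302: extract preliminary teacher data by clue
    clusters = cluster(preliminary)     # S304: group data with similar tendencies
    teacher_data = aggregate(clusters)  # S306: combine each group into teacher data units
    store.extend(teacher_data)          # S308: save to the teacher data store
    return teacher_data


# Usage with deliberately trivial stand-ins for each stage.
saved: List[str] = []
result = run_first_embodiment(
    ["a x", "b", "c x"],
    collect=lambda posts: [p for p in posts if "x" in p],
    cluster=lambda posts: [posts],
    aggregate=lambda groups: [" ".join(g) for g in groups],
    store=saved,
)
```

Passing the stages in as callables keeps the driver independent of how collection, clustering, and aggregation are actually implemented.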

A processing example according to the present embodiment is described below, taking automatic profile determination for SNS users as an example.
This technique automatically estimates profile attributes of an SNS user, such as gender, age group, area of residence, occupation, and hobbies, from the user's posts. Automatic profile determination is realized by creating a machine learner, and its teacher data usually consists of 100 to 200 posts by one SNS user (this number is only an example and may be larger or smaller) to which an annotation has been attached. That is, the teacher data unit is 100 to 200 posts. This is because it is difficult to estimate a user's profile attributes from a single post. In the example shown in FIG. 4, the post information group 420 contains the posts “My car is cool”, “Went to a restaurant”, “Yes! I won the lottery”, “On my way home from work now”, and “@xxxx What are you talking about?”. The label 410 “male” is attached to these posts by one user (post information group 420), but male-specific expressions such as “ore” (俺, a masculine “I”), “yatta ze” (やったぜ), and “omae” (お前, a blunt “you”) appear in some posts and not in others, which shows that many posts per person must be collected to make an accurate determination.

Within profile determination, the processing according to the present embodiment is applied to the process of creating teacher data for the hobby polarity "music".
First, the preliminary teacher data collection module 110 collects preliminary teacher data using the names of singers and the names of musical instruments as clue information. FIG. 5 shows an example of the preliminary teacher data group 500 collected with the singer name "xxxxx" as clue information. The preliminary teacher data group 500 contains, in row 1, "xxxxx-chan is cute" posted by user a; in row 2, "I'm going to the xxxxx concert (^^)" posted by user b; in row 3, "xxxxx has a nice voice" posted by user c; in row 4, "xxxxx has a small face and a great figure" posted by user d; in row 5, "xxxxx sings well" posted by user e; in row 6, "I'm going to xxxxx's Osaka concert next time" posted by user f; in row 7, "I bought xxxxx's new song!" posted by user g; and in row 8, "xxxxx-chan has long eyelashes, doesn't she" posted by user h. In this example, one piece of post information is collected per person, but a plurality of pieces of post information may be collected per person.
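The keyword-based collection described above can be sketched as follows. This is a minimal illustration, not the implementation of the preliminary teacher data collection module 110: the `Post` type, the `collect_preliminary` function, and the plain substring match are all assumptions introduced here, and the sample posts are modeled on FIG. 5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    """One piece of post information (hypothetical structure)."""
    user: str
    text: str


def collect_preliminary(posts, clue_words):
    """Keep the posts whose text contains at least one clue word."""
    return [p for p in posts if any(w in p.text for w in clue_words)]


# A toy stream of posts; only the first two mention the singer name.
stream = [
    Post("user a", "xxxxx-chan is cute"),
    Post("user b", "I'm going to the xxxxx concert (^^)"),
    Post("user z", "I had ramen for lunch"),
]

preliminary = collect_preliminary(stream, ["xxxxx"])
for p in preliminary:
    print(p.user, "->", p.text)
```

A real collector would query an SNS search interface rather than filter an in-memory list; the filter only shows that membership in the preliminary teacher data is decided by the clue information alone, which is why the collected posts mix several aspects.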

For example, the post information in row 1 focuses on the "appearance" of "xxxxx", the post information in row 2 on an "event", and the post information in row 3 on the "song". Thus the post information collected under the condition that it contains "xxxxx" is data with differing tendencies, covering different aspects (angles, properties) of the singer "xxxxx". If such post information is integrated as is, the result is teacher data in which all the tendencies are mixed. The aspects that actual users are interested in are considered to differ from user to user, and a single user is considered to focus on one or a few specific aspects. Therefore, even if this data, which contains the features of all aspects, is aggregated as is into teacher data units, the determination accuracy of the machine learner is expected to be low. The preliminary teacher data analysis module 120 therefore classifies the collected post data into groups having the same aspect or tendency, and the teacher data creation module 130 creates teacher data by aggregating the post data within each class.
As described above, unsupervised clustering, information that the collected data inherently has, or the like is used to group the miscellaneous collected preliminary teacher data by tendency. Unsupervised clustering algorithms include the k-means method with words or characters as features, latent Dirichlet allocation (LDA; D. Blei, A. Ng, M. Jordan, "Latent Dirichlet Allocation", Journal of Machine Learning Research 3, pp. 993-1022, 2003), and the Girvan-Newman method. Information that the preliminary teacher data inherently has and that can be used to classify and determine tendencies includes, for example, the posting time of an SNS post, the place from which it was posted, the presence and type of emoticons, and the presence and type of attached images. The preliminary teacher data analysis module 120 classifies the preliminary teacher data using this inherent information. For example, preliminary teacher data whose posting location is a concert venue may be collected and treated as one class.
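As a rough illustration of the k-means option with words as features, the sketch below clusters a few posts using a small pure-Python k-means over bag-of-words counts. Everything here is an assumption made for illustration: the function names, the whitespace tokenization (Japanese text would need a proper tokenizer), the fixed seed indices, and the sample posts. It is not the analysis performed by the preliminary teacher data analysis module 120.

```python
from collections import Counter


def bag_of_words(text):
    """Word-count features; whitespace tokenization is an assumption."""
    return Counter(text.lower().split())


def sq_dist(a, b):
    """Squared Euclidean distance between two sparse count vectors."""
    return sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in set(a) | set(b))


def mean(vectors):
    """Component-wise mean of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {k: c / len(vectors) for k, c in total.items()}


def kmeans(vectors, seeds, iterations=10):
    """Plain k-means; `seeds` are indices of the initial centroids."""
    centroids = [dict(vectors[i]) for i in seeds]
    labels = []
    for _ in range(iterations):
        # Assignment step: nearest centroid for every vector.
        labels = [
            min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return labels


texts = [
    "xxxxx is cute",
    "xxxxx face is cute",
    "xxxxx concert tomorrow",
    "going to xxxxx concert",
]
labels = kmeans([bag_of_words(t) for t in texts], seeds=[0, 2])
print(labels)
```

With these seeds the two "appearance" posts share one label and the two "event" posts share the other. In practice the number of clusters and the initialization would have to be chosen from the data, and LDA or a graph method such as Girvan-Newman could be substituted at the same point in the flow.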

In the example shown in FIG. 6, post information about the singer "xxxxx" is aggregated by "appearance" (appearance cluster 612), "song" (face cluster 622), and "event" (event cluster 632). That is, the preliminary teacher data in the preliminary teacher data group 610 classified into the appearance cluster 612 is combined to generate one piece of teacher data 614. Similarly, the preliminary teacher data in the preliminary teacher data group 620 classified into the face cluster 622 is combined to generate one piece of teacher data 624, and the preliminary teacher data in the preliminary teacher data group 630 classified into the event cluster 632 is combined to generate one piece of teacher data 634. Grouping post data with the same tendency in this way produces teacher data that covers the relevant features comprehensively and contains little noise. Although user names are included in the preliminary teacher data group 610 in FIG. 6, when one piece of teacher data 614 is generated, the preliminary teacher data ("xxxxx is cute", etc.) is combined without the user names.
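The combining step of FIG. 6 can be sketched as below, assuming the clusters have already been formed. The `build_teacher_data` function, the dictionary layout, and the use of a single space as separator are hypothetical choices made for this illustration; the sample posts are taken from the FIG. 5 example.

```python
def build_teacher_data(clusters, label):
    """Combine the posts of each cluster into one piece of teacher data.

    `clusters` maps a cluster name to a list of (user, text) pairs.
    User names are dropped; only the post texts are joined.
    """
    return [
        {
            "label": label,
            "cluster": name,
            "text": " ".join(text for _user, text in posts),
        }
        for name, posts in clusters.items()
    ]


clusters = {
    "appearance": [
        ("user a", "xxxxx-chan is cute"),
        ("user d", "xxxxx has a small face and a great figure"),
    ],
    "event": [
        ("user b", "I'm going to the xxxxx concert (^^)"),
        ("user f", "I'm going to xxxxx's Osaka concert next time"),
    ],
}

teacher_data = build_teacher_data(clusters, label="music")
for unit in teacher_data:
    print(unit["cluster"], ":", unit["text"])
```

Each cluster yields exactly one teacher data unit carrying the same label ("music"), so the learner sees several coherent examples rather than one mixed example.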

In the example shown in FIG. 6, keyword-based collection was used as the clue information for collecting preliminary teacher data, but SNS communities may also be used. On an SNS, users post about topics they are interested in to the corresponding community. By collecting posts and comments from such a community, post data written by different users about a specific topic can be gathered, just as with keyword-based collection, and used as preliminary teacher data.

The example shown in FIG. 7 is a display example of post information posted to the community 700 for "xxxxx" and of the comments on it. Comments are also treated as post information (preliminary teacher data).
A contributor icon 715 indicating the organizer of the posting area 710 is displayed, and the post information posted by the user indicated by the contributor icon 717 is displayed in the posting area 710. Comments written by other users on the posting area 710 are displayed in the comment areas 722, 724, 726, and 728. Post information posted by the user indicated by the contributor icon 719 is displayed in the posting area 730, and a comment written by another user on the posting area 730 is displayed in the comment area 732.
In this case, the preliminary teacher data collection module 110 collects the post information in the posting area 710, the comment areas 722, 724, 726, and 728, the posting area 730, and the comment area 732 as post information relating to "xxxxx".
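A minimal sketch of this community-based collection follows, in which both posts and comments become preliminary teacher data. The thread structure, the `flatten_threads` function, and the placeholder texts are assumptions for illustration; only the reference numerals of the areas come from FIG. 7.

```python
def flatten_threads(threads):
    """Return every post and every comment as separate post information."""
    collected = []
    for thread in threads:
        collected.append(thread["post"])
        collected.extend(thread.get("comments", []))
    return collected


# Hypothetical contents for the areas shown in FIG. 7.
community_700 = [
    {
        "post": "text of posting area 710",
        "comments": [
            "text of comment area 722",
            "text of comment area 724",
            "text of comment area 726",
            "text of comment area 728",
        ],
    },
    {
        "post": "text of posting area 730",
        "comments": ["text of comment area 732"],
    },
]

preliminary = flatten_threads(community_700)
print(len(preliminary), "pieces of post information collected")
```

Because the community itself plays the role of the clue information, no keyword filter is applied here.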

FIG. 6 showed an example of aggregating posts with the same tendency from the data collected for "xxxxx". As shown in the example of FIG. 8, the aggregation may also include posts collected with other keywords. For example, a plurality (in particular, a large number) of keywords such as the singers "yyyyy" and "zzzzz" may be set, a plurality (in particular, a large number) of pieces of post information may be collected, all of the resulting preliminary teacher data may be analyzed together, and teacher data may be created by aggregating data with the same tendency. The creation of one piece of teacher data is the same as in the example of FIG. 6, except that the group of preliminary teacher data (post information) to be aggregated is different.
When communities are used as the clue information as well, post information from many communities may be analyzed as preliminary teacher data, and teacher data may be created by aggregating data with the same tendency.

Twitter, one of the SNSs, provides hashtags as a mechanism that makes similar post information easier to search for. A hashtag is a label that a user can set freely by prefixing any word or phrase with the "#" symbol. Because popular hashtags are attached to many posts, hashtags may be used as one means of collecting preliminary teacher data, so that a large amount of preliminary teacher data can be collected efficiently.
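Hashtag-based collection can be sketched as follows. The regular expression, the `group_by_hashtag` function, the case-insensitive matching, and the sample posts are assumptions made for this illustration; real hashtag rules on a given SNS may differ from the simple `#` plus word-characters pattern used here.

```python
import re
from collections import defaultdict

# "#" followed by one or more word characters (a simplifying assumption).
HASHTAG = re.compile(r"#(\w+)")


def group_by_hashtag(posts):
    """Map each hashtag (lower-cased) to the posts that carry it."""
    groups = defaultdict(list)
    for text in posts:
        for tag in {t.lower() for t in HASHTAG.findall(text)}:
            groups[tag].append(text)
    return dict(groups)


posts = [
    "Front row tonight! #xxxxx #concert",
    "New single on repeat #Xxxxx",
    "Lunch was great",
]

by_tag = group_by_hashtag(posts)
print(sorted(by_tag))
```

Posts without any hashtag are simply not collected, and a post carrying several hashtags is collected once under each of them.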

FIG. 9 is a conceptual module configuration diagram of a configuration example of the second embodiment.
The information processing apparatus 900 includes a teacher data analysis module 970, a teacher data storage module 980, a preliminary teacher data collection module 110, a preliminary teacher data analysis module 120, a teacher data creation module 130, a collection target data storage module 140, a preliminary teacher data storage module 150, and a teacher data storage module 160. Parts of the same kind as in the first embodiment are given the same reference numerals, and duplicate description is omitted. The second embodiment adds the teacher data analysis module 970 and the teacher data storage module 980 to the first embodiment.
In the second embodiment, an ideal teacher data group created manually or by other means is prepared (teacher data storage module 980), and that data is analyzed by the teacher data analysis module 970. Based on the analysis result, the preliminary teacher data analysis module 120 analyzes the preliminary teacher data. This analysis uses supervised clustering, a classifier built by machine learning, or the like. An example of a supervised clustering algorithm is supervised latent Dirichlet allocation (Partially Dirichlet Allocation, PLDA). Alternatively, a classifier that classifies and determines tendencies by machine learning may be created from the ideal teacher data and used to group data with the same tendency. Algorithms for this include support vector machines, naive Bayes, and AdaBoost. In the second embodiment, teacher data that is closer to actual data is created from the preliminary teacher data.

The teacher data storage module 980 is connected to the teacher data analysis module 970. The teacher data storage module 980 stores the ideal teacher data.
The teacher data analysis module 970 is connected to the teacher data storage module 980 and the preliminary teacher data analysis module 120. The teacher data analysis module 970 analyzes the ideal teacher data in the teacher data storage module 980 and reflects the result, as reference information, in the processing by which the preliminary teacher data analysis module 120 combines data into teacher data units.
The preliminary teacher data analysis module 120 is connected to the teacher data analysis module 970, the teacher data creation module 130, and the preliminary teacher data storage module 150. The preliminary teacher data analysis module 120 classifies the post information collected by the preliminary teacher data collection module 110 by supervised clustering or machine learning, using the teacher data created by the teacher data analysis module 970.
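As one possible reading of the classifier-based variant, the sketch below trains a small multinomial naive Bayes model on aspect-labeled "ideal" examples and uses it to assign an aspect to each collected post. The function names, the Laplace smoothing, the whitespace tokenization, and the training sentences are all assumptions for illustration; the source names naive Bayes only as one candidate algorithm and does not specify how it is applied.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_texts):
    """Fit word counts per label from (label, text) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for label, text in labeled_texts:
        words = text.lower().split()
        word_counts[label].update(words)
        doc_counts[label] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab


def classify(model, text):
    """Return the label with the highest log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in text.lower().split():
            if w in vocab:  # words unseen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical hand-made ("ideal") examples labeled by aspect.
ideal = [
    ("appearance", "she is cute"),
    ("appearance", "small face nice style"),
    ("event", "going to the concert"),
    ("event", "concert in osaka"),
]
model = train_naive_bayes(ideal)

print(classify(model, "xxxxx face is cute"))
print(classify(model, "osaka concert next week"))
```

The predicted aspect then serves as the class within which posts are combined into a teacher data unit, in place of the unsupervised cluster of the first embodiment.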

FIG. 10 is a flowchart illustrating a processing example according to the second embodiment.
In step S1002, teacher data is extracted from the teacher data storage module 980.
In step S1004, teacher data is analyzed.
In step S1006, the analysis result is passed to the preliminary teacher data analysis module 120.

FIG. 11 is a flowchart illustrating a processing example according to the second embodiment.
In step S1102, the preliminary teacher data collection module 110 extracts preliminary teacher data from the collection target data storage module 140.
In step S1104, the preliminary teacher data analysis module 120 performs clustering using the analysis result from the teacher data analysis module 970.
In step S1106, the teacher data creation module 130 aggregates the data into teacher data units.
In step S1108, the teacher data creation module 130 stores the data as teacher data.
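Steps S1102 to S1108 can be read as a four-stage pipeline, sketched below. The `run_pipeline` function, the in-memory list used as storage, and the keyword rule standing in for the analysis result from the teacher data analysis module 970 are hypothetical; any classifier with the same call shape could be passed in instead.

```python
def run_pipeline(collected_posts, classify, label, storage):
    """Extract, classify, aggregate, and store teacher data."""
    # S1102: extract the preliminary teacher data.
    preliminary = list(collected_posts)

    # S1104: classify each post using the supplied analysis result.
    clusters = {}
    for text in preliminary:
        clusters.setdefault(classify(text), []).append(text)

    # S1106: aggregate each class into one teacher data unit.
    units = [
        {"label": label, "aspect": aspect, "text": " ".join(texts)}
        for aspect, texts in clusters.items()
    ]

    # S1108: store the units as teacher data.
    storage.extend(units)
    return units


def toy_classifier(text):
    """Stand-in for the analysis result: a simple keyword rule."""
    return "event" if "concert" in text else "appearance"


teacher_store = []
units = run_pipeline(
    ["xxxxx is cute", "xxxxx concert tonight", "xxxxx has long eyelashes"],
    toy_classifier,
    label="music",
    storage=teacher_store,
)
print([u["aspect"] for u in units])
```

Passing the classifier as a parameter mirrors the separation between the module that analyzes the ideal teacher data and the module that applies the result.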

The hardware configuration of the computer on which the program of the present embodiment is executed is, as illustrated in FIG. 12, that of a general computer, specifically a personal computer, a computer that can serve as a server, or the like. As a specific example, a CPU 1201 is used as the processing unit (arithmetic unit), and a RAM 1202, a ROM 1203, and an HD 1204 are used as storage devices. A hard disk or an SSD (Solid State Drive), for example, may be used as the HD 1204. The computer is composed of: the CPU 1201, which executes programs such as the preliminary teacher data collection module 110, the preliminary teacher data analysis module 120, the teacher data creation module 130, and the teacher data analysis module 970; the RAM 1202, which stores those programs and data; the ROM 1203, which stores a program for starting the computer and the like; the HD 1204, an auxiliary storage device (which may be a flash memory or the like) having the functions of the collection target data storage module 140, the preliminary teacher data storage module 150, the teacher data storage module 160, the teacher data storage module 980, and the like; a receiving device 1206 that accepts data based on user operations on a keyboard, mouse, touch panel, microphone, or the like; an output device 1205 such as a CRT, a liquid crystal display, or a speaker; a communication line interface 1207, such as a network interface card, for connecting to a communication network; and a bus 1208 that connects these components so that they can exchange data. A plurality of such computers may be connected to one another by a network.

Of the embodiments described above, those implemented by a computer program are realized by loading the computer program, which is software, into a system having this hardware configuration, so that the software and the hardware resources cooperate.
The hardware configuration shown in FIG. 12 is one configuration example. The present embodiment is not limited to the configuration shown in FIG. 12 and may be any configuration capable of executing the modules described in the present embodiment. For example, some modules may be implemented in dedicated hardware (for example, an Application Specific Integrated Circuit (ASIC)), some modules may reside in an external system and be connected via a communication line, and a plurality of the systems shown in FIG. 12 may be connected to one another by communication lines and operate cooperatively. In particular, besides a personal computer, the configuration may be incorporated in a portable information communication device (including a mobile phone, smartphone, mobile device, wearable computer, and the like), an information appliance, a robot, a copier, a fax machine, a scanner, a printer, a multifunction machine (an image processing apparatus having two or more of the functions of a scanner, printer, copier, fax machine, and the like), or the like.

The program described above may be provided stored on a recording medium, or may be provided by communication means. In that case, for example, the program described above may be regarded as an invention of a "computer-readable recording medium on which a program is recorded".
The "computer-readable recording medium on which a program is recorded" refers to a computer-readable recording medium on which a program is recorded and which is used for installing, executing, and distributing the program.
Recording media include, for example: digital versatile discs (DVD), such as "DVD-R, DVD-RW, DVD-RAM, etc.", which are standards established by the DVD Forum, and "DVD+R, DVD+RW, etc.", which are standards established by DVD+RW; compact discs (CD), such as read-only memory (CD-ROM), CD recordable (CD-R), and CD rewritable (CD-RW); Blu-ray Disc (Blu-ray (registered trademark) Disc); magneto-optical disk (MO); flexible disk (FD); magnetic tape; hard disk; read-only memory (ROM); electrically erasable and rewritable read-only memory (EEPROM (registered trademark)); flash memory; random access memory (RAM); and SD (Secure Digital) memory card.
The program or a part of it may be recorded on the recording medium for storage, distribution, and the like. It may also be transmitted by communication, using a transmission medium such as a wired network used for a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), the Internet, an intranet, an extranet, or the like, a wireless communication network, or a combination of these, or it may be carried on a carrier wave.
Furthermore, the program may be a part of another program, or may be recorded on a recording medium together with a separate program. It may be divided and recorded on a plurality of recording media. It may be recorded in any form, such as compressed or encrypted, as long as it can be restored.

DESCRIPTION OF SYMBOLS
100 ... Information processing apparatus
110 ... Preliminary teacher data collection module
120 ... Preliminary teacher data analysis module
130 ... Teacher data creation module
140 ... Collection target data storage module
150 ... Preliminary teacher data storage module
160 ... Teacher data storage module
210 ... SNS providing apparatus
220 ... User terminal
290 ... Communication line
900 ... Information processing apparatus
970 ... Teacher data analysis module
980 ... Teacher data storage module

Claims

An information processing apparatus comprising:
collection means for collecting post information;
classification means for classifying the post information collected by the collection means; and
creation means for creating teacher data used for machine learning, using the post information in a set classified by the classification means.

The information processing apparatus according to claim 1, wherein the classification means classifies the post information collected by the collection means by unsupervised clustering or by classification using information contained in the post information.

The information processing apparatus according to claim 1, wherein the classification means classifies the post information collected by the collection means by supervised clustering or machine learning, using teacher data created in advance.

The information processing apparatus according to claim 1, wherein the creation means creates one piece of teacher data used for machine learning by combining the post information in the set classified by the classification means.

An information processing program for causing a computer to function as:
collection means for collecting post information;
classification means for classifying the post information collected by the collection means; and
creation means for creating teacher data used for machine learning, using the post information in a set classified by the classification means.