JP2016006583A

JP2016006583A - Method and system for classifying noise post in social network service

Info

Publication number: JP2016006583A
Application number: JP2014127175A
Authority: JP
Inventors: 新吾堀内; Shingo Horiuchi; 佑輔小林; Yusuke Kobayashi; 正寿西村; Masatoshi Nishimura
Original assignee: NTT Data Corp
Current assignee: NTT Data Group Corp
Priority date: 2014-06-20
Filing date: 2014-06-20
Publication date: 2016-01-14
Anticipated expiration: 2034-06-20
Also published as: JP6306951B2

Abstract

PROBLEM TO BE SOLVED: To provide a method and a system for classifying unknown SNS clients and extracting SNS clients that output unnecessary post data in a social network service. SOLUTION: Based on post data in an SNS, a classification model is generated for classifying, by purpose, the SNS clients that output the post data, and unknown SNS clients are classified with it. Training the classification model allows it to adapt to new types of SNS client that appear in the future.

Description

The present invention relates to a method and system for classifying noise posts in a social network service. Specifically, for post data in a social network service (hereinafter, SNS) such as Twitter (registered trademark), a learning (classification) model is generated for classifying, by purpose, the client application software used to post to the service or to view other people's posts (for example, Twitter for Android (registered trademark)). The invention relates to a method and system that use the classification model to classify unknown SNS clients and, in particular, to extract SNS clients that output unnecessary post data (noise).

Conventionally, post data posted to an SNS has been collected and analyzed for use in marketing and similar purposes. For example, Patent Document 1 discloses an apparatus that expresses the suitability of post data for electronic publication as an evaluation value and analyzes the post data by ranking it according to that value. The evaluation in Patent Document 1 splits the text of a post into words, looks up each word in a predefined score table, computes a score per word, and then computes a total score for the whole text. The per-word score is effectively a weight: by assigning large scores to inappropriate words, post data containing many inappropriate words receives a larger total score. The post data can thus be ranked and its suitability for electronic publication analyzed.

Patent Document 1: JP 2006-268303 A

However, the analysis in Patent Document 1 requires a certain number of words in each post and is therefore not suited to analyzing SNS post data, which contains few words.

In addition, SNS usage has become more complex in recent years, and post data now takes many forms, such as automatically posted boilerplate text and advertisements or campaigns. The definition of noise also changes with the purpose of the analysis. Furthermore, the APIs for building posting applications are public, so anyone can create an SNS client. New kinds of post data and noise therefore appear constantly, and unknown SNS clients keep appearing as well, so some cases are expected to be impossible to identify by referring to existing static data. In this situation, short individual texts such as SNS post data are poorly suited as input for a per-post analysis that tries to identify noise.

Users also tend to use different SNS clients for different purposes, which in effect classifies the clients. For example: category 1, posts written by the user, made through a posting application such as Twitter; category 2, posts triggered by user activity, such as achieving a high score in a game application or finishing a given e-book, made through an activity-recording application; category 3, campaign posts, made through a company's dedicated application; category 4, automatic posts, made through a bot application. In this classification, user intent is strongest in category 1, weakens as the number increases, and is weakest in category 4. Category 2 reveals what the user is thinking through the content and timing of the post, and category 3 gives some indication of the user's preferences. Category 4 posts are made automatically by a robot and can be said to carry no user intent. In other words, the stronger the user intent behind a post, the more important the post is to the user and the more useful it is to a company that uses it for marketing.

For these reasons, there is a need for a method and system that perform the analysis per SNS client rather than per post, and that extract SNS clients that output noise.

The present invention therefore focuses on the classification of SNS clients. To solve the problem described above, it provides a computer program having computer-executable instructions for executing a method of classifying, based on post data in an SNS, the SNS clients that output the post data, the method comprising:
a step of acquiring the post data;
a step of extracting at least an SNS client identifier, an account identifier, and body text from the acquired post data and calculating feature values for each SNS client, wherein the feature values include at least an average compression rate and a total compression rate, the average compression rate is the average, per SNS client identifier, of the compression rates obtained when the body text is concatenated and compressed for each combination of SNS client identifier and account identifier, and the total compression rate is the compression rate obtained when the body text is concatenated and compressed for each SNS client identifier; and
a step of classifying the SNS clients that output the acquired post data by using the feature values as input data to a learning model, wherein the learning model is generated from post data in the SNS with the feature values of a plurality of SNS clients as explanatory variables and the classification of the SNS clients as the objective variable.

In the invention described in the preceding paragraph, the feature values further include an average number of posts, which is the average, per SNS client identifier, of the number of posts counted for each combination of SNS client identifier and account identifier.
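The average number of posts defined above can be sketched as follows. This is a minimal illustration under assumptions: the post layout (dictionaries with `client` and `account` keys) and the function name `average_post_counts` are hypothetical and do not come from the specification.

```python
from collections import Counter, defaultdict
from statistics import mean

def average_post_counts(posts):
    """Return {client: mean number of posts per account} for the given posts."""
    # Count posts for each (client, account) combination.
    per_pair = Counter((p["client"], p["account"]) for p in posts)
    # Group those counts by client and average them.
    per_client = defaultdict(list)
    for (client, _account), count in per_pair.items():
        per_client[client].append(count)
    return {client: mean(counts) for client, counts in per_client.items()}

posts = [
    {"client": "bot_app", "account": "a1"},
    {"client": "bot_app", "account": "a1"},
    {"client": "bot_app", "account": "a1"},
    {"client": "bot_app", "account": "a2"},
    {"client": "web", "account": "a1"},
]
result = average_post_counts(posts)
```

Here `bot_app` has accounts with 3 and 1 posts, so its average is 2, while `web` has a single account with 1 post.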

Furthermore, in the invention described in the preceding paragraph, the extracting further includes extracting a posting date and time from the acquired post data, and the feature values further include a posting interval average standard deviation. This value is obtained by calculating posting intervals from the posting dates and times for each combination of SNS client identifier and account identifier, taking the standard deviation of each set of intervals, and averaging those standard deviations per SNS client identifier.
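A minimal sketch of the posting interval average standard deviation follows. Several details are assumptions made for the example, not statements of the specification: intervals are measured in seconds, the population standard deviation is used, combinations with fewer than two posts are skipped, and the names `mean_interval_stddev` and `posted_at` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev

def mean_interval_stddev(posts):
    """Return {client: mean over accounts of the std. dev. of posting intervals}."""
    times_by_pair = defaultdict(list)
    for p in posts:
        times_by_pair[(p["client"], p["account"])].append(p["posted_at"])

    stddevs_by_client = defaultdict(list)
    for (client, _account), times in times_by_pair.items():
        times.sort()
        # Intervals between consecutive posts, in seconds.
        gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
        if gaps:  # assumption: an account needs at least two posts to contribute
            stddevs_by_client[client].append(pstdev(gaps))
    return {client: mean(s) for client, s in stddevs_by_client.items()}

posts = [
    {"client": "bot_app", "account": "a1", "posted_at": datetime(2014, 6, 20, 9, 0)},
    {"client": "bot_app", "account": "a1", "posted_at": datetime(2014, 6, 20, 10, 0)},
    {"client": "bot_app", "account": "a1", "posted_at": datetime(2014, 6, 20, 11, 0)},
    {"client": "web", "account": "u1", "posted_at": datetime(2014, 6, 20, 9, 0)},
    {"client": "web", "account": "u1", "posted_at": datetime(2014, 6, 20, 9, 10)},
    {"client": "web", "account": "u1", "posted_at": datetime(2014, 6, 20, 9, 40)},
]
result = mean_interval_stddev(posts)
```

A client that posts on a fixed schedule yields a value near zero, while irregular human posting yields a larger one.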

As described above, the present invention can generate, from post data in an SNS, a classification model for classifying by purpose the SNS clients that output that post data, and can classify unknown SNS clients. In particular, SNS clients that output noise can be extracted. When post data is analyzed for a particular purpose, the SNS clients that output noise posts can be excluded from the analysis in bulk, so the post data can be analyzed more efficiently than before. Furthermore, by training the classification model, new types of SNS client that appear in the future can also be handled.

A diagram showing a system configuration according to an embodiment of the present invention.
A diagram showing data stored in the SNS client data storage unit according to an embodiment of the present invention.
A diagram showing data stored in the teacher data storage unit according to an embodiment of the present invention.
A diagram showing a classification model according to an embodiment of the present invention.
A diagram showing data stored in the classification model data storage unit according to an embodiment of the present invention.
A flowchart showing classification model generation processing according to an embodiment of the present invention.
A flowchart showing SNS client classification processing according to an embodiment of the present invention.
A flowchart showing how the learning pattern to be executed is determined in an embodiment of the present invention.

Hereinafter, with reference to the attached drawings, an SNS client classification system and method are described in detail that can generate, from post data in an SNS, a classification model for classifying by purpose the SNS clients that output the post data, and can classify unknown SNS clients.

First, an overview of the system is given. FIG. 1 is a diagram showing a system configuration according to an embodiment of the present invention. In FIG. 1, an SNS post data storage server 100, installed in a data center or the like and managed by an SNS provider or the like, is configured to communicate over the Internet 104 with user terminals 102a, 102b, ..., and 102n (hereinafter collectively "user terminals 102") and with mobile terminals 103a, 103b, ..., and 103n (hereinafter collectively "mobile terminals 103").

A user of an SNS uses the SNS provided by each SNS post data storage server 100 from a user terminal 102 or a mobile terminal 103. Post data posted by each user is sent to the SNS post data storage server 100 and aggregated in its database. Although FIG. 1 shows the SNS post data storage server 100 as a single server, it can also be configured as a distributed system of multiple servers. Moreover, because an SNS post data storage server 100 exists for each SNS client, many SNS post data storage servers 100 are in practice connected to the Internet 104.

An SNS client classification server 101 is also connected to the Internet 104. The post data aggregated on each SNS post data storage server 100 is sent to the SNS client classification server 101, where it is classified per SNS client using the classification model.

Next, the configuration of the SNS client classification server 101 is described in detail. FIG. 1 assumes a single server computer and shows only the necessary functional components.

In the SNS client classification server 101, a RAM 111, an input device 112, an output device 113, a communication control device 114, and a storage device 116 composed of a nonvolatile storage medium (such as ROM or HDD) are connected to a CPU 110 via a system bus 115. The storage device 116 has a program storage area, which stores the software programs that provide the functions of the SNS client classification system, and a data storage area, which stores the data those programs handle. Each means in the program storage area described below is in practice an independent software program or one of its routines or components. Each provides its function by being loaded from the storage device 116 by the CPU 110, expanded into the work area of the RAM 111, and executed in sequence while referring to databases and the like as needed.

Listing only the parts relevant to the present invention, the data storage area of the storage device 116 includes an SNS client data storage unit 130, a teacher data storage unit 131, a classification model data storage unit 132, and a post data storage unit 133. Each is a fixed storage area reserved in the storage device 116.

The SNS client data storage unit 130 stores data about the SNS clients (for example, applications such as Twitter) that output post data. FIG. 2 is a diagram showing data stored in the SNS client data storage unit 130 according to an embodiment of the present invention. The SNS client data in FIG. 2 includes a "client name" giving the name of the SNS client, a "category ID" that classifies the SNS client by purpose, and a "category name" describing that category. The "category ID" is, for example, "1" for a posting application such as Twitter that outputs posts written by the user; "2" for an activity-recording application that outputs posts triggered by user activity, such as achieving a high score in a game application; "3" for a company's dedicated application that outputs campaign posts; and "4" for a bot application that automatically outputs boilerplate text registered in a dictionary at fixed intervals. In this scheme, the smaller the category ID, the stronger the user intent behind the posts the SNS client outputs. Post data with category ID "4", where user intent is weak, can therefore be regarded as noise, and embodiments are envisioned in which, for example, such posts are hidden by filtering. SNS client data is generated by extracting the "client name" contained in the post data. Because the "category ID" and "category name" cannot be determined at extraction time, both fields are empty when the record is created. After the records are created, an administrator or the like manually registers some "category ID" and "category name" values (not necessarily all of them; the minimum needed for learning is enough), and the values obtained by the learning process described later can then be registered as well.
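The category scheme and the noise filtering mentioned above can be sketched as follows. This is an illustration only: the dictionary layout, the client names, the function name `drop_noise_posts`, and the choice of category 4 as the sole noise category are assumptions for the example.

```python
# Hypothetical client table: client name -> category ID (None = not yet labelled).
CLIENT_CATEGORIES = {
    "posting_app": 1,   # posts written by the user
    "activity_app": 2,  # posts triggered by user activity
    "campaign_app": 3,  # company campaign posts
    "bot_app": 4,       # automatic posts
    "new_app": None,    # newly seen client, category still empty
}

def drop_noise_posts(posts, client_categories, noise_categories=frozenset({4})):
    """Keep only posts whose client is not in a noise category.

    Posts from unlabelled clients are kept, because nothing is known about them yet.
    """
    return [
        post for post in posts
        if client_categories.get(post["client"]) not in noise_categories
    ]

posts = [
    {"client": "posting_app", "text": "Lunch with friends today."},
    {"client": "bot_app", "text": "Good morning! Have a nice day."},
    {"client": "new_app", "text": "Hello from a new client."},
]
kept = drop_noise_posts(posts, CLIENT_CATEGORIES)
```

Because the definition of noise depends on the purpose of the analysis, `noise_categories` is left as a parameter rather than fixed.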

The teacher data storage unit 131 stores the teacher data for the classification model. FIG. 3 is a diagram showing data stored in the teacher data storage unit 131 according to an embodiment of the present invention. The teacher data in FIG. 3 includes a "client name", the feature values, and a "category ID". The feature values are the "average compression rate", "total compression rate", "posting interval average standard deviation", and "average number of posts", and they serve as the explanatory variables for generating the classification model. How each feature value is calculated is described later. The "category ID" is the objective variable for these explanatory variables. The classification model data described in the next paragraph is generated from them. As with the category ID of the SNS client data described in the preceding paragraph, the category ID in this data can also be updated with the values obtained by the learning process described later.

The classification model data storage unit 132 stores the learning (classification) model data generated from the teacher data. The classification model of the present invention is a binary classifier that takes the explanatory variables (feature values) as input and quantitatively analyzes whether the objective variable (category ID) can be explained by them (regression analysis). The binary classifier can be implemented with various existing techniques, such as linear regression, decision trees, logistic regression, neural networks, support vector machines (SVM), and perceptrons.

FIG. 5 is a diagram showing data stored in the classification model data storage unit 132 according to an embodiment of the present invention, and FIG. 4 is a diagram showing a classification model according to an embodiment of the present invention. The classification model in FIG. 4 uses a decision tree, and the classification model data in FIG. 5 is the data corresponding to it. The classification model data in FIG. 5 includes a "node ID" that uniquely identifies each tree node, a "parent node ID" giving the node's parent, a "sibling order" giving the order among nodes that share a parent, "content" describing the node, and a "category ID". In this embodiment, the "sibling order" is "1" for the node on the Yes edge of a parent and "2" for the node on the No edge. If a parent has more kinds of edges, numbers 1, 2, 3, ... can be assigned starting from the leftmost node, for example.

To explain the classification model data of FIG. 5 using the classification model of FIG. 4: the root node [average compression rate < 0.78] has "node ID" "0", and because it has no parent, its "parent node ID" and "sibling order" are empty.

Next, the root's child nodes "average number of posts < 5" and "average number of posts < 2.7" have "node ID" "1" and "6", respectively. The "parent node ID" of both is "0", the root node. For "sibling order", the node "average number of posts < 5" on the Yes edge stores "1", and the node "average number of posts < 2.7" on the No edge stores "2".

Similarly, "automatic posting" and "total compression rate <= 0.55", the child nodes of the node "average number of posts < 5", have "node ID" "2" and "3", respectively. The node "automatic posting" is a terminal node (drawn as a double-bordered block in FIG. 4) and indicates that the SNS client being processed is classified as a bot application, category ID "4". Its "category ID" therefore stores "4". The other nodes are structured in the same way. The data in FIGS. 2, 3, and 5 represents one embodiment and does not preclude adding or removing data items.
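The node table described above can be walked to classify a client, as in the following sketch. Only the five nodes named in the text are included; the remaining nodes of the tree are not described here, so the walk returns `None` when it reaches a branch whose children are missing. Encoding the "content" column as a `(feature, operator, threshold)` tuple, the feature key names, and the function name `classify` are assumptions made for the example.

```python
import operator

# Rows: (node_id, parent_node_id, sibling_order, content, category_id).
# sibling_order 1 = child on the Yes edge, 2 = child on the No edge.
ROWS = [
    (0, None, None, ("avg_compression", "<", 0.78), None),  # root
    (1, 0, 1, ("avg_posts", "<", 5), None),
    (6, 0, 2, ("avg_posts", "<", 2.7), None),
    (2, 1, 1, "automatic posting", 4),                      # terminal node
    (3, 1, 2, ("total_compression", "<=", 0.55), None),
]
OPS = {"<": operator.lt, "<=": operator.le}

def classify(rows, features):
    """Walk the node table from the root; return a category ID or None."""
    nodes = {row[0]: row for row in rows}
    children = {(row[1], row[2]): row[0] for row in rows if row[1] is not None}
    current = next(row[0] for row in rows if row[1] is None)  # root has no parent
    while True:
        _node_id, _parent, _order, content, category_id = nodes[current]
        if category_id is not None:
            return category_id  # terminal node reached
        feature, op, threshold = content
        order = 1 if OPS[op](features[feature], threshold) else 2
        next_node = children.get((current, order))
        if next_node is None:
            return None  # subtree not present in this partial table
        current = next_node

bot_like = {"avg_compression": 0.5, "avg_posts": 3, "total_compression": 0.9}
other = {"avg_compression": 0.9, "avg_posts": 3, "total_compression": 0.9}
```

With these inputs, `classify(ROWS, bot_like)` follows Yes, Yes and ends at the "automatic posting" node, while `classify(ROWS, other)` takes the No edge at the root and stops at node 6 because its children are not in the table.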

The post data storage unit 133 stores the post data provided by each SNS post data storage server 100. Post data is described later.

Next, listing only those relevant to the present invention, the software programs stored in the program storage area of the storage device 116 comprise a post data acquisition means 120, an SNS client data generation means 121, a feature value calculation means 122, a teacher data generation means 123, a classification model data generation means 124, and an SNS client classification means 125. These means are executed by the CPU 110.

The post data acquisition means 120 acquires the post data that is provided by each SNS post data storage server 100 and stored in the post data storage unit 133.

The SNS client data generation means 121 extracts the SNS client name from the acquired post data, generates SNS client data based on that name, and stores it in the SNS client data storage unit 130.

The feature value calculation means 122 extracts the SNS client identifier, account identifier, body text, posting date and time, and so on from the acquired post data and calculates each feature value (average compression rate, total compression rate, posting interval average standard deviation, and average number of posts).

The teacher data generation means 123 generates teacher data from the generated SNS client data and the calculated feature values and stores it in the teacher data storage unit 131.

The classification model data generation means 124 generates classification (learning) model data with the feature values of the teacher data as explanatory variables and the category ID as the objective variable, and stores it in the classification model data storage unit 132.

The SNS client classification means 125 classifies an SNS client by feeding the feature values of the SNS client to be classified into the classification model. The SNS client classification means 125 can also write the classification result back to the SNS client data storage unit 130 and the teacher data storage unit 131.

Next, the classification model generation processing of the present invention is described step by step. FIG. 6 is a flowchart showing classification model generation processing according to an embodiment of the present invention. First, in step 101, the post data acquisition means 120 acquires the post data that is provided by each SNS post data storage server 100 and stored in the post data storage unit 133. In one embodiment, acquiring post data means receiving from the SNS post data storage server 100 an electronic file whose source code contains the SNS client name, account name (posting user name), body text, posting date and time, and so on. The SNS client name and account name need only identify the SNS client and the account, respectively, and may be any identifier, including an ID. The acquisition can be narrowed by conditions, for example to data posted in a specific period. In other embodiments, the post data itself has been narrowed down in advance.

Once the post data has been acquired, the SNS client data generation means 121 extracts the SNS client names from it and generates SNS client data (FIG. 2) based on those names (step 102). The generated SNS client data is stored in the SNS client data storage unit 130. At this point the "category ID" and "category name" in the SNS client data are empty, because the clients have not yet been classified. For the subsequent classification (learning) processing, however, "category ID" and "category name" values can be registered manually, either for all of the SNS client data or for the subset of SNS clients whose origin is clearly known. The subsequent classification (learning) then starts from these manually registered category values.
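Step 102 can be sketched as below: one record is created per distinct client name, with the category fields left empty unless a manual label has been supplied. The record layout, the `manual_labels` mapping, the example category name, and the function name `build_client_records` are hypothetical and are used only for illustration.

```python
def build_client_records(posts, manual_labels=None):
    """Create one SNS client record per distinct client name in the posts.

    manual_labels maps a client name to (category_id, category_name); clients
    without a manual label keep empty (None) category fields.
    """
    manual_labels = manual_labels or {}
    records = {}
    for post in posts:
        name = post["client"]
        if name not in records:
            category_id, category_name = manual_labels.get(name, (None, None))
            records[name] = {
                "client_name": name,
                "category_id": category_id,
                "category_name": category_name,
            }
    return records

posts = [
    {"client": "bot_app", "account": "a1", "text": "Good morning!"},
    {"client": "bot_app", "account": "a2", "text": "Good morning!"},
    {"client": "new_app", "account": "u1", "text": "Hello."},
]
records = build_client_records(posts, {"bot_app": (4, "automatic posting")})
```

The manually labelled records serve as the starting point for learning, and the unlabelled ones are the clients left to be classified.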

Next, the feature value calculation means 122 extracts the SNS client name, account name, body text, posting date and time, and so on from the acquired post data and calculates each feature value (average compression rate, total compression rate, posting interval average standard deviation, and average number of posts) (step 103). Each feature value is described in detail below.

The average compression rate is the per-SNS-client average of the compression rates obtained by compressing the post data (body text) for each combination of SNS client and account. When the same account makes identical or similar posts through a given SNS client, the compression rate becomes high (the compressed file becomes small), and when an SNS client has many such accounts, its average compression rate is also high. For example, if the SNS client is a bot application that automatically outputs boilerplate text registered in a dictionary at fixed intervals, the posts of each account tend to be identical or similar, so the compression rate is high. A bot application can post preset content not only at fixed intervals but also at times the user has set in advance, such as a specified date and time. "Fixed intervals" is also not always exact: some bots add a random delay of a few seconds to each posting time before posting automatically. In one embodiment, the compression is performed by concatenating the body text (character strings) of the post data and applying a general-purpose compression algorithm. The average compression rate is therefore calculated by concatenating and compressing the body text of the post data for each combination of SNS client and account, and then averaging the resulting compression rates per SNS client (this is the average compression rate of the SNS client).

In one embodiment, the average compression rate V_1c (c ∈ C), taken over accounts, for each SNS client c included in the SNS client set C can be calculated by the following equation.

Here, ave_{a∈A}() denotes calculating the average of its argument over the accounts a included in the account set A. zip(str) denotes applying compression processing to the character string str. Tw(c, a) is the set of post data posted via SNS client c by account a. Σ_{t∈Tw(c,a)} str(t) denotes the concatenation of the character strings str of all post data t included in the post data set Tw(c, a).
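
As a rough illustration of the calculation described above, the following Python sketch groups post bodies by (SNS client, account), compresses each concatenated string, and averages the per-account rates for each client. The use of `zlib`, the tuple layout of `posts`, and the definition of the rate as 1 - compressed size / original size (chosen so that a higher value means a smaller compressed file, as in the text) are assumptions made for illustration; the text itself only calls for a general compression algorithm.

```python
import zlib
from collections import defaultdict
from statistics import mean


def compression_rate(text: str) -> float:
    """Assumed definition: 1 - compressed/original, so higher = more compressible."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return 1.0 - len(zlib.compress(raw)) / len(raw)


def average_compression_rate(posts):
    """posts: iterable of (client, account, body) tuples (hypothetical layout)."""
    # Collect the body text for each (client, account) combination.
    bodies = defaultdict(list)
    for client, account, body in posts:
        bodies[(client, account)].append(body)

    # Compress each concatenated text, then average the rates per client.
    per_client = defaultdict(list)
    for (client, _account), texts in bodies.items():
        per_client[client].append(compression_rate("".join(texts)))
    return {client: mean(rates) for client, rates in per_client.items()}
```

With this definition, very short texts can yield a negative rate because of compression overhead; a production implementation would need to decide how to treat such accounts.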

The total compression rate is the compression rate obtained when the post data is compressed for each SNS client. The compression rate becomes high when users of the same SNS client make similar posts. The total compression rate is calculated by concatenating and compressing the body content of the post data for each SNS client; the resulting compression rate is the total compression rate for that SNS client.

In one embodiment, the total compression rate V_2c (c ∈ C) for each SNS client c included in the SNS client set C can be calculated by the following equation.

The meaning of each notation is as given in the description of the average compression rate V_1c (c ∈ C).
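
A minimal sketch of the total compression rate follows, under the same assumptions as for the average compression rate: `zlib` stands in for the unspecified general compression algorithm, the rate is taken as 1 - compressed size / original size, and `posts` is a hypothetical list of (client, account, body) tuples. The only difference is that bodies are concatenated per client, ignoring the account.

```python
import zlib
from collections import defaultdict


def total_compression_rate(posts):
    """posts: iterable of (client, account, body) tuples (hypothetical layout)."""
    # Concatenate every body posted through a client, regardless of account.
    bodies = defaultdict(list)
    for client, _account, body in posts:
        bodies[client].append(body)

    rates = {}
    for client, texts in bodies.items():
        raw = "".join(texts).encode("utf-8")
        # Assumed definition: 1 - compressed/original (higher = more compressible).
        rates[client] = 1.0 - len(zlib.compress(raw)) / len(raw) if raw else 0.0
    return rates
```

This feature can be high even when each account posts only once, for example when many accounts post the same campaign text through one client.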

The posting interval average standard deviation is obtained by calculating the posting intervals for each combination of SNS client and account from the posting dates and times in the post data, taking the standard deviation of those intervals for each combination, and averaging the standard deviations for each SNS client. When the SNS client is a bot application that posts automatically at regular intervals, the posting interval is constant, so the intervals vary little and both the standard deviation and its average are small. The standard deviation is used here because some current bot applications, even those that post at regular intervals, strictly speaking add a random delay of about several seconds to each posting time before posting automatically. The calculation proceeds as follows: for each combination of the same SNS client and the same account, the posting dates and times obtained from the post data are arranged in chronological order, the posting intervals are calculated, and their standard deviation is found; the standard deviations are then averaged for each SNS client (this is the posting interval average standard deviation for each SNS client).

In one embodiment, the posting interval average standard deviation V_3c (c ∈ C), taken over accounts, for each SNS client c included in the SNS client set C can be calculated by the following equation.

Here, ave_{a∈A}() denotes calculating the average of its argument over the accounts a included in the account set A. stddev_{t∈Tw(c,a)}() denotes calculating the standard deviation of its argument over the post data t included in the post data set Tw(c, a) for SNS client c and account a. minutediff(t_i, t_j) denotes calculating the difference between the posting times of post data t_i and t_j in minutes.
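
The following Python sketch illustrates one way to compute this feature. It assumes `posts` is a list of (client, account, posted_at) tuples, that intervals are taken between consecutive posts after sorting, and that the population standard deviation (`statistics.pstdev`) is used; the text does not say whether a population or sample standard deviation is intended. The sample data at the end is invented for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev


def interval_stddev_average(posts):
    """posts: iterable of (client, account, posted_at) tuples (hypothetical layout)."""
    times = defaultdict(list)
    for client, account, posted_at in posts:
        times[(client, account)].append(posted_at)

    per_client = defaultdict(list)
    for (client, _account), stamps in times.items():
        stamps.sort()  # arrange posting times in chronological order
        # Differences between consecutive posts, in minutes.
        gaps = [(b - a).total_seconds() / 60.0 for a, b in zip(stamps, stamps[1:])]
        if gaps:  # an account with a single post has no interval
            per_client[client].append(pstdev(gaps))
    return {client: mean(devs) for client, devs in per_client.items()}


# Invented sample data: a bot posting hourly and a person posting irregularly.
start = datetime(2014, 3, 1, 9, 0)
bot_posts = [("BotApp", "bot1", start + timedelta(hours=k)) for k in range(5)]
human_posts = [("PhoneApp", "alice", start + timedelta(minutes=m)) for m in (0, 7, 95, 100, 400)]
result = interval_stddev_average(bot_posts + human_posts)
```

A bot that adds a few seconds of random delay would give a small but non-zero value, which is why the feature is a deviation rather than an exact-interval check.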

The average number of posts is the average, for each SNS client, of the number of posts counted for each combination of SNS client and account. Applications that post for campaigns tend to have a small number of posts. The average number of posts is calculated by counting the posts for each combination of the same SNS client and the same account, and then averaging the counts for each SNS client (this is the average number of posts for each SNS client).

In one embodiment, the average number of posts V_4c (c ∈ C), taken over accounts, for each SNS client c included in the SNS client set C can be calculated by the following equation.

Here, ave_{a∈A}() denotes calculating the average of its argument over the accounts a included in the account set A. n(Tw(c, a)) is the number of elements of the post data set Tw(c, a) for SNS client c and account a.
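
The counting and averaging described above can be sketched in a few lines of Python. The tuple layout of `posts` (client first, account second) is an assumption made for illustration.

```python
from collections import Counter, defaultdict
from statistics import mean


def average_post_count(posts):
    """posts: iterable of tuples starting with (client, account, ...) (hypothetical layout)."""
    # n(Tw(c, a)): number of posts for each (client, account) combination.
    counts = Counter((client, account) for client, account, *_ in posts)

    # Average the per-account counts for each client.
    per_client = defaultdict(list)
    for (client, _account), n in counts.items():
        per_client[client].append(n)
    return {client: mean(ns) for client, ns in per_client.items()}
```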

Returning to the processing flow of FIG. 6, once each feature amount has been calculated in step 103, the teacher data generation unit 123 generates teacher data (FIG. 3) from the SNS client data generated in step 102 and the feature amounts calculated in step 103 (step 104). The generated teacher data is stored in the teacher data storage unit 131. The "client name" and "category ID" fields in FIG. 3 can be populated by copying them from the SNS client data (FIG. 2). In one embodiment, the association between "client name" and "category ID" can also be made by an analyst. The feature amounts (average compression rate, total compression rate, posting interval average standard deviation, and average number of posts) are the values calculated in step 103, used as they are.

Next, the classification model data generation unit 124 generates classification (learning) model data (FIG. 5), using the feature amounts of the teacher data generated in step 104 as explanatory variables and the category ID as the objective variable (step 105). The generated classification model data is stored in the classification model data storage unit 132. The classification model (FIG. 4) represented by the classification model data is a so-called decision tree and is generated using a general decision tree generation algorithm. The details of the classification model data, and the relationship between the classification model data and the classification model, are as described above. Not all of the feature amounts need to be used as explanatory variables when generating the classification model. For example, a learning model can be generated using only the average compression rate and the total compression rate as explanatory variables. After step 105, the process ends.
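
To make the idea of generating a decision tree from teacher data more concrete, the sketch below finds a single best split (one node of a tree) by minimizing weighted Gini impurity. The text only refers to a general decision tree generation algorithm, so the choice of Gini impurity, the midpoint thresholds, and the feature names used in the usage note are assumptions; a full tree would apply this step recursively to each side of the split.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of category labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """rows: list of feature dicts; labels: category IDs in the same order.

    Returns the (feature, threshold) pair that most reduces weighted Gini
    impurity, or None if no split improves on the unsplit data.
    """
    best, best_score = None, gini(labels)
    for feature in rows[0]:
        values = sorted({row[feature] for row in rows})
        # Candidate thresholds are midpoints between adjacent observed values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (feature, threshold), score
    return best
```

For example, given rows with hypothetical keys such as `avg_compression` and `total_compression`, `best_split(rows, labels)` returns the feature and threshold that best separate the labeled categories in that data.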

Next, the SNS client classification process of the present invention is described step by step. FIG. 7 is a flowchart showing the SNS client classification process according to one embodiment of the present invention. First, in step 201, the post data acquisition unit 120 acquires post data relating to the SNS client to be classified (hereinafter referred to as classification target post data) from the post data storage unit 133. The post data is as described for step 101.

Next, once the classification target post data has been acquired, the feature amount calculation unit 122 extracts the SNS client name, account name, body content, posting date and time, and so on from it, and calculates each feature amount (average compression rate, total compression rate, posting interval average standard deviation, and average number of posts) (step 202). The feature amounts are as described for step 103.

Once the feature amounts of the classification target post data have been calculated, the SNS client classification unit 125 classifies the target SNS client by using the calculated feature amounts as input to the classification model generated in step 105. In the classification method, each node of the classification model (other than the terminal nodes) and the edges leaving it are treated as a conditional expression (for example, an if statement or a switch statement); starting from the root node, the feature amounts of the target SNS client are evaluated against each condition in turn until one of the terminal nodes is reached. The category associated with the terminal node reached is the category into which the target SNS client is classified.
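
The traversal described above can be sketched as follows. The nested-dictionary representation of the tree, the feature names, the thresholds, and the category names in `TREE` are all hypothetical values chosen for illustration; they are not taken from FIG. 4 or FIG. 5.

```python
def classify(node, features):
    """Walk from the root to a terminal node and return its category."""
    while "category" not in node:  # non-terminal node: evaluate its condition
        branch = "yes" if features[node["feature"]] >= node["threshold"] else "no"
        node = node[branch]
    return node["category"]


# Hypothetical tree: each non-terminal node is one if-style condition.
TREE = {
    "feature": "avg_compression",
    "threshold": 0.8,
    "yes": {"category": "automatic posting"},
    "no": {
        "feature": "avg_posts",
        "threshold": 2.0,
        "yes": {"category": "person"},
        "no": {"category": "campaign"},
    },
}
```

Calling `classify(TREE, {"avg_compression": 0.95, "avg_posts": 30})` follows the "yes" edge at the root and stops at the first terminal node.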

When the target SNS client has been classified in step 203, the SNS client classification unit 125 can reflect the classification result (category ID and category name) and the feature amounts calculated in step 202 in the SNS client data storage unit 130 and the teacher data generation unit 123 (steps 204 and 205). The classification model can then be updated (trained) by executing the classification model generation process (FIG. 6) again. After step 205, the process ends. If the classification model is to be kept in its initial state without being updated, steps 204 and 205 are not performed and the process ends after step 203.

Whether to update (train) the classification model, and what kind of training to perform, can be decided by selecting a learning pattern according to predetermined rules. FIG. 8 is a flowchart showing a method of determining the learning pattern to execute in one embodiment of the present invention. First, in step 301, it is determined whether the posting period of the post data to be analyzed is fixed to a certain range. If analysis is to be repeated on data from various posting periods, the process proceeds to the No route. If, on the other hand, the posting period is determined to be fixed to a certain range, the process proceeds to the Yes route, and the classification process of FIG. 7 can be executed according to the rules of execution pattern 1 (non-learning pattern) below.

Execution pattern 1 (non-learning pattern)
Once a classification model has been generated, SNS clients are classified using that model, but the model is not trained further. That is, the classification results and feature amounts are not reflected (steps 204 and 205 are not executed), and the classification model generation process (FIG. 6) is executed only the first time.

If, on the other hand, it is determined in step 301 that the posting period of the post data to be analyzed is not fixed to a certain range, the process proceeds to the No route, and it is determined whether a sufficient storage area for past data can be secured (step 302). Using past data when training the classification model enables more accurate and finer-grained classification, but doing so requires a data area in which to keep the past data. The amount of past data to retain can be set based on, for example, the required classification accuracy, and when the past data exceeds the set amount, it can be deleted starting from the oldest.

If it is determined in step 302 that a sufficient storage area for past data cannot be secured, the process proceeds to the No route, and the classification process of FIG. 7 can be executed according to the rules of execution pattern 2 (unclassified SNS client learning pattern) below.

Execution pattern 2 (unclassified SNS client learning pattern)
After SNS clients have been classified using the classification model for a first period, SNS clients are classified for a second period, different from the first, on the condition that at least a predetermined number of unknown SNS clients have appeared in the second period. In the classification for the second period, when the classification target is a known SNS client, the classification model is updated using the feature amounts already stored in the teacher data storage unit 131 from the previous classification. When the classification target is an unknown SNS client, the feature amounts are calculated from the post data stored in the post data storage unit 133. The unknown SNS client is then classified by using its calculated feature amounts as input to the updated classification model.

If, on the other hand, it is determined in step 302 that a sufficient storage area for past data can be secured, the process proceeds to the Yes route, and it is determined whether more accurate classification of the post data is required (step 303). If it is determined in step 303 that more accurate classification is not required, the process proceeds to the No route, and the classification process of FIG. 7 can be executed according to the rules of execution pattern 3 (full learning pattern) below.

Execution pattern 3 (full learning pattern)
After SNS clients have been classified using the classification model for a first period, SNS clients are classified for a second period, different from the first, on the condition that at least a predetermined number of posts have been made in the second period. In the classification for the second period, the feature amounts are calculated from the post data stored in the post data storage unit 133 regardless of whether the SNS client to be classified is known or unknown. The classification model is updated using the feature amounts already stored in the teacher data storage unit 131 from the previous classification, and the known and unknown SNS clients are then classified by using their calculated feature amounts as input to the updated classification model.
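
The decision flow of FIG. 8, as far as steps 301 to 303 are described in this passage, can be summarized by the following sketch. The function and parameter names are hypothetical. The passage does not state which pattern the Yes route of step 303 leads to, so the sketch returns `None` for that case rather than guessing.

```python
from typing import Optional


def select_execution_pattern(
    period_is_fixed: bool,
    storage_is_sufficient: bool,
    needs_higher_accuracy: bool,
) -> Optional[int]:
    """Return the execution pattern number chosen by steps 301 to 303."""
    if period_is_fixed:            # step 301, Yes route
        return 1                   # non-learning pattern
    if not storage_is_sufficient:  # step 302, No route
        return 2                   # unclassified SNS client learning pattern
    if not needs_higher_accuracy:  # step 303, No route
        return 3                   # full learning pattern
    return None                    # step 303, Yes route: not specified in this passage
```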

The method of determining the learning pattern shown in FIG. 8 is merely one embodiment, and the present invention is not limited to these determinations and learning patterns. In addition, by taking past data into account when analyzing post data, or by performing the analysis separately for each predetermined period, SNS clients can be classified in finer detail, as follows.

Execution pattern 4 (learning pattern that takes past data into account)
After an SNS client has been classified using the classification model, if its classification result is the same as its previous classification result, the classification result and the feature amounts are reflected and the classification model generation process is executed again, thereby training the classification model. Because training takes place only when the classification result matches the previous one, fluctuations in the classification results can be absorbed and more accurate classification becomes possible.
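
The condition in execution pattern 4 can be sketched as a small gate in front of the teacher data. The dictionary-based storage and the function name below are hypothetical, and treating a client with no previous result as "not matching" is an assumption, since the text does not cover that case.

```python
def reflect_if_stable(teacher_data, previous_results, client, category, features):
    """Reflect a result into the teacher data only if it matches the previous one.

    teacher_data and previous_results are plain dicts keyed by client name
    (a hypothetical stand-in for the storage units). Returns True when the
    result was reflected, meaning the model generation process should be rerun.
    """
    stable = previous_results.get(client) == category
    if stable:
        teacher_data[client] = {"category": category, **features}
    previous_results[client] = category  # remember the latest result either way
    return stable
```

The caller would rerun the model generation process only when the function returns `True`.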

Execution pattern 5 (period-specific classification pattern)
The SNS client data (FIG. 2) and the teacher data (FIG. 3) are given a notion of period, and classification is performed for each period. That is, even for the same SNS client, the feature amounts are calculated for each period from the post data in that period, and the client is classified into a category separately for each period. For example, the SNS client "Patent" can be classified into the category "person" for the period "March 2014" but into the category "automatic posting" for "April 2014".
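
One way to give the data a notion of period is to key the post data by (SNS client, period) before calculating the feature amounts. The sketch below assumes calendar months as the period, matching the example in the text, and a hypothetical (client, account, body, posted_at) tuple layout; the sample posts are invented.

```python
from collections import defaultdict
from datetime import datetime


def split_by_period(posts):
    """Group posts by (client, "YYYY-MM") so features can be computed per period."""
    groups = defaultdict(list)
    for post in posts:
        client, _account, _body, posted_at = post
        groups[(client, posted_at.strftime("%Y-%m"))].append(post)
    return dict(groups)


# Invented sample data for the client named in the text's example.
posts = [
    ("Patent", "alice", "Lunch was great today", datetime(2014, 3, 5, 12, 30)),
    ("Patent", "alice", "Heading home", datetime(2014, 3, 5, 18, 2)),
    ("Patent", "alice", "Scheduled update", datetime(2014, 4, 1, 9, 0)),
]
by_period = split_by_period(posts)
```

Each group can then be passed through the feature calculation and the classification model independently, so the same client may receive different categories in different periods.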

Claims

A method of classifying, based on post data in a social network service (SNS), SNS clients that output the post data, the method comprising:
obtaining the post data;
extracting at least an SNS client identifier, an account identifier, and body content from the obtained post data, and calculating a feature amount for each SNS client, wherein the feature amount includes at least an average compression rate and a total compression rate, the average compression rate is the average, for each SNS client identifier, of the compression rates obtained when the body content is concatenated and compressed for each combination of SNS client identifier and account identifier, and the total compression rate is the compression rate obtained when the body content is concatenated and compressed for each SNS client identifier; and
classifying the SNS client that outputs the obtained post data by using the feature amount as input data of a learning model, wherein the learning model is generated from post data in the SNS with the feature amounts of a plurality of SNS clients as explanatory variables and the classifications of the SNS clients as objective variables.

The method of claim 1, wherein the feature amount further includes an average number of posts, and the average number of posts is the average, for each SNS client identifier, of the number of posts counted for each combination of SNS client identifier and account identifier.

The method according to claim 2, wherein the extracting further includes extracting a posting date and time from the obtained post data, the feature amount further includes a posting interval average standard deviation, and the posting interval average standard deviation is the average, for each SNS client identifier, of the standard deviations of the posting intervals calculated from the posting dates and times for each combination of SNS client identifier and account identifier.

A computer program having computer-executable instructions for causing a computer to execute a method of classifying, based on post data in a social network service (SNS), SNS clients that output the post data, wherein the computer program causes the computer to:
obtain the post data;
extract at least an SNS client identifier, an account identifier, and body content from the obtained post data, and calculate a feature amount for each SNS client, wherein the feature amount includes at least an average compression rate and a total compression rate, the average compression rate is the average, for each SNS client identifier, of the compression rates obtained when the body content is concatenated and compressed for each combination of SNS client identifier and account identifier, and the total compression rate is the compression rate obtained when the body content is concatenated and compressed for each SNS client identifier; and
classify the SNS client that outputs the obtained post data by using the feature amount as input data of a learning model, wherein the learning model is generated from post data in the SNS with the feature amounts of a plurality of SNS clients as explanatory variables and the classifications of the SNS clients as objective variables.

A server computer that classifies, based on post data in a social network service (SNS), SNS clients that output the post data, the server computer comprising:
means for obtaining the post data;
means for extracting at least an SNS client identifier, an account identifier, and body content from the obtained post data, and calculating a feature amount for each SNS client, wherein the feature amount includes at least an average compression rate and a total compression rate, the average compression rate is the average, for each SNS client identifier, of the compression rates obtained when the body content is concatenated and compressed for each combination of SNS client identifier and account identifier, and the total compression rate is the compression rate obtained when the body content is concatenated and compressed for each SNS client identifier; and
means for classifying the SNS client that outputs the obtained post data by using the feature amount as input data of a learning model, wherein the learning model is generated from post data in the SNS with the feature amounts of a plurality of SNS clients as explanatory variables and the classifications of the SNS clients as objective variables.