JP6680666B2

JP6680666B2 - Information analysis device, information analysis system, information analysis method, and information analysis program

Info

Publication number: JP6680666B2
Application number: JP2016227589A
Authority: JP
Inventors: 義裕安藤; 山本　浩司; 浩司山本
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2016-11-24
Filing date: 2016-11-24
Publication date: 2020-04-15
Anticipated expiration: 2036-11-24
Also published as: JP2018084953A

Description

The present invention relates to an information analysis device, an information analysis system, an information analysis method, and an information analysis program.

A technique is known in which a plurality of feature amounts are extracted from user identification information (hereinafter referred to as a user ID) used in an SNS (Social Networking Service) or the like, and machine learning is performed on the extracted feature amounts to detect fraudulent users who have acquired a large number of user IDs.

ZAFARANI, Reza; LIU, Huan. 10 Bits of Surprise: Detecting Malicious Users with Minimum Information. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. ACM, 2015. p. 423431.

However, in the conventional technique, a wide variety of feature amounts are extracted, and depending on the combination of those feature amounts, the detection accuracy for illegally acquired user IDs may not improve.

The present invention has been made in view of these circumstances, and one of its objects is to improve the detection accuracy for illegally acquired user IDs.

One aspect of the present invention is an information analysis device comprising: an acquisition unit that acquires identification information of a user; an extraction unit that extracts, from a character string indicated by the identification information of the user acquired by the acquisition unit, at least some of a feature amount relating to the existence probability of a character string or a character, a feature amount relating to a specific symbol included in the character string, and a feature amount relating to keyboard layouts that differ by region; and a machine learning unit that uses machine learning to select, from among the feature amounts extracted from the character string by the extraction unit, a feature amount for detecting illegally acquired user identification information.

According to one aspect of the present invention, the detection accuracy for illegally acquired user IDs can be improved.

A diagram showing an example of the information analysis system 1 including the information analysis device 100 in the embodiment.
A diagram showing an example of the configuration of the terminal device 10 in the embodiment.
A diagram showing an example of the configuration of the server device 50 in the embodiment.
A diagram showing an example of the account information 54.
A diagram showing an example of the configuration of the information analysis device 100 in the embodiment.
A flowchart showing an example of processing that generates a pattern identification model for solving a binary classification problem.
A diagram showing an example of the teacher data 132.
A diagram showing an example of the feature amount information 134.
A diagram showing an example of a QWERTY-layout keyboard and a DVORAK-layout keyboard.
A diagram showing an example of actual evaluation results.
A flowchart showing an example of processing that uses the generated pattern recognition model to classify unclassified user IDs as positive or negative examples.
A diagram showing an example of the Information Surprise feature amount with and without a limit on the number of characters in the user ID.
A diagram showing an example of a screen displayed on the display unit 13 of the terminal device 10 when a user ID is authenticated.
A diagram showing an example of the hardware configuration of the terminal device 10, the server device 50, and the information analysis device 100 of the embodiment.

Embodiments of the information analysis device, information analysis system, information analysis method, and information analysis program of the present invention are described below with reference to the drawings.

[Overview]
The information analysis device of the embodiment is realized by one or more processors. The information analysis device acquires a user ID and, from the character string indicated by the user ID, extracts a feature amount relating to the existence probability of a character string or a character, a feature amount relating to a specific symbol included in the character string, and a feature amount relating to keyboard layouts that differ by region. The user ID in the present embodiment is, for example, user identification information represented by a character string containing some or all of letters such as the alphabet, digits, and symbols such as the underscore.

The information analysis device uses machine learning to select, from among the plurality of feature amounts extracted from the character string, a feature amount for detecting illegally acquired user IDs. The information analysis device then detects illegally acquired user IDs among the plurality of acquired user IDs based on the selected feature amount. This makes it possible to improve the detection accuracy for illegally acquired user IDs.

Note that "illegally acquired" in the present embodiment means, for example, that a predetermined number or more (for example, 100 or more) of user IDs are acquired within a certain observation period.

[Overall configuration]
FIG. 1 is a diagram showing an example of the information analysis system 1 including the information analysis device 100 in the embodiment. The information analysis system 1 in the embodiment includes one or more terminal devices 10, a server device 50, and the information analysis device 100. These devices are connected to one another via a network NW. The network NW includes, for example, wireless base stations, Wi-Fi access points, communication lines, providers, and the Internet. Not every combination of the devices shown in FIG. 1 needs to be able to communicate with each other, and the network NW may partly include a local network.

The terminal device 10 is a device used by a user. The terminal device 10 is, for example, a computer device such as a mobile phone (for example, a smartphone), a tablet terminal, or a personal computer. For example, the terminal device 10 may be used to register a user ID on a website such as a shopping site, or with a mail service, an SNS service, an information providing service, or the like.

The server device 50 provides various services. For example, the server device 50 may be a web server device that provides, via a web browser launched on the terminal device 10, a website for providing various services. The server device 50 may also be an application server device that exchanges various information by communicating with a terminal device 10 on which a predetermined application program has been launched (executed). Through communication with the server device 50, the terminal device 10 on which the predetermined application program has been launched displays a screen through which various services can be provided. To simplify the description, the server device 50 is assumed below to be a web server device.

For example, the server device 50 authenticates the user ID and confirms the user before providing a service. If authentication shows that the user already has a registered user ID, the server device 50 provides the various services; if the user has no registered user ID, the server device 50 notifies the user that the user ID is unregistered or prompts the user to register one. When the user newly registers a user ID in response, the server device 50 issues the newly registered user ID. In this way, the user can newly acquire a user ID.

The information analysis device 100 communicates with the server device 50 to acquire the user IDs of users who use the services provided by the server device 50, and analyzes these user IDs by machine learning to detect whether any illegally acquired user IDs exist. The machine learning in the present embodiment is supervised learning such as an SVM (Support Vector Machine) or logistic regression.

[Configuration of terminal device]
The configuration of each device is described below. FIG. 2 is a diagram showing an example of the configuration of the terminal device 10 in the embodiment. As illustrated, the terminal device 10 includes, for example, a terminal-side communication unit 11, a reception unit 12, a display unit 13, a terminal-side storage unit 14, and a terminal-side control unit 15.

The terminal-side communication unit 11 communicates with the server device 50 via the network NW. When it receives information from the server device 50, the terminal-side communication unit 11 outputs the received information to the terminal-side control unit 15. The terminal-side communication unit 11 also transmits information to the server device 50 under the control of the terminal-side control unit 15.

The reception unit 12 is a user interface such as a keyboard, buttons, a mouse, a microphone, or a touch panel, and accepts operations from the user. The reception unit 12 may also accept, for example, voice input. When the display unit 13 is a touch panel, part of the reception unit 12 is formed integrally with the display unit 13.

The display unit 13 is a display device such as an LCD (Liquid Crystal Display) or an organic EL (Electroluminescence) display. The display unit 13 displays various images based on information input from the terminal-side control unit 15.

The terminal-side storage unit 14 is realized by, for example, an HDD (Hard Disc Drive), a flash memory, an EEPROM (Electrically Erasable Programmable Read Only Memory), a ROM (Read Only Memory), or a RAM (Random Access Memory).

The terminal-side control unit 15 is realized by, for example, a processor such as a CPU (Central Processing Unit) executing a program stored in the terminal-side storage unit 14. The terminal-side control unit 15 may also be realized by hardware such as an LSI (Large Scale Integration), an ASIC (Application Specific Integrated Circuit), or an FPGA (Field-Programmable Gate Array), or by cooperation of software and hardware.

The terminal-side control unit 15, for example, launches a UA (User Agent) such as a web browser and, when a predetermined operation is performed on the reception unit 12, uses the terminal-side communication unit 11 to transmit an HTTP (Hypertext Transfer Protocol) request to the server device 50. The terminal-side control unit 15 then generates a web screen based on the web page returned from the server device 50 and causes the display unit 13 to display it.

[Configuration of server device]
FIG. 3 is a diagram showing an example of the configuration of the server device 50 in the embodiment. As illustrated, the server device 50 includes, for example, a server-side communication unit 51, a server-side storage unit 52, and a server-side control unit 55. The server-side control unit 55 is an example of an "authentication unit".

The server-side communication unit 51 communicates with the terminal device 10 or the information analysis device 100 via the network NW. When it receives information from the terminal device 10 or the information analysis device 100, the server-side communication unit 51 outputs the received information to the server-side control unit 55. The server-side communication unit 51 also transmits information to the terminal device 10 or the information analysis device 100 under the control of the server-side control unit 55.

The server-side storage unit 52 is realized by, for example, an HDD, a flash memory, an EEPROM, a ROM, or a RAM. The server-side storage unit 52 stores, for example, information for providing a website (hereinafter referred to as website information 53) and account information 54. The website information 53 is information about web pages and includes, for example, text data written in a markup language such as HTML (Hyper Text Markup Language), style sheets, still image data, video data, and audio data. The account information 54 includes information such as the user IDs registered on the website, mail addresses, and passwords.

FIG. 4 is a diagram showing an example of the account information 54. As in the illustrated example, the account information 54 is information in which a mail address, a password, and the like are associated with each user ID.

The server-side control unit 55 is realized by, for example, a processor such as a CPU executing a program stored in the server-side storage unit 52. The server-side control unit 55 may also be realized by hardware such as an LSI, an ASIC, or an FPGA, or by cooperation of software and hardware.

For example, when the server-side communication unit 51 receives an HTTP request from the terminal device 10, the server-side control unit 55 returns a web page for authenticating the user ID to the terminal device 10 via the server-side communication unit 51. When a user ID is entered on the terminal device 10, the server-side control unit 55 compares the entered user ID with the account information 54 and determines whether the entered user ID is already registered.

If the entered user ID is not yet registered, the server-side control unit 55 transmits, via the server-side communication unit 51, information to the terminal device 10 that notifies the user that the user ID is unregistered or prompts the user to register a user ID. When a user ID is newly registered on the terminal device 10, the server-side communication unit 51 receives the newly registered user ID from the terminal device 10. The server-side control unit 55 then adds the new user ID received by the server-side communication unit 51 to the account information 54. A new user ID is thereby issued.

On the other hand, if the entered user ID is already registered, the server-side control unit 55 transmits the website information 53 to the terminal device 10 via the server-side communication unit 51. Using its web browser function, the terminal device 10 then displays, based on the website information 53, a screen rendering a web page through which the various services can be used.
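The registration and authentication flow described above can be sketched as follows. This is a minimal illustration, not the implementation in the embodiment: the class name `AccountStore`, the function `handle_login`, and the returned response strings are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class AccountStore:
    """Hypothetical stand-in for the account information 54 (user ID -> attributes)."""
    accounts: dict = field(default_factory=dict)

    def is_registered(self, user_id: str) -> bool:
        return user_id in self.accounts

    def register(self, user_id: str, mail: str, password: str) -> None:
        # Adding the entry corresponds to issuing a new user ID.
        # A real system would store a password hash, not the plain password.
        self.accounts[user_id] = {"mail": mail, "password": password}


def handle_login(store: AccountStore, user_id: str) -> str:
    """Return which kind of response the server would send for the entered user ID."""
    if store.is_registered(user_id):
        return "send_website_information"
    return "notify_unregistered_and_prompt_registration"
```

In this sketch every call to `register` grows the store, which is the event the information analysis device later examines when many user IDs are acquired in a short period.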

[Configuration of information analysis device]
FIG. 5 is a diagram showing an example of the configuration of the information analysis device 100 in the embodiment. As illustrated, the information analysis device 100 includes, for example, an analysis-device-side communication unit 102, an analysis-device-side control unit 110, and an analysis-device-side storage unit 130.

The analysis-device-side communication unit 102 includes, for example, a communication interface such as a NIC. The analysis-device-side communication unit 102 communicates with the server device 50 via the network NW. When it receives information from the server device 50, the analysis-device-side communication unit 102 outputs the received information to the analysis-device-side control unit 110. For example, the analysis-device-side communication unit 102 receives the account information 54 from the server device 50. The analysis-device-side communication unit 102 also transmits information to the server device 50 under the control of the analysis-device-side control unit 110.

The analysis-device-side control unit 110 includes, for example, an acquisition unit 112, an extraction unit 114, a machine learning unit 116, a detection unit 118, and an output control unit 120. Some or all of these components are realized by a processor such as a CPU executing a program stored in the analysis-device-side storage unit 130. Some or all of the components of the analysis-device-side control unit 110 may also be realized by hardware such as an LSI, an ASIC, or an FPGA, or by cooperation of software and hardware.

The analysis-device-side storage unit 130 is realized by, for example, an HDD, a flash memory, an EEPROM, a ROM, or a RAM. The analysis-device-side storage unit 130 stores, for example, teacher data 132, feature amount information 134, learning condition information 136, learning data 138, and fraudulent ID information 140. These pieces of information are described later.

[Machine learning using teacher data]
First, the processing that generates a pattern identification model for solving a binary classification problem in machine learning is described using a flowchart. The binary classification problem in the present embodiment means classifying each user ID to be learned as either a user ID whose acquisition was normal (ordinary) or a user ID whose acquisition was illegal. A case in which the user ID was acquired normally is treated as a "positive example", and a case in which the user ID was acquired illegally is treated as a "negative example".

FIG. 6 is a flowchart showing an example of the processing that generates a pattern identification model for solving a binary classification problem. First, the acquisition unit 112 refers to the teacher data 132 and acquires user IDs from this data (S100).

FIG. 7 is a diagram showing an example of the teacher data 132. The teacher data 132 is, for example, information in which each user ID is given a flag indicating whether or not it is an illegally acquired user ID. In other words, the teacher data 132 is information for which it is already known whether or not each user ID is fraudulent. For example, a flag of "1" is assigned to an illegally acquired user ID, and a flag of "0" is assigned to a user ID that was acquired normally rather than illegally. For example, the teacher data 132 is information that aggregates user IDs judged to be fraudulent at some point in the past together with user IDs that were in use during the same period and were judged not to be fraudulent.
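As a minimal sketch of how such flagged teacher data might be held in memory, the following uses a list of (user ID, flag) pairs. The user IDs shown are invented for illustration and do not come from the source; `split_by_flag` is a hypothetical helper name.

```python
# Hypothetical teacher data: (user ID, flag).
# Flag 1 = illegally acquired (negative example), flag 0 = normally acquired (positive example).
teacher_data = [
    ("taro_yamada", 0),
    ("hanako1985", 0),
    ("xk2j9q0z81", 1),
    ("qpzm48201a", 1),
]


def split_by_flag(rows):
    """Split the teacher data into normally acquired and illegally acquired user IDs."""
    normal = [user_id for user_id, flag in rows if flag == 0]
    fraud = [user_id for user_id, flag in rows if flag == 1]
    return normal, fraud
```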

Next, for each user ID that the acquisition unit 112 acquired from the teacher data 132, the extraction unit 114 extracts the various feature amounts specified in the feature amount information 134 from the character string indicated by that user ID (S102). For example, the extraction unit 114 extracts, from the character string indicated by the user ID, feature amounts that represent how easy the user ID is to type, the randomness of the characters included in the character string, and so on.

FIG. 8 is a diagram showing an example of the feature amount information 134. As in the illustrated example, the feature amount information 134 indicates what the feature amounts to be extracted are. For example, there are the following 10 types of feature amounts to be extracted. Feature amounts (1) and (10) below are examples of a "feature amount relating to the existence probability of a character string or a character". Feature amounts (2) and (5) are examples of a "feature amount relating to a specific symbol included in the character string", and feature amounts (3), (4), and (6) to (9) are examples of a "feature amount relating to keyboard layouts that differ by region".

(1) Information Surprise
(2) Number of digits included in the character string of the user ID
(3) Proportion of characters in the user ID that are in the TopRow of the QWERTY layout
(4) Proportion of characters in the user ID that are in the TopRow of the DVORAK layout
(5) Proportion of digits in the character string of the user ID
(6) Expected finger travel distance [m] when the user ID is typed on the DVORAK layout
(7) Proportion of characters in the user ID that are in the HomeRow of the QWERTY layout
(8) Expected finger travel distance [m] when the user ID is typed on the QWERTY layout
(9) Proportion of characters in the user ID that are in the BottomRow of the DVORAK layout
(10) Entropy of the user ID (Shannon information)

FIG. 9 is a diagram showing an example of a QWERTY-layout keyboard and a DVORAK-layout keyboard. For example, on a QWERTY-layout keyboard, the characters in the TopRow are "Q, W, E, ..., O, P", one row below the number keys. The characters in the HomeRow are "A, S, D, ..., K, L", one row below the TopRow, and the characters in the BottomRow are "Z, X, C, ..., N, M", one row above the space key (one row below the HomeRow). These characters may also include symbols such as the underscore, slash, comma, and inequality signs.

Depending on the country or region, some of the feature amounts above may be omitted, and other feature amounts may be added. For example, because QWERTY-layout keyboards are predominant in Japan, the feature amounts relating to the DVORAK layout ((4), (6), and (9)) may be omitted.

For example, the extraction unit 114 extracts feature amount (1), Information Surprise, based on the following equations (1) and (2).

I(u) in equation (1) represents the entropy value of Information Surprise, which is the feature amount. In equations (1) and (2), u represents the character string of the target user ID, p(u) represents the existence probability of the character string u, and m represents the length (number of characters) of the character string u. In equation (2), c_i represents the i-th character in the target character string u.
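Equations (1) and (2) themselves are not reproduced in this text. A form consistent with the symbol definitions above and with the standard definition of information surprise under a character n-gram model would be the following; the base-2 logarithm and the exact conditioning window are assumptions, not taken from the source.

```latex
I(u) = -\log_2 p(u) \tag{1}

p(u) = \prod_{i=1}^{m} p\left(c_i \mid c_{i-(n-1)} \dots c_{i-1}\right) \tag{2}
```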

For example, as shown in equation (2), the extraction unit 114 uses the n-gram method to split the character string u while shifting through it n characters (for example, n = 6) at a time, and derives the existence probability p(c_i | c_{i-(n-1)} ...) with which each character c_i included in a split character string u appears among all character strings. The extraction unit 114 derives the existence probability p of the character c_i for each character string u split by the n-gram method, and multiplies together all of the existence probabilities p of the characters c_i obtained for the split character strings u to derive the existence probability p(u) of the character string u.

The extraction unit 114 then substitutes the existence probability p(u) of the character string u derived from equation (2) into equation (1) to derive I(u), which represents the entropy value of Information Surprise. Feature amount (1) is thereby extracted.
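The derivation above can be sketched with a standard character n-gram model. This is an illustration under stated assumptions rather than the method of the embodiment: it assumes I(u) = -log2 p(u), conditions each character on the preceding n-1 characters, pads the start of the string with a hypothetical "^" marker, and applies add-one smoothing with an assumed alphabet size so that unseen n-grams do not yield zero probability. The function names are hypothetical.

```python
import math
from collections import Counter


def train_ngram_counts(corpus, n):
    """Count n-grams and their (n-1)-character contexts over a corpus of user IDs."""
    ngrams, contexts = Counter(), Counter()
    for user_id in corpus:
        padded = "^" * (n - 1) + user_id  # pad so the first characters have a context
        for i in range(n - 1, len(padded)):
            context = padded[i - (n - 1):i]
            ngrams[context + padded[i]] += 1
            contexts[context] += 1
    return ngrams, contexts


def information_surprise(user_id, ngrams, contexts, n, alphabet_size=64):
    """Return -log2 p(u), where p(u) is the product of per-character probabilities."""
    padded = "^" * (n - 1) + user_id
    log_p = 0.0
    for i in range(n - 1, len(padded)):
        context = padded[i - (n - 1):i]
        # Add-one smoothing (an assumption, not stated in the source).
        numerator = ngrams[context + padded[i]] + 1
        denominator = contexts[context] + alphabet_size
        # Summing logarithms is equivalent to multiplying the probabilities.
        log_p += math.log2(numerator / denominator)
    return -log_p
```

A user ID made of character sequences that are common in the training corpus receives a lower value than one made of rare sequences, which is what makes the value usable as a randomness indicator.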

The extraction unit 114 also extracts feature amount (2) by counting the digits 0 through 9 included in the character string of the user ID.

The extraction unit 114 also extracts feature amount (3) by deriving the ratio of the number of "Q, W, E, ..., O, P" characters included in the character string of the user ID to the total number of characters included in that character string.

The extraction unit 114 also extracts the feature amount (4) by deriving the ratio of the number of characters "P, Y, F, ..., R, L" included in the character string of the user ID to the total number of characters in that character string.

The extraction unit 114 also extracts the feature amount (5) by deriving the ratio of the number of digits 0 to 9 included in the character string of the user ID to the total number of characters in that character string.
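The counting and ratio features (2), (3), and (5) can be sketched as below. The function names are hypothetical, and the character set `QWERTY_TOP_ROW` is an assumption: the text lists the characters only as "Q, W, E, ..., O, P", which is read here as the ten letters of the top letter row of a QWERTY keyboard. Matching is made case-insensitive, which the source does not specify.

```python
DIGITS = set("0123456789")
# Assumed expansion of "Q, W, E, ..., O, P" (top letter row of QWERTY).
QWERTY_TOP_ROW = set("QWERTYUIOP")


def digit_count(user_id):
    """Feature (2): how many of the characters are digits 0 to 9."""
    return sum(ch in DIGITS for ch in user_id)


def char_ratio(user_id, charset):
    """Share of characters in user_id that belong to charset."""
    if not user_id:
        return 0.0
    return sum(ch.upper() in charset for ch in user_id) / len(user_id)


def digit_ratio(user_id):
    """Feature (5): digits as a share of all characters."""
    return char_ratio(user_id, DIGITS)
```

For instance, `char_ratio("taro", QWERTY_TOP_ROW)` counts "t", "r", and "o" out of four characters. The same helper could serve features (4), (7), and (9) with other character sets.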

The extraction unit 114 also extracts the feature amount (6) by treating a keyboard with the DVORAK layout as a two-dimensional plane and, based on the relative positions of the keys on that plane, deriving the distance the user's finger is expected to travel when the characters of the character string are typed in order. For example, the extraction unit 114 takes the lower-left key of the bottom row (the Ctrl key) as the origin O(0, 0) and determines the position of each key in the DVORAK layout as coordinates relative to the origin O. The coordinates of every key, including the key assigned to the origin O, may be, for example, the center of the key-top area of that key. The extraction unit 114 splits the character string indicated by the user ID into individual characters and derives the coordinates of the key corresponding to each character. The extraction unit 114 then derives the distances between the coordinates of the keys for consecutive characters, in string order. For example, if the character string is "ABC", the extraction unit 114 adds the distance from the "A" key to the "B" key and the distance from the "B" key to the "C" key, multiplies this total distance by a scaling factor based on the assumed actual size of the keyboard, and divides the product (total distance × scaling factor) by a predetermined value (for example, 100) to derive the finger travel distance. The feature amount (6) is extracted in this way.
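The finger-travel computation can be sketched as follows. The key coordinates are hypothetical: letter rows are placed on a unit grid with a half-key stagger per row, measured from an origin on the bottom row, whereas the embodiment uses key-top centers of an actual keyboard. Skipping characters that are not on the letter rows (digits, for example) is likewise an assumption, as are the names `key_coordinates` and `typing_distance`.

```python
import math

# Letter rows, top to bottom. The DVORAK rows include three punctuation keys.
DVORAK_ROWS = ["',.PYFGCRL", "AOEUIDHTNS", ";QJKXBMWVZ"]
QWERTY_ROWS = ["QWERTYUIOP", "ASDFGHJKL", "ZXCVBNM"]


def key_coordinates(rows, stagger=0.5):
    """Map each key to hypothetical (x, y) coordinates relative to the origin."""
    coords = {}
    for r, row in enumerate(rows):
        y = float(len(rows) - r)  # letter rows sit above the bottom (Ctrl) row
        for c, key in enumerate(row):
            coords[key] = (c + r * stagger, y)
    return coords


def typing_distance(user_id, coords, scale=1.0, divisor=100.0):
    """Sum the key-to-key distances in typing order, then scale and divide."""
    keys = [coords[ch.upper()] for ch in user_id if ch.upper() in coords]
    total = sum(math.dist(a, b) for a, b in zip(keys, keys[1:]))
    return total * scale / divisor
```

Passing `key_coordinates(QWERTY_ROWS)` instead of the DVORAK coordinates gives the analogous distance on a QWERTY layout.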

The extraction unit 114 also extracts the feature amount (7) by deriving the ratio of the number of characters "A, S, D, ..., K, L" included in the character string of the user ID to the total number of characters in that character string.

In the same way as for the feature amount (6), the extraction unit 114 extracts the feature amount (8) by treating a keyboard with the QWERTY layout as a two-dimensional plane and, based on the relative positions of the keys on that plane, deriving the distance the user's finger is expected to travel when the characters of the character string are typed in order.

The extraction unit 114 also extracts the feature amount (9) by deriving the ratio of the number of characters "Q, J, K, ..., V, Z" included in the character string of the user ID to the total number of characters in that character string.

The extraction unit 114 also extracts (10), the entropy feature amount of the user ID, based on Expression (3) below.

In Expression (3), H(u) denotes the entropy value of the user ID. For example, the extraction unit 114 derives the entropy value H(u) of the user ID based on the defining formula of Shannon's information content (average information content) shown in Expression (3). The feature amount (10) is extracted in this way.
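A minimal sketch of feature amount (10) follows. It assumes that Expression (3) is the standard Shannon entropy over the relative frequencies of the characters within the user ID, H(u) = -Σ p(c) log2 p(c); how the embodiment estimates the probabilities is not stated here. The function name is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(user_id):
    """Average information content, in bits per character, of user_id."""
    if not user_id:
        return 0.0
    total = len(user_id)
    counts = Counter(user_id)  # occurrences of each distinct character
    return -sum((k / total) * math.log2(k / total) for k in counts.values())
```

A string that repeats a single character has entropy 0, while a string of four distinct characters has entropy 2 bits.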

The description now returns to the flowchart of FIG. 6. Next, the machine learning unit 116 performs machine learning using some or all of the feature amounts extracted by the extraction unit 114 (S104), and generates a pattern identification model for classifying each user ID from which the feature amounts were extracted as a positive example or a negative example.

For example, with an SVM, the machine learning unit 116 treats each of the feature amounts extracted by the extraction unit 114 as a feature and, in a feature space whose feature vectors are built from these features, derives as the pattern identification model a hyperplane (a subspace with one dimension fewer than the feature space) that separates the feature vectors into positive and negative examples. In doing so, the machine learning unit 116 derives the hyperplane so that, in the teacher data 132, user IDs flagged "0" are classified as positive examples and user IDs flagged "1" are classified as negative examples.

When logistic regression is used as the machine learning method, the machine learning unit 116 treats each of the feature amounts extracted by the extraction unit 114 as an independent variable and the positive or negative label as the dependent variable, and derives a logistic curve (another example of the pattern identification model).

The machine learning unit 116 then evaluates the derived pattern identification model (S106). For example, the machine learning unit 116 evaluates the pattern identification models of both the SVM and logistic regression using the F value (F-measure). The F value is an index of how closely the classification of user IDs by the pattern identification model matches the true result. The F value is an example of a "score". For example, the F value is derived based on Expressions (4) to (6) below.

Precision is the proportion of user IDs classified as positive examples by the pattern identification model that are actually positive examples (user IDs flagged "0" in the teacher data 132). TP is the number of user IDs for which the classification by the pattern identification model is positive and the true result is also positive, and FP is the number of user IDs for which the classification by the pattern identification model is positive and the true result is negative. Recall is the proportion of user IDs that are actually positive examples that the pattern identification model classified as positive examples. FN is the number of user IDs for which the classification by the pattern identification model is negative and the true result is positive. For example, an F value (F-measure) of 100 [%] means that the teacher data 132 was classified into positive and negative examples without error.
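The score can be sketched as below, assuming that Expressions (4) to (6) are the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and the F value as the harmonic mean of the two. The function name `f_measure` and the handling of zero denominators are assumptions.

```python
def f_measure(tp, fp, fn):
    """Return the F value (in the range 0 to 1) from TP, FP, and FN counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2.0 * precision * recall / (precision + recall)
```

Multiplying the result by 100 gives the percentage form used in the text.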

The machine learning unit 116 determines the learning conditions based on the evaluation result (F value) of the pattern identification model for each combination of feature amounts (S108). The learning conditions include (1) specifying which combination of the feature amounts extracted by the extraction unit 114 is used in machine learning, (2) restricting the number of characters of the user IDs subject to machine learning (for example, excluding user IDs with fewer than 10 characters), and (3) selecting a suitable method from among a plurality of machine learning methods. The learning conditions determined by the machine learning unit 116 are stored in the analysis-device-side storage unit 130 as learning condition information 136.

The number of characters of the user ID is restricted in order to suppress the influence of noise in machine learning. In general, once a service has passed its growth (transitional) period and entered maturity (a steady state), the user IDs acquired by users of that service tend to converge to a certain number of characters or more. This is because, as time passes, it becomes more likely that a requested user ID duplicates one that has already been acquired. Restricting the number of characters therefore makes it possible to exclude user IDs whose length differs from that of the user IDs most likely to have been illegally acquired. In other words, user IDs that are unlikely to have been illegally acquired can be excluded.

FIG. 10 shows an example of actual evaluation results. As shown in (a) of the figure, the number of user IDs of users who successfully logged in (authenticated) during a certain observation period (normal users) was about 2.4 × 10^6, and the number of user IDs of users who acquired 100 or more user IDs during the same period (unauthorized users) was about 12.1 × 10^3. Of the user IDs of unauthorized users, about 9.4 × 10^3 had 10 or more characters.

The analysis-device-side control unit 110 treated the data in (a) as learning data 138 and performed machine learning with both the SVM and logistic regression to evaluate the pattern identification model of each method. To account for class imbalance, the number of normal-user IDs taken from the observation data in (a) as learning data 138 was made roughly equal to the number of unauthorized-user IDs. For the SVM, a soft-margin SVM (C-SVM) was used to allow for overlap between feature vectors in the feature space (that is, for cases where the feature vectors cannot be linearly separated). For logistic regression, L1-regularized logistic regression was used to suppress overfitting. K-fold cross-validation (for example, K = 10) was used when deriving the F value.
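The data preparation described above can be sketched as follows. This is an illustration under assumptions: random undersampling is one way to make the two classes roughly equal in size, and the text does not say which method was actually used; the fold assignment is a generic K-fold split. The names `balance_classes` and `k_fold_indices` and the `seed` parameter are hypothetical.

```python
import random


def balance_classes(normal_ids, unauthorized_ids, seed=0):
    """Undersample both lists to the size of the smaller class."""
    rng = random.Random(seed)
    k = min(len(normal_ids), len(unauthorized_ids))
    return rng.sample(normal_ids, k), rng.sample(unauthorized_ids, k)


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for K-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test
```

Each sample appears in exactly one test fold, so an F value averaged over the K folds uses every sample once for evaluation.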

Part (b) of the same figure shows the evaluation result of each pattern identification model. In the illustrated example, with no restriction on the length (number of characters) of the user ID, the learning condition with the highest F value was a soft-margin SVM (C-SVM) trained on all ten feature amounts (1) to (10) (F value = 85.49 [%]). With the number of characters of the user ID restricted to 10 or more, the learning condition with the highest F value was a soft-margin SVM (C-SVM) trained on the seven feature amounts (1), (2), (4) to (7), and (10) (F value = 89.77 [%]).

When the number of characters of the user ID is restricted, the Information Surprise feature amount changes, and so the resulting F value changes. As described above, the entropy value I(u) of Information Surprise derives from the existence probability p(u) of the character string u of the target user ID, so the shorter the character string u, the larger its existence probability p(u). As a result, I(u) becomes larger and the F value improves.

In this way, the machine learning unit 116 refers to the F values of the evaluation results, selects the learning condition with the highest F value, and sets that learning condition as the parameters for subsequent learning. In the example of FIG. 10, the maximum F value is 89.77 [%], so the machine learning unit 116 determines the learning conditions as follows: the combination of feature amounts used in machine learning is the seven feature amounts (1), (2), (4) to (7), and (10); the user ID length restriction is 10 or more characters; and the machine learning method is the SVM (C-SVM).
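The selection of the learning condition can be sketched as a search for the maximum F value. The two records below are the two results reported for FIG. 10 in the text; the record layout and the names `results` and `select_learning_condition` are hypothetical.

```python
# The two best results reported for FIG. 10 (F values in percent).
results = [
    {"method": "C-SVM", "min_length": None,
     "features": (1, 2, 3, 4, 5, 6, 7, 8, 9, 10), "f_value": 85.49},
    {"method": "C-SVM", "min_length": 10,
     "features": (1, 2, 4, 5, 6, 7, 10), "f_value": 89.77},
]


def select_learning_condition(evaluations):
    """Return the evaluated condition with the highest F value."""
    return max(evaluations, key=lambda record: record["f_value"])


best = select_learning_condition(results)
```

With these two records, `best` is the condition that restricts user IDs to 10 or more characters and uses seven feature amounts.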

[Machine learning using learning data]
After the learning conditions have been determined by the processing of the flowchart described above, the analysis-device-side control unit 110 uses the generated pattern recognition model to classify, as positive or negative examples, those user IDs registered by the server device 50 that were not used as the teacher data 132.

FIG. 11 is a flowchart showing an example of the process of classifying unclassified user IDs as positive or negative examples using the generated pattern recognition model. First, the acquisition unit 112 refers to the learning data 138 and acquires user IDs from it (S200).

The learning data 138 is the set of user IDs registered by the server device 50 that were not used as the teacher data 132 and for which no determination has yet been made as to whether they are fraudulent. A user ID that was judged not to be fraudulent at some past time in the teacher data 132 may still be in use today, so the learning data 138 may include user IDs flagged "0" in the teacher data 132.

Next, the extraction unit 114 extracts the ten feature amounts (1) to (10) from the user IDs (unclassified user IDs) acquired by the acquisition unit 112 (S202).

Next, the machine learning unit 116 performs machine learning according to the learning conditions determined using the teacher data 132 (S204). For example, when following the learning conditions determined in the example of FIG. 10 described above, the machine learning unit 116 selects, from the ten feature amounts extracted by the extraction unit 114, the seven feature amounts (1), (2), (4) to (7), and (10), and performs machine learning with the SVM (C-SVM) using these seven feature amounts as features. In doing so, the machine learning unit 116 restricts the user IDs classified as negative examples to those with 10 or more characters.

Next, based on the result of the machine learning by the machine learning unit 116, the detection unit 118 detects illegally acquired user IDs among the user IDs included in the learning data 138 (S206). For example, when machine learning with the SVM has been performed, the detection unit 118 extracts the feature vectors (features) classified as negative examples in the feature space and identifies the user IDs from which the feature amounts of those feature vectors were extracted, thereby detecting the illegally acquired user IDs. The unauthorized user IDs detected by the detection unit 118 are stored in the analysis-device-side storage unit 130 as unauthorized ID information 140.

Instead of detecting illegally acquired user IDs based on the result of the machine learning by the machine learning unit 116, the detection unit 118 may detect them based on the Information Surprise feature amount extracted by the extraction unit 114.

FIG. 12 shows an example of the Information Surprise feature amount with and without a restriction on the number of characters of the user ID. In the figure, (a) shows the Information Surprise feature amount when the number of characters of the user ID is not restricted, and (b) shows it when the user ID is restricted to 10 or more characters. In both, the horizontal axis is the Information Surprise feature amount normalized by its standard deviation, and the vertical axis is the existence probability p(u) of the character string u from which the Information Surprise feature amount was extracted.

For example, when the Information Surprise feature amount is at or above a threshold TH1 (for example, 250) and the existence probability p(u) of the character string u is at or above a threshold TH2 (for example, 5 [%]), the detection unit 118 may detect the user ID from which that Information Surprise feature amount was extracted as an illegally acquired user ID.
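This threshold rule can be sketched as below. The default values are the examples given in the text (TH1 = 250, TH2 = 5 [%], written here as 0.05); the function name and the assumption that both inputs are already on the scales plotted in FIG. 12 are hypothetical.

```python
def is_illegally_acquired(surprise, probability, th1=250.0, th2=0.05):
    """Flag a user ID when both values are at or above their thresholds."""
    return surprise >= th1 and probability >= th2
```

A user ID is flagged only when both conditions hold; either value falling below its threshold leaves the ID unflagged.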

Next, the output control unit 120 uses the analysis-device-side communication unit 102 to transmit the unauthorized ID information 140, which is the detection result of the detection unit 118, to the server device 50 (S208). This ends the processing of this flowchart.

On receiving the unauthorized ID information 140 from the information analysis device 100, the server device 50 may prohibit use of the service with the user IDs included in the unauthorized ID information 140, or may change the authentication method for those user IDs.

FIG. 13 shows an example of a screen displayed on the display unit 13 of the terminal device 10 when a user ID is authenticated. For example, the server-side control unit 55 determines whether the user ID received from the terminal device 10 by the server-side communication unit 51 is included in the unauthorized ID information 140. That is, the server-side control unit 55 determines whether the user ID entered at authentication is an illegally acquired user ID. If the user ID entered at authentication is not included in the unauthorized ID information 140, the server-side control unit 55 judges it to be a normal user ID and provides the service via the website.

If, on the other hand, the user ID entered at authentication is included in the unauthorized ID information 140, the server-side control unit 55 causes the display unit 13 of the terminal device 10 to display a screen requesting additional image authentication. For IDs that are highly likely to be unauthorized user IDs, raising the difficulty of authentication in this way makes it possible to suppress use of the service. Instead of or in addition to image authentication, the server-side control unit 55 may perform keyword authentication, which requests input of preset information (for example, a date of birth or a family member's name), or other authentication. The server-side control unit 55 may also raise the difficulty of the image authentication itself by increasing the number of characters in the displayed image or by increasing the degree of character distortion. That is, the server-side control unit 55 may suppress use of the service with illegally acquired user IDs by increasing the number of authentication steps or by raising the difficulty of each authentication.
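The server-side decision can be sketched as a lookup followed by a choice of additional challenges. The function name, the challenge labels, and the `use_keyword` switch are hypothetical; the source describes the behavior only in prose.

```python
def additional_challenges(user_id, unauthorized_ids, use_keyword=False):
    """Return the extra authentication steps to request for user_id."""
    if user_id not in unauthorized_ids:
        return []  # treated as a normal user ID: no extra step
    challenges = ["image"]  # image authentication for flagged IDs
    if use_keyword:
        challenges.append("keyword")  # e.g. preset date of birth
    return challenges
```

Holding `unauthorized_ids` as a set keeps the membership test fast even when the unauthorized ID information is large.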

According to the embodiment described above, the detection accuracy for illegally acquired user IDs can be improved by providing: the acquisition unit 112, which acquires a user ID; the extraction unit 114, which extracts at least some of the feature amounts (1) to (10) from the character string indicated by the user ID acquired by the acquisition unit 112; and the machine learning unit 116, which uses machine learning to select, from the feature amounts extracted from the character string by the extraction unit 114, the feature amounts for detecting illegally acquired user IDs (for example, the feature amounts (1), (2), (4) to (7), and (10)).

Further, according to the embodiment described above, restricting the number of characters of the user IDs classified as negative examples suppresses the influence of noise in machine learning.

Further, according to the embodiment described above, illegally acquired user IDs are detected based on the feature amounts selected using machine learning, and when a detected user ID is used at authentication for the service, the number of authentication steps is increased or the difficulty of each authentication is raised, which suppresses use of the service with illegally acquired user IDs.

<Hardware configuration>
The terminal device 10, the server device 50, and the information analysis device 100 of the embodiment described above are realized by, for example, the hardware configuration shown in FIG. 14. FIG. 14 shows an example of the hardware configuration of the terminal device 10, the server device 50, and the information analysis device 100 of the embodiment. The figure shows an example in which the terminal device 10 is a smartphone.

In the terminal device 10, a CPU 10-1, a RAM 10-2, a ROM 10-3, a secondary storage device 10-4 such as flash memory, a touch panel 10-5, and a wireless communication module 10-6 are connected to one another by an internal bus or dedicated communication lines. The wireless communication module 10-6 connects to the network NW by accessing a wireless base station. The wireless communication module 10-6 corresponds to the terminal-side communication unit 11, and the touch panel 10-5 corresponds to the reception unit 12 and the display unit 13. The RAM 10-2, the ROM 10-3, and the secondary storage device 10-4 correspond to the terminal-side storage unit 14. The terminal-side control unit 15 is realized when a program stored in the secondary storage device 10-4 is loaded into the RAM 10-2 by a DMA controller (not shown) or the like and executed by the CPU 10-1.

In the server device 50, a NIC 50-1, a CPU 50-2, a RAM 50-3, a ROM 50-4, a secondary storage device 50-5 such as flash memory or an HDD, and a drive device 50-6 are connected to one another by an internal bus or dedicated communication lines. A portable storage medium such as an optical disk is mounted in the drive device 50-6. The NIC 50-1 corresponds to the server-side communication unit 51, and the RAM 50-3, the ROM 50-4, and the secondary storage device 50-5 correspond to the server-side storage unit 52. The server-side control unit 55 is realized when a program stored in the secondary storage device 50-5, or in the portable storage medium mounted in the drive device 50-6, is loaded into the RAM 50-3 by a DMA controller (not shown) or the like and executed by the CPU 50-2. The program referenced by the server-side control unit 55 may be downloaded from another device via the network NW.

In the information analysis device 100, a NIC 100-1, a CPU 100-2, a RAM 100-3, a ROM 100-4, a secondary storage device 100-5 such as flash memory or an HDD, and a drive device 100-6 are connected to one another by an internal bus or dedicated communication lines. A portable storage medium such as an optical disk is mounted in the drive device 100-6. The NIC 100-1 corresponds to the analysis-device-side communication unit 102, and the RAM 100-3, the ROM 100-4, and the secondary storage device 100-5 correspond to the analysis-device-side storage unit 130. Each functional unit of the analysis-device-side control unit 110 is realized when a program stored in the secondary storage device 100-5, or in the portable storage medium mounted in the drive device 100-6, is loaded into the RAM 100-3 by a DMA (Direct Memory Access) controller (not shown) or the like and executed by the CPU 100-2. The program referenced by the analysis-device-side control unit 110 may be downloaded from another device via the network NW.

Modes for carrying out the present invention have been described above using an embodiment, but the present invention is in no way limited to that embodiment, and various modifications and substitutions can be made without departing from the gist of the present invention.

Description of symbols: 1 … information analysis system, 10 … terminal device, 11 … terminal-side communication unit, 12 … reception unit, 13 … display unit, 14 … terminal-side storage unit, 15 … terminal-side control unit, 50 … server device, 51 … server-side communication unit, 52 … server-side storage unit, 55 … server-side control unit, 100 … information analysis device, 102 … analysis-device-side communication unit, 110 … analysis-device-side control unit, 112 … acquisition unit, 114 … extraction unit, 116 … machine learning unit, 118 … detection unit, 120 … output control unit, 130 … analysis-device-side storage unit, 132 … teacher data, 134 … feature amount information, 136 … learning condition information, 138 … learning data, 140 … unauthorized ID information, NW … network

Claims

An information analysis device comprising:
an acquisition unit that acquires identification information of a user;
an extraction unit that extracts, from a character string indicated by the identification information of the user acquired by the acquisition unit, at least some of a feature amount relating to an existence probability of a character string or a character, a feature amount relating to a specific symbol included in the character string, and a feature amount relating to a keyboard layout that differs by region; and
a machine learning unit that uses machine learning to select, from the feature amounts extracted from the character string by the extraction unit, a feature amount for detecting illegally acquired identification information of a user.

The information analysis device according to claim 1,
wherein the feature amount relating to the existence probability of the character string or the character is a feature amount based on an entropy value of the existence probability of the character string or the character.
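
As an illustration only, an entropy-based feature of this kind could be computed as in the following minimal sketch. It assumes that the existence probability of each character is estimated from its relative frequency within the single character string being analyzed; the claim does not fix how the probability is estimated, and the function name `char_entropy` is hypothetical.

```python
import math
from collections import Counter


def char_entropy(user_id: str) -> float:
    """Shannon entropy, in bits per character, of the characters in user_id.

    Assumption: the existence probability of a character is its relative
    frequency inside this one string (not a corpus-wide estimate).
    """
    if not user_id:
        return 0.0
    counts = Counter(user_id)
    n = len(user_id)
    # Sum of p * log2(1 / p) over the distinct characters of the string.
    return sum((c / n) * math.log2(n / c) for c in counts.values())
```

Under this definition the value cannot exceed log2 of the string length, so longer strings can reach higher values than shorter ones.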

The information analysis device according to claim 1,
wherein the machine learning unit
solves a binary classification problem that classifies into positive examples and negative examples, using each of the plurality of feature amounts extracted by the extraction unit as a feature, and
selects a combination of feature amounts having the highest score in the binary classification problem as the feature amount for detecting the illegally acquired identification information of the user.
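
The selection of a feature combination by its score in a binary classification problem can be sketched as follows. This is not the learner or score of the embodiment: the sketch assumes a nearest-centroid classifier, leave-one-out accuracy as the score, and an exhaustive search over all combinations, and every function name is hypothetical.

```python
from itertools import combinations


def centroid(rows):
    """Column-wise mean of a non-empty list of equal-length rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def loo_accuracy(X, y, cols):
    """Leave-one-out accuracy of a nearest-centroid classifier on columns cols.

    Assumption: each class (1 = positive, 0 = negative) has at least two
    samples, so neither class is empty after one sample is held out.
    """
    correct = 0
    for i, query in enumerate(X):
        rest = [(x, label) for j, (x, label) in enumerate(zip(X, y)) if j != i]
        pos = centroid([[x[c] for c in cols] for x, label in rest if label == 1])
        neg = centroid([[x[c] for c in cols] for x, label in rest if label == 0])
        q = [query[c] for c in cols]
        d_pos = sum((a - b) ** 2 for a, b in zip(q, pos))
        d_neg = sum((a - b) ** 2 for a, b in zip(q, neg))
        predicted = 1 if d_pos < d_neg else 0
        correct += predicted == y[i]
    return correct / len(X)


def select_best_combination(X, y, feature_names):
    """Return (names, score) of the highest-scoring feature combination.

    Ties keep the combination found first, i.e. the smallest one.
    """
    best_score, best_cols = -1.0, ()
    for k in range(1, len(feature_names) + 1):
        for cols in combinations(range(len(feature_names)), k):
            score = loo_accuracy(X, y, cols)
            if score > best_score:
                best_score, best_cols = score, cols
    return [feature_names[c] for c in best_cols], best_score
```

An exhaustive search grows exponentially with the number of feature amounts, so it is practical only for a small candidate set.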

The information analysis device according to claim 3,
wherein the feature amount relating to the existence probability of the character string or the character is a feature amount whose value varies according to the length of the character string, and
the machine learning unit derives the score while restricting the length of the character string indicated by the identification information of the user.

The information analysis device according to any one of claims 1 to 4,
wherein, when the region is Japan, the feature amount relating to the keyboard layout is a feature amount relating to the QWERTY layout.
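
One possible feature amount relating to the QWERTY layout is sketched below. The specific feature, the fraction of consecutive letters that sit next to each other on the same keyboard row, is an assumption chosen for illustration and is not taken from the claims; `QWERTY_ROWS`, `KEY_POS`, and `same_row_adjacency_ratio` are hypothetical names.

```python
# Letter rows of a QWERTY keyboard, top to bottom.
QWERTY_ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")

# Map each letter to its (row, column) position within the letter block.
KEY_POS = {
    ch: (r, c)
    for r, row in enumerate(QWERTY_ROWS)
    for c, ch in enumerate(row)
}


def same_row_adjacency_ratio(user_id: str) -> float:
    """Fraction of consecutive letter pairs that are same-row neighbors.

    Non-letter characters are skipped, so two letters separated only by
    digits or symbols are treated as consecutive (a simplification).
    """
    keys = [KEY_POS[ch] for ch in user_id.lower() if ch in KEY_POS]
    pairs = list(zip(keys, keys[1:]))
    if not pairs:
        return 0.0
    adjacent = sum(
        1 for (r1, c1), (r2, c2) in pairs if r1 == r2 and abs(c1 - c2) == 1
    )
    return adjacent / len(pairs)
```

A string typed by sliding along one row, such as "qwerty", scores 1.0 under this definition, while "qaz" scores 0.0.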

The information analysis device according to any one of claims 1 to 5,
further comprising a detection unit that detects, based on the feature amount selected by the machine learning unit, the illegally acquired identification information of the user from among the identification information of a plurality of users acquired by the acquisition unit.

The information analysis device according to any one of claims 1 to 5,
further comprising a detection unit that detects the illegally acquired identification information of the user from among the identification information of a plurality of users acquired by the acquisition unit, based on the feature amount relating to the existence probability of the character string or the character among the plurality of feature amounts extracted by the extraction unit.

The information analysis device according to claim 7,
wherein, when the feature amount relating to the existence probability of the character string or the character exceeds a threshold value, the detection unit detects the identification information of the user from which that feature amount was extracted as the illegally acquired identification information of the user.
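
The threshold-based detection can be sketched as follows, again as an illustration rather than the implementation of the embodiment. The sketch assumes the feature amount is the per-string character entropy and that the threshold is supplied by the caller; the function names and the threshold value used in the note below are hypothetical.

```python
import math
from collections import Counter


def char_entropy(user_id: str) -> float:
    """Shannon entropy (bits per character) of the characters in user_id."""
    if not user_id:
        return 0.0
    n = len(user_id)
    return sum((c / n) * math.log2(n / c) for c in Counter(user_id).values())


def detect_illegal_ids(user_ids, threshold):
    """Return the IDs whose feature amount strictly exceeds the threshold."""
    return [uid for uid in user_ids if char_entropy(uid) > threshold]
```

For example, with a threshold of 1.5 bits, "x7q2" (entropy 2.0) would be detected while "abab" (entropy 1.0) would not.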

An information analysis device comprising:
an acquisition unit that acquires identification information of a user;
an extraction unit that extracts a feature amount relating to an existence probability of a character string or a character from a character string indicated by the identification information of the user acquired by the acquisition unit; and
a detection unit that, when the feature amount relating to the existence probability of the character string or the character exceeds a threshold value, detects the identification information of the user from which that feature amount was extracted as illegally acquired identification information of a user.

An information analysis system comprising:
the information analysis device according to any one of claims 7 to 9;
a reception unit that receives an input operation of identification information of a user; and
an authentication unit that authenticates the user based on the input operation of the identification information of the user received by the reception unit,
wherein the authentication unit changes a difficulty level of the authentication when the reception unit receives identification information of a user that the detection unit has detected as illegally acquired identification information of a user.
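
The change in authentication difficulty can be pictured with the short sketch below. It assumes that the difficulty is raised by adding one extra challenge for detected identification information; the step names "password" and "captcha" and the function `authentication_steps` are hypothetical and do not appear in the claims.

```python
BASE_STEPS = ["password"]
EXTRA_STEPS = ["captcha"]  # hypothetical additional challenge


def authentication_steps(user_id, flagged_ids):
    """Return the authentication steps required for the entered user_id.

    flagged_ids is the set of IDs that the detection unit has detected as
    illegally acquired; those IDs get the additional challenge.
    """
    if user_id in flagged_ids:
        return BASE_STEPS + EXTRA_STEPS
    return list(BASE_STEPS)
```

Keeping the detected IDs in a set makes the lookup at input time a constant-time membership test.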

An information analysis method in which a computer:
acquires identification information of a user;
extracts, from a character string indicated by the acquired identification information of the user, at least some of a feature amount relating to an existence probability of a character string or a character, a feature amount relating to a specific symbol included in the character string, and a feature amount relating to a keyboard layout that differs by region; and
selects, using machine learning, a feature amount for detecting illegally acquired identification information of a user from the feature amounts extracted from the character string.

An information analysis program that causes a computer to:
acquire identification information of a user;
extract, from a character string indicated by the acquired identification information of the user, at least some of a feature amount relating to an existence probability of a character string or a character, a feature amount relating to a specific symbol included in the character string, and a feature amount relating to a keyboard layout that differs by region; and
select, using machine learning, a feature amount for detecting illegally acquired identification information of a user from the feature amounts extracted from the character string.