JP2016177728A - Data analysis apparatus and data analysis method

Info

Publication number: JP2016177728A
Application number: JP2015059112A
Authority: JP
Inventors: 光晴大峡; Mitsuharu Ohazama
Original assignee: Hitachi Solutions Ltd
Current assignee: Hitachi Solutions Ltd
Priority date: 2015-03-23
Filing date: 2015-03-23
Publication date: 2016-10-06

Abstract

PROBLEM TO BE SOLVED: To provide a technique capable of appropriately evaluating the similarity among screen transitions for the analysis of user behavior patterns on a website.
SOLUTION: Similar screen transitions are clustered so that they belong to the same cluster, and the frequency of each cluster is aggregated, thereby quantitatively representing user behavior patterns on the website. More specifically, the system includes a function that aggregates access logs and derives per-session data; a function that aggregates the screen transition patterns in the per-session data; a function that clusters the screen transition patterns; and a function that aggregates, for every user, the ratio of the clustered screen transition patterns.
SELECTED DRAWING: Figure 1

Description

The present invention relates to a data analysis apparatus and a data analysis method, for example, to a technique for analyzing users' screen transition patterns based on the access log of a website.

Big data analysis technology has attracted attention, and attempts are being made in various fields to obtain business-relevant knowledge by analyzing data. Data analysis techniques have also been proposed for planning measures to expand sales and for studying countermeasures against fraudulent logins. In particular, for products and services provided through websites, many data analysis techniques have been proposed because customer behavior can be captured as access logs.

Non-Patent Document 1 and Patent Document 1 propose a method that derives the screen transition (access route) of each session from the access log of a website and computes the longest common subsequence (LCS) of the screen transitions. The LCS of lists X and Y is the longest sequence that is a subsequence of both lists. For example, for X = (A, F, B, D, E) and Y = (A, B, C, D, E), LCS(X, Y) is (A, B, D, E). Patent Document 1 proposes taking the most frequent of the obtained LCSs as the frequent access pattern.
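The LCS definition above can be checked with a short Python sketch. This is the textbook dynamic-programming formulation, not code taken from the cited documents; the function name `lcs` is chosen here for illustration.

```python
def lcs(x, y):
    """Return one longest common subsequence of sequences x and y as a tuple."""
    m, n = len(x), len(y)
    # length[i][j] holds the LCS length of the prefixes x[:i] and y[:j].
    length = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                length[i][j] = length[i - 1][j - 1] + 1
            else:
                length[i][j] = max(length[i - 1][j], length[i][j - 1])
    # Walk back from the bottom-right corner to recover the subsequence itself.
    result = []
    i, j = m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            result.append(x[i - 1])
            i -= 1
            j -= 1
        elif length[i - 1][j] >= length[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return tuple(reversed(result))


# The example from the text.
print(lcs(("A", "F", "B", "D", "E"), ("A", "B", "C", "D", "E")))
# → ('A', 'B', 'D', 'E')
```

Screen IDs are treated as opaque symbols, so any hashable values can stand in for the letters.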

If the tendency (similarity) of access patterns can be analyzed in this way, users' access behavior patterns can be extracted on a website that requires login (for example, a financial website). Once such patterns are extracted, it becomes possible to judge whether access behavior is characteristic of the account holder, and thus to detect unauthorized access through fraudulent login. Such similarity information can also be used for marketing: users' access patterns give hints on how to improve web pages so that they can be used more efficiently.

Patent Document 1: JP 2004-152209 A

Non-Patent Document 1: Seiji Toda, Haruo Yokota, "Evaluation of Scalability Improvement Methods for LCS Analysis of Web Logs," The Database Society of Japan Letters, Vol. 2, No. 3, 2003.
Non-Patent Document 2: Needleman, S.B. and Wunsch, C.D., "A general method applicable to the search for similarities in the amino acid sequence of two proteins," Journal of Molecular Biology, vol. 48, pp. 443-453, 1970.

However, LCSs can be identical when, for example, only the first and last screens of the screen transitions match, so analyzing access pattern tendencies by LCS may fail to express the similarity of screen transitions appropriately. For example, for P = (A, S, F) and Q = (A, T, F), LCS(P, Q) is (A, F). Meanwhile, for P = (A, S, F) and R = (A, V, W, X, Y, Z, F), LCS(P, R) is also (A, F). Thus, identical LCSs do not guarantee that the original screen transitions are similar to each other, which is a problem in LCS-based access pattern analysis.

The present invention has been made in view of this situation and provides a technique capable of appropriately evaluating the similarity between screen transitions for analyzing user behavior patterns on a website.

To solve the above problem, the present invention provides a technique that quantitatively represents user behavior patterns on a website by clustering screen transitions so that similar ones belong to the same cluster and aggregating the frequency of each cluster. Specifically, the access log is converted into screen transition data representing the screen transition of each session. The similarity between screen transition patterns in the screen transition data is then calculated using a string distance computation, and the screen transition patterns are clustered based on that similarity. The clustering information is stored in a storage device and used for access log analysis.

According to the present invention, users' macroscopic screen transition patterns on a website can be grasped precisely and quantitatively. This makes access analysis more efficient, for example building a website suited to users' screen transition patterns, analyzing changes in user behavior before and after a campaign, and detecting users who have logged in while impersonating a legitimate user.

Problems, configurations, and effects other than those described above will become apparent from the following description of embodiments and the accompanying drawings.

FIG. 1 is a diagram showing a schematic configuration of a system (data analysis apparatus) according to an embodiment of the present invention.
FIG. 2 is a diagram showing an example of access log data.
FIG. 3 is a diagram showing an example of session data.
FIG. 4 is a diagram showing an example of screen transition data.
FIG. 5 is a diagram showing an example of user pattern data.
FIG. 6 is a diagram for explaining an example of clustering processing.
FIG. 7 is a flowchart for explaining session data generation processing.
FIG. 8 is a flowchart for explaining screen transition aggregation processing.
FIG. 9 is a flowchart for explaining clustering processing.
FIG. 10 is a flowchart for explaining user pattern aggregation processing.

Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. Note, however, that the embodiments are merely examples for realizing the present invention and do not limit its technical scope. In the drawings, common components are given the same reference numerals.

In the following description, a "program" is used as the grammatical subject. However, because a program performs its defined processing by being executed by a processor while using memory and a communication port (communication control device), the description may equally be read with the processor as the subject. Processing disclosed with a program as the subject may also be regarded as processing performed by a computer or information processing apparatus such as a management server. Part or all of a program may be realized by dedicated hardware, and programs may be modularized. The various programs may be installed on each computer from a program distribution server or from storage media.

<Configuration of the data analysis apparatus>
FIG. 1 is a functional block diagram showing a schematic configuration of a data analysis apparatus according to an embodiment of the present invention. The data analysis apparatus has a central processing unit (processor) 100 that performs the necessary arithmetic and control processing, an input/output device 110 for inputting and outputting data, a program memory 120 that stores the programs required for processing in the central processing unit 100, and a storage device 130 that stores data to be processed by the central processing unit 100 and the processed data.

The input/output device 110 has output devices, such as a display device 111 for displaying data and a printer (not shown), together with a keyboard 112 and a pointing device 113 such as a mouse for performing operations on the displayed data, such as selecting a menu.

The program memory 120 stores a session data generation program 121 that analyzes the access log of the website and converts it into per-session data, a screen transition aggregation program 122 that aggregates the screen transition patterns in the per-session data, a clustering program 123 that clusters the screen transition patterns, and a user pattern aggregation program 124. Each processing program is stored in the program memory 120 as program code, and each process is realized by the central processing unit 100 executing that program code.

The storage device 130 stores access log data 131 holding access logs accumulated in advance, session data 132 holding per-session information extracted from the access log data, screen transition data 133 holding the screen transition patterns found in the session data, and user pattern data 134 holding, for each user, the ratios of the screen transition clustering results. The storage device 130 may be a storage system located remotely and accessed via a network.

The processing programs and data described above can also be provided stored on various recording media such as a CD-ROM, DVD-ROM, MO, floppy (registered trademark) disk, or USB memory.

<Access log data>
FIG. 2 is a diagram showing an example of the access log data 131 in the storage device 130. The access log data stores information recorded when a user accesses each screen of the website whose access log is collected. The session ID 201 is an ID that uniquely identifies a session. The same session ID is assigned from the time the user logs in to the website until the user moves to another website or closes the web browser. The access date and time 202 is the date and time at which the user accessed the screen. The user ID 203 is an ID that uniquely identifies a user; the user ID of the user who accessed the screen is stored. The access screen ID 204 is the ID of the screen the user accessed. An access screen ID is uniquely determined for each screen.

Note that the website to be analyzed is assumed to require, for example, login (user authentication), so that the user ID can be obtained.

<Session data>
FIG. 3 is a diagram showing an example of the session data 132 in the storage device 130. The session ID 301 is the same as the session ID 201 in the access log data 131. The access start date and time 302 is the access date and time of the first screen accessed among the screen transitions of the session. The user ID 303 is the same as the user ID 203 in the access log data 131. The screen transition 304 lists, in order of access, the access screen IDs of the screens visited in the session. For example, (A, B, C) indicates that the screens with access screen IDs "A", "B", and "C" were accessed in that order.

In FIG. 3, the screen transition for session ID d00001234 is (A, B, C), and that for d00001236 is (E, B, C, D). Identical access screen IDs (for example, B and C) denote the same screen, but the entry top pages, A and E, differ. This is because several kinds of entry top page are provided. For example, screen A is the top page for PCs and screen E is the top page for mobile devices.

<Screen transition data>
FIG. 4 is a diagram showing an example of the screen transition data 133 in the storage device 130. No. 401 is an ID that uniquely identifies each record, that is, each screen transition. The screen transition 402, like the screen transition 304 in the session data 132, lists in order of access the access screen IDs of the screens visited in a session. In the screen transition data 133, identical screen transitions are consolidated into one record. The cluster ID 403 is the ID of the cluster to which the screen transition was assigned by clustering. The screen transition data thus represents the correspondence between each original screen transition and the cluster to which it belongs.

<User pattern data>
FIG. 5 is a diagram showing an example of the user pattern data 134 in the storage device 130. The user ID 501 is the same as the user ID 203 in the access log data 131. The cluster ID 502 is the same as the cluster ID 403 in the screen transition data 133; it is obtained by replacing each of the user's screen transitions with the corresponding cluster ID in the screen transition data. The ratio 503 is the ratio of the number of the user's sessions having that cluster ID to the user's total number of sessions.

<Overview of processing in the data analysis apparatus>
The processing performed in the data analysis apparatus configured as above is described next. First, the central processing unit 100 uses the session data generation program 121 to read the access log data 131 from the storage device 130 and generate the session data 132.

Next, the screen transition aggregation program 122 is executed. It reads the session data 132 from the storage device 130, aggregates the screen transitions exhaustively and without duplicates, and stores the resulting screen transition data 133 in the storage device 130. Then the clustering program 123 is executed. Clustering is the process of classifying similar screen transition patterns into the same cluster. The clustering result is written to the screen transition data and stored in the storage device 130.

Subsequently, the user pattern aggregation program 124 is executed. It reads the session data 132 and the screen transition data 133 from the storage device 130, groups the session data by user, replaces each user's screen transitions with the clustering results, calculates the ratio of each cluster to each user's number of sessions, and stores the result in the storage device 130 as the user pattern data 134. Each process is described in detail below.
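As a rough illustration of the user pattern aggregation outlined above, the following Python sketch computes, for each user, the share of sessions falling into each cluster. The record layout (dictionaries with `user_id` and `transition` keys), the function name, and the sample user ID and cluster assignments are all assumptions made for this sketch; the source only specifies the columns of FIG. 3 to FIG. 5.

```python
from collections import Counter, defaultdict


def user_pattern_ratios(sessions, transition_to_cluster):
    """Return rows of (user_id, cluster_id, ratio) in the spirit of FIG. 5.

    sessions: iterable of dicts with "user_id" and "transition" (hypothetical layout).
    transition_to_cluster: dict mapping a transition tuple to its cluster ID.
    """
    per_user = defaultdict(Counter)
    for session in sessions:
        # Replace the session's screen transition with its cluster ID.
        cluster_id = transition_to_cluster[tuple(session["transition"])]
        per_user[session["user_id"]][cluster_id] += 1

    rows = []
    for user_id, counts in per_user.items():
        total = sum(counts.values())  # the user's total number of sessions
        for cluster_id, n in counts.items():
            rows.append({"user_id": user_id, "cluster_id": cluster_id, "ratio": n / total})
    return rows


# Hypothetical sample: one user with four sessions, three of them in cluster C1.
clusters = {("A", "B", "C"): "C1", ("E", "B", "C", "D"): "C2"}
sessions = [
    {"user_id": "u001", "transition": ("A", "B", "C")},
    {"user_id": "u001", "transition": ("A", "B", "C")},
    {"user_id": "u001", "transition": ("E", "B", "C", "D")},
    {"user_id": "u001", "transition": ("A", "B", "C")},
]
for row in user_pattern_ratios(sessions, clusters):
    print(row)
```

A user whose recent sessions show a cluster mix far from the stored ratios would be a candidate for the impersonation check mentioned earlier, although the source does not specify how that comparison is made.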

<Session data generation processing>
FIG. 7 is a flowchart for explaining the session data generation processing executed by the session data generation program 121. The acting subject here is the session data generation program. This processing generates session data like that in FIG. 3 from access log data like that in FIG. 2.

In step 701, the access log data 131 is read.

In step 702, an unprocessed session ID is extracted. Because steps 703 to 706 operate on one session ID at a time, a session ID that has not yet undergone this processing is selected.

In step 703, the records having the session ID extracted in step 702 are extracted.

In step 704, the access date and time of the oldest record among those extracted in step 703 is obtained and used as the access start date and time.

In step 705, data listing the access screen IDs in ascending order of access date and time (screen transition data) is generated from the records extracted in step 703. For example, if step 703 yields three records with session ID "d00001234" whose access screen IDs are, from oldest to newest, "A", "B", and "C", the screen transition data is (A, B, C).

In step 706, the data obtained, namely the session ID, access start date and time, user ID, and screen transition, is stored as one record of the session data. This processing is performed for every session ID.

In step 707, it is determined whether all session IDs have been processed. If true, the processing ends; if false, the process returns to step 702.
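Steps 701 to 707 amount to grouping log records by session and ordering them by time. The Python sketch below shows one way this might be done; the dictionary keys, the function name, and the sample user ID and timestamps are assumptions for illustration, and only the session ID "d00001234" with screens A, B, C comes from the text.

```python
from collections import defaultdict
from datetime import datetime


def build_session_data(access_log):
    """Turn access-log records (FIG. 2 style) into session records (FIG. 3 style)."""
    # Steps 702-703: collect the records belonging to each session ID.
    by_session = defaultdict(list)
    for record in access_log:
        by_session[record["session_id"]].append(record)

    sessions = []
    for session_id, records in by_session.items():
        ordered = sorted(records, key=lambda r: r["accessed_at"])
        sessions.append({
            "session_id": session_id,
            "started_at": ordered[0]["accessed_at"],  # step 704: oldest access
            "user_id": ordered[0]["user_id"],
            # Step 705: screen IDs in ascending order of access time.
            "transition": tuple(r["screen_id"] for r in ordered),
        })  # step 706: one record per session
    return sessions


# Hypothetical log rows, deliberately out of order.
log = [
    {"session_id": "d00001234", "accessed_at": datetime(2015, 3, 1, 10, 0, 30), "user_id": "u001", "screen_id": "B"},
    {"session_id": "d00001234", "accessed_at": datetime(2015, 3, 1, 10, 0, 0), "user_id": "u001", "screen_id": "A"},
    {"session_id": "d00001234", "accessed_at": datetime(2015, 3, 1, 10, 1, 0), "user_id": "u001", "screen_id": "C"},
]
print(build_session_data(log)[0]["transition"])
# → ('A', 'B', 'C')
```

The explicit loop over unprocessed session IDs in the flowchart is replaced here by a single grouping pass, which yields the same session records.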

<Screen transition aggregation processing>
FIG. 8 is a flowchart for explaining the screen transition aggregation processing executed by the screen transition aggregation program 122. The acting subject here is the screen transition aggregation program. This processing generates screen transition data like that in FIG. 4 from session data like that in FIG. 3. Note that screen transitions are aggregated for each predetermined period (for example, every three months).

In step 801, the session data 132 is read.

In step 802, an unprocessed record of the session data is read.

In step 803, it is determined whether the screen transition in the record read in step 802 is already registered in the screen transition data 133. If it is registered, the process proceeds to step 805; otherwise, it proceeds to step 804.

In step 804, the screen transition in the record read in step 802 is registered in the screen transition data 133. That is, No. and the screen transition are stored as one record of the screen transition data 133. No. is assigned an arbitrary unique number. At this point the cluster ID holds no data (a null value).

In step 805, it is determined whether all records of the session data 132 have been processed. If not, the process returns to step 802; if so, the processing ends.
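Steps 801 to 805 deduplicate the screen transitions and leave the cluster ID empty. A minimal Python sketch follows, assuming the same hypothetical session record layout (a `transition` key) and using sequential numbers for No., which the source leaves arbitrary.

```python
def aggregate_transitions(sessions):
    """Build screen transition records (FIG. 4 style) with one row per distinct transition."""
    table = []
    seen = set()
    for session in sessions:                 # step 802: next unprocessed record
        transition = tuple(session["transition"])
        if transition in seen:               # step 803: already registered
            continue
        seen.add(transition)
        table.append({                       # step 804: register a new record
            "no": len(table) + 1,            # any unique number would do
            "transition": transition,
            "cluster_id": None,              # filled in later by clustering
        })
    return table


# The two transitions quoted from FIG. 3, one of them repeated.
sample = [
    {"transition": ("A", "B", "C")},
    {"transition": ("E", "B", "C", "D")},
    {"transition": ("A", "B", "C")},
]
for row in aggregate_transitions(sample):
    print(row)
```

Restricting `sessions` to a given period before calling the function would correspond to the per-period aggregation mentioned above.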

<Clustering processing>
FIG. 9 is a flowchart for explaining the clustering processing executed by the clustering program 123. The acting subject here is the clustering program. This processing clusters the screen transition data shown in FIG. 4 on the basis of the screen transition 402 and determines the cluster ID 403.

FIG. 6 is a schematic diagram for explaining the clustering processing. The following description follows the flowchart of FIG. 9, using the examples of FIG. 4 and FIG. 6.

In step 901, the screen transition data is read. As an example, assume that the data (1) to (5) in FIG. 4 are stored. Also assume that the screen structure is defined as shown in FIG. 6(a). In FIG. 6(a), each rectangle represents a screen, and the character attached to it is the screen ID. Lines between screens represent links.

In step 902, the similarity between clusters is obtained. In the initial state, every screen transition is a singleton, that is, a cluster consisting of that screen transition alone. The similarity between clusters is computed for all pairs, as shown in FIG. 6(b), according to equation (1).

S(a, b) = Len(LCS(a, b)) / Dist(a, b)   ... (1)

In equation (1), S(a, b) is the similarity between screen transition a and screen transition b. LCS(a, b) is the LCS of screen transitions a and b. Len(a) is the number of screens constituting screen transition a. Dist(a, b) is the string distance between screen transitions a and b. A string distance expresses, as a distance, how much two strings differ; examples include the Levenshtein distance and the Jaro-Winkler distance. Any string distance may be chosen, but this embodiment is described using the Levenshtein distance.

One technique for obtaining the Levenshtein distance is DP matching, described in Non-Patent Document 2. DP matching is a matching method that non-linearly stretches and compresses two objects (characters, DNA, speech, and so on) against each other and takes the best-aligned state as the matching result. As measures of the mismatch between objects, DP matching uses a penalty Pz for a shift and a penalty Pn for a mismatch. Empirically, the penalties are preferably set so that Pz:Pn is 1:3. Applied to screen transitions, with Pz = 1, Pn = 3, a = (A, B, C), and b = (A, B, D), for example, Dist(a, b) = 3.

Applying equation (1) to the earlier P = (A, S, F), Q = (A, T, F), and R = (A, V, W, X, Y, Z, F) gives Dist(P, Q) = 3 and Dist(P, R) = 19, so S(P, Q) = 2/3 and S(P, R) = 2/19, which allows more precise matching than LCS alone. The weakness of LCS for precise matching is that it ignores mismatched screens no matter how many the screen transitions contain; combining it with a string distance makes those ignored mismatched screens count as a penalty. When a cluster contains multiple screen transitions, separate processing is required, which is described later.
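The worked values can be reproduced with the Python sketch below. The source does not spell out the DP matching recurrence, so the one used here is an assumption: every aligned cell pays Pn when its two screens differ, and every horizontal or vertical move additionally pays Pz. It was chosen because it yields the three Dist values quoted in the text (a plain unit-cost Levenshtein distance would give 5 for Dist(P, R), not 19). The function names and the handling of identical sequences are likewise choices made for this sketch.

```python
import math


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def dp_distance(a, b, pz=1, pn=3):
    """DP matching cost under the assumed recurrence (see the lead-in)."""
    if not a or not b:
        raise ValueError("both screen transitions must be non-empty")
    g = [[math.inf] * len(b) for _ in range(len(a))]
    for i in range(len(a)):
        for j in range(len(b)):
            d = 0 if a[i] == b[j] else pn          # mismatch penalty for this cell
            if i == 0 and j == 0:
                g[i][j] = d
                continue
            best = math.inf
            if i > 0 and j > 0:
                best = min(best, g[i - 1][j - 1])  # diagonal move, no shift
            if i > 0:
                best = min(best, g[i - 1][j] + pz)  # vertical move, shift penalty
            if j > 0:
                best = min(best, g[i][j - 1] + pz)  # horizontal move, shift penalty
            g[i][j] = best + d
    return g[-1][-1]


def similarity(a, b):
    """Equation (1): Len(LCS(a, b)) / Dist(a, b)."""
    dist = dp_distance(a, b)
    if dist == 0:
        # Only identical sequences reach distance 0; treating them as
        # infinitely similar is an assumption, not stated in the source.
        return math.inf
    return lcs_length(a, b) / dist


P, Q, R = ("A", "S", "F"), ("A", "T", "F"), ("A", "V", "W", "X", "Y", "Z", "F")
print(dp_distance(P, Q), dp_distance(P, R))  # → 3 19
print(similarity(P, Q), similarity(P, R))    # 2/3 and 2/19 as floats
```

Because the screen transition data holds each distinct transition only once, the zero-distance branch should not be reached during clustering.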

Returning to the flowchart, in step 903 the maximum similarity S_max is selected from the inter-cluster similarities obtained. In FIG. 6(b), the value 1.67 for cluster pair (4)-(5) is the maximum, so S_max is set to this value.

In step 904, it is determined whether S_max is greater than or equal to the similarity threshold S_T. S_T can be changed according to how similar screen transitions must be to be regarded as similar. In this embodiment, S_T = 0.8. If the result is true, the process proceeds to step 905; if false, it proceeds to step 906.

In step 905, the cluster pair found in step 903 to have the maximum similarity is merged. That is, the screen transitions in the two clusters are placed in the same cluster and given the same cluster ID. In the example of FIG. 6, the pair (4)-(5) is given the cluster ID C1.

The process returns to step 902, and the inter-cluster similarities are computed again. In the first pass all clusters were singletons, but in the second pass some clusters contain multiple screen transitions, so the similarity is computed differently. Specifically, the similarity is computed for every pair of screen transitions drawn from the two clusters, and the minimum is taken as the similarity between the clusters. For example, to obtain the similarity between (1) and C1, the similarity between (1) and each screen transition in C1 is computed, and the minimum is taken as the similarity between (1) and C1. Since C1 contains (4) and (5), the similarities of (1)-(4) and (1)-(5) are computed. The similarity of (1)-(4) is 0.06 and that of (1)-(5) is 0.15; the minimum, 0.06, is the similarity between (1) and C1. Adopting the minimum similarity among the screen transitions in the clusters ensures that all screen transitions within a cluster have at least a certain similarity to one another, which enables precise clustering. In this way, the cluster pair similarities are obtained as shown in FIG. 6(c).

In step 903, S_max is obtained. In the case of FIG. 6(c), the similarity of (1)-(2), 1.0, is the maximum.

In step 904, it is determined whether S_max = 1.0 is greater than or equal to S_T = 0.8. Since this is true, (1) and (2) are merged and given a cluster ID that uniquely identifies the cluster; in this example, C2 is assigned.

Thereafter, the loop is repeated until the determination in step 904 is false, that is, until Smax is less than S_T. When it is false, the process proceeds to step 906, where cluster IDs are assigned to the singleton screen transitions, that is, to the screen transitions that were not placed in the same cluster as any other screen transition in steps 902 through 905. In the example of FIG. 6, (3) is a singleton, so this step assigns it the cluster ID C3.
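
The loop over steps 902 to 905 can be sketched as below. This is an illustrative sketch under stated assumptions: the function name `agglomerate` and the similarity table are hypothetical, only the values for (1)-(2), (1)-(4), and (1)-(5) come from the text, and the remaining pair values (0.9 for (4)-(5), 0.1 for all others) are invented for the example. The sketch returns the groups and leaves ID assignment to a later step, so the order in which groups are formed may differ from the order shown in FIG. 6.

```python
from itertools import combinations, product

def agglomerate(items, pair_similarity, threshold):
    """Merge the most similar pair of clusters until Smax drops below the threshold."""
    clusters = [[item] for item in items]  # every item starts as a singleton
    while len(clusters) > 1:
        # Step 902: cluster-pair similarity is the minimum over cross-cluster pairs.
        scored = [
            (min(pair_similarity(a, b)
                 for a, b in product(clusters[i], clusters[j])), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        ]
        # Step 903: find the largest cluster-pair similarity (Smax).
        s_max, i, j = max(scored, key=lambda t: t[0])
        # Step 904: stop once Smax is below the threshold S_T.
        if s_max < threshold:
            break
        # Step 905: merge the pair (j > i, so removing j leaves index i valid).
        clusters[i] = clusters[i] + clusters.pop(j)
    return clusters  # remaining singletons correspond to step 906

# Values for (1)-(2), (1)-(4), (1)-(5) are from the text; the rest are hypothetical.
table = {frozenset({1, 2}): 1.0, frozenset({1, 4}): 0.06,
         frozenset({1, 5}): 0.15, frozenset({4, 5}): 0.9}

def lookup(a, b):
    return table.get(frozenset({a, b}), 0.1)  # 0.1 is an assumed default

clusters = agglomerate([1, 2, 3, 4, 5], lookup, threshold=0.8)
print(clusters)  # [[1, 2], [3], [4, 5]]
```

With the threshold S_T = 0.8, (1) and (2) end up together, (4) and (5) end up together, and (3) remains a singleton, which matches the grouping in the worked example.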

In step 907, the cluster information obtained in this way, that is, the cluster ID of each screen transition, is written to the screen transition data and stored in the storage device 130. FIG. 4 shows the screen transition data finally obtained in this example.

<User pattern aggregation processing>
FIG. 10 is a flowchart for explaining the user pattern aggregation process executed by the user pattern aggregation program 124. The operating subject here is the user pattern aggregation program. Using screen transition data such as that shown in FIG. 4, the user pattern aggregation process obtains the ratio of each cluster ID for each user from session data such as that shown in FIG. 3, and generates user pattern aggregation data such as that shown in FIG. 5.

In step 1001, the session data and the screen transition data are read.

In step 1002, the session data is aggregated for each user. That is, using the user ID as a key, the records of the same user are extracted from the session data and grouped per user.

In step 1003, the session group of an unprocessed user is selected from the session groups aggregated per user.

In step 1004, the cluster ID corresponding to the screen transition of each session in the session group selected in step 1003 is obtained by referring to the screen transition data, and each screen transition is replaced with its cluster ID.

In step 1005, the ratio of each cluster ID to the number of sessions in the session group is obtained, and the cluster IDs and their ratios are stored in the user pattern data. For example, if the session group contains 100 sessions and cluster C1 occurs 12 times, the ratio is 0.12.

In step 1006, it is determined whether the session data of all users has been processed. If the result is false, the process returns to step 1003; if true, the process ends.
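
Steps 1002 to 1006 can be sketched as follows. This is a minimal sketch under assumptions: the function name `user_pattern_ratios`, the representation of a screen transition as a tuple of screen names, and all sample data are hypothetical and are not taken from FIG. 3 to FIG. 5.

```python
from collections import Counter, defaultdict

def user_pattern_ratios(sessions, transition_to_cluster):
    """sessions: iterable of (user_id, screen_transition) pairs.
    Returns {user_id: {cluster_id: ratio}}."""
    per_user = defaultdict(list)
    for user_id, transition in sessions:  # step 1002: group sessions by user ID
        per_user[user_id].append(transition)

    result = {}
    for user_id, transitions in per_user.items():  # steps 1003 and 1006: each user
        # Step 1004: replace each screen transition with its cluster ID.
        cluster_ids = [transition_to_cluster[t] for t in transitions]
        # Step 1005: ratio of each cluster ID to the user's number of sessions.
        counts = Counter(cluster_ids)
        total = len(cluster_ids)
        result[user_id] = {cid: n / total for cid, n in counts.items()}
    return result

# Hypothetical screen transition data (screen transition -> cluster ID).
transition_to_cluster = {
    ("Login", "Menu", "Logout"): "C1",
    ("Login", "Search", "Logout"): "C2",
    ("Login", "Logout"): "C3",
}
# Hypothetical session data: one (user ID, screen transition) pair per session.
sessions = [
    ("u1", ("Login", "Menu", "Logout")),
    ("u1", ("Login", "Menu", "Logout")),
    ("u1", ("Login", "Search", "Logout")),
    ("u1", ("Login", "Logout")),
    ("u2", ("Login", "Logout")),
]
ratios = user_pattern_ratios(sessions, transition_to_cluster)
print(ratios["u1"])  # {'C1': 0.5, 'C2': 0.25, 'C3': 0.25}
```

The same arithmetic gives the figure quoted in step 1005: 12 occurrences of C1 in 100 sessions yields 12 / 100 = 0.12.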

<Summary>
In the data analysis apparatus according to the embodiment of the present invention, an access log of a website is acquired in advance. The access log records information from the time a user logs in to the target website until the user leaves it. The access log is converted into per-session data (session data), which records the transitions between the screens the user accessed within the same session. Next, the data analysis apparatus exhaustively obtains the types of screen transition from the session data. The data analysis apparatus then obtains the similarity between screen transitions based on the session data and clusters the screen transitions according to that similarity. Because the similarity is obtained using not only LCS but also an inter-string distance calculation (DP matching, as one example), the user's macro-level screen transition tendencies can be matched more precisely. The data analysis apparatus then groups the session data by user and converts each of the user's screen transitions into a cluster ID by referring to the screen transition data. Finally, it obtains the ratio of each cluster ID to the user's number of sessions and generates the user pattern data. The resulting user pattern holds the ratios obtained after the screen transitions accessed by the user have been replaced with cluster IDs. The clustering result obtained in this way absorbs minor differences between screen transitions and represents screen transition patterns from a more macro viewpoint. It is therefore possible to grasp more easily and accurately what screen transitions users have made on the website.
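
The two building blocks named in the summary, LCS and an inter-string distance obtained by dynamic programming, can be sketched as below. This is only an illustration under assumptions: the Levenshtein distance stands in for the DP matching mentioned as an example, each character of a string stands for one screen, and the way the two measures are combined into the final similarity is defined elsewhere in the description and is not reproduced here.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend a common subsequence on a match, else carry the best so far.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def edit_distance(a, b):
    """Levenshtein distance: insertion, deletion, and substitution each cost 1."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute or match
        prev = curr
    return prev[-1]

# Hypothetical screen transitions, one letter per screen.
print(lcs_length("ABCD", "ABD"))     # 3
print(edit_distance("ABCD", "ABD"))  # 1
```

Both functions accept any sequences, so a screen transition could equally be passed as a list or tuple of screen identifiers instead of a string.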

Note that the present invention is not limited to the embodiment exactly as described; at the implementation stage, the constituent elements can be modified and embodied without departing from the gist of the invention. Various inventions can also be formed by appropriately combining the constituent elements disclosed in the embodiment. For example, some constituent elements may be deleted from the full set shown in the embodiment. Furthermore, constituent elements of different embodiments may be combined as appropriate.

In addition, some or all of the configurations, functions, processing units, processing means, and the like described in the embodiment may be realized in hardware, for example by designing them as an integrated circuit. The above configurations, functions, and the like may also be realized in software, with a processor interpreting and executing a program that realizes each function. Information such as the programs, tables, and files that realize each function can be stored in a recording or storage device such as a memory, a hard disk, or an SSD (Solid State Drive), or in a recording or storage medium such as an IC card, an SD card, or a DVD.

Furthermore, the control lines and information lines shown in the above embodiment are those considered necessary for the explanation, and not all control lines and information lines of a product are necessarily shown. All of the components may be connected to one another.

100 Central processing unit (processor)
110 Input/output device
111 Display device
112 Keyboard
113 Mouse
120 Program memory
121 Session data generation program
122 Screen transition aggregation program
123 Clustering program
124 User pattern aggregation program
130 Storage device
131 Access log data
132 Session data
133 Screen transition data
134 User pattern data

Claims

A data analysis apparatus for analyzing an access log, comprising:
a processor that executes a program for analyzing the access log; and
a storage device that stores management information for enabling analysis of the access log,
wherein the processor executes:
a process of converting, from the access log, data representing the screen transition of each session into screen transition data;
a process of calculating a similarity between screen transition patterns in the screen transition data using an inter-string distance calculation, and clustering the screen transition patterns based on the similarity; and
a process of storing information on the clustering in the storage device.

The data analysis apparatus according to claim 1,
wherein the processor further executes a process of applying a result of the clustering process to the screen transitions in the session group of each user, and aggregating the ratio of the screen transitions after clustering.

The data analysis apparatus according to claim 1,
wherein, in the clustering process, the processor clusters the screen transition patterns for each user.

The data analysis apparatus according to claim 1,
wherein, in the clustering process, the processor classifies a pair of screen transition patterns into one cluster when the similarity of the pair is equal to or greater than a predetermined value.

The data analysis apparatus according to claim 4,
wherein, in the clustering process, when no pair of screen transition patterns remains whose similarity is equal to or greater than the predetermined value, the processor classifies each screen transition pattern other than those already classified into a cluster as a cluster of its own.

The data analysis apparatus according to claim 1,
wherein the screen transition pattern includes the screen transitions in the same session.

The data analysis apparatus according to claim 1,
wherein the access log is an access log for a website that requires user authentication.

A data analysis method for analyzing an access log, comprising:
a step in which a processor that executes a program for analyzing the access log converts, from the access log, data representing the screen transition of each session into screen transition data;
a step in which the processor calculates a similarity between screen transition patterns in the screen transition data using an inter-string distance calculation, and clusters the screen transition patterns based on the similarity; and
a step in which the processor stores information on the clustering in a storage device that stores management information for enabling analysis of the access log.

The data analysis method according to claim 8, further comprising:
a step in which the processor applies a result of the clustering step to the screen transitions in the session group of each user, and aggregates the ratio of the screen transitions after clustering.

The data analysis method according to claim 8,
wherein, in the clustering step, the processor clusters the screen transition patterns for each user.

The data analysis method according to claim 8,
wherein, in the clustering step, the processor classifies a pair of screen transition patterns into one cluster when the similarity of the pair is equal to or greater than a predetermined value.

The data analysis method according to claim 11,
wherein, in the clustering step, when no pair of screen transition patterns remains whose similarity is equal to or greater than the predetermined value, the processor classifies each screen transition pattern other than those already classified into a cluster as a cluster of its own.

The data analysis method according to claim 8,
wherein the screen transition pattern includes the screen transitions in the same session.

The data analysis method according to claim 8,
wherein the access log is an access log for a website that requires user authentication.