TW201839633A

TW201839633A - Method of analyzing a URL to generate a user profile having a profile analysis by using chi-square test and variance analysis techniques to find out significant attributes

Info

Publication number: TW201839633A
Application number: TW106112719A
Authority: TW
Inventors: 廖界信; 簡志丞
Original assignee: 中華電信股份有限公司
Priority date: 2017-04-17
Filing date: 2017-04-17
Publication date: 2018-11-01
Also published as: TWI626549B

Abstract

The invention relates to a method of analyzing URLs to generate a user profile. A plurality of users is selected from a URL browsing log according to an application category. Usage information, based on the number, times, or period of the users' browsing, is used to obtain a weight value corresponding to the application category. Scores for the users' records are calculated from the weight value and sorted to find the high-usage user group. Profile analysis then applies a chi-square test and analysis of variance to attributes related to that group, to find the significant attributes among them and depict the user profile of the application category.

Description

Method of analyzing a URL to generate a user profile

The present invention relates to a method of generating a user profile, and in particular to a method that selects an application category in order to analyze the profile of users who link to or browse particular URL (Uniform Resource Locator) addresses.

With the spread of mobile devices and broadband networks, people rely more and more on the convenience of the Internet. As they obtain increasingly diverse services online, they also generate a large volume of usage records. From a service provider's point of view, meeting the business challenges posed by potential competitors and shifting customer preferences calls for better ways to offer products and services that match customer needs. Marketing has accordingly shifted from mass marketing and product-differentiation marketing toward targeted marketing. Targeted marketing requires a considerable amount of data about specific customers, and in the Internet era the records that customers leave behind when they use the network have become a highly valuable subject of analysis.

The prior art already includes several methods that make effective use of such data. One is the keyword-oriented method, which tracks behavior on a particular customer device and analyzes the content of the network resources the customer browses, extracting information such as content keywords, sources, and creation times across multiple websites, and then establishing matches between the campaigns to be marketed and the web keywords.

Other similar prior-art methods analyze user characteristics, such as the number of website visits, time spent, and page content, and use a ranking algorithm to compute weighted averages over combinations of those characteristics, in order to decide which combination of user characteristics should determine the different results returned to the customer.

However, current technology still lacks a technique that, for a specific category or attribute, mines from the data the significant attributes that represent customers, so as to profile them more comprehensively.

To remedy the deficiencies of the prior art, the present invention provides a user profile description method that combines big data with statistical analysis. Its primary object is to provide a process by which the operator of the method can use defined URLs (Uniform Resource Locators) to analyze and build the profile of users in a particular domain and to describe the features of that profile.

A secondary object of the invention is to identify the attributes on which higher-usage users in a particular user domain differ from the remaining users to a statistically significant degree. The results can be used to construct the significant attributes of a user group, for example basic user attributes, usage status, interests and preferences, and service subscriptions, which together depict a concrete profile. That profile can serve as a basis for commercial marketing planning, and can help business units understand the characteristics of target users so that they can present suitable marketing offers more effectively when contacting customers and plan more appropriate campaign content.

Based on the above objects, the invention proposes a method of analyzing URLs to generate a user profile. The method can be executed by a profile analysis server that is able to obtain users' raw browsing logs, and includes at least the following steps. An application category is selected, and a user group is selected, according to the application category, from a raw browsing log that contains a plurality of users. The profile analysis server may connect to a network to obtain the raw browsing log. The application category is a predefined category of URL (Uniform Resource Locator) addresses whose content is homogeneous or similar, and it contains at least one representative URL address of such content. Users who are recorded in the raw browsing log as having browsed a URL address of the application category are selected into the user group. The raw browsing log may be structured data stored with a big-data computing technology such as the Hadoop Hive distributed database, and is stored in a distributed architecture in a fixed format, for example ("user ID", "browse date and time", "browse URL"). This greatly reduces the amount of data to be analyzed, so that the method can be processed on a single machine with statistical computing capability.
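The selection step can be illustrated with a minimal sketch, assuming log rows arrive as ("user ID", "browse date and time", "browse URL") tuples and that a category is represented by a set of host names. The names `VIDEO_CATEGORY_HOSTS`, `host_of`, and `select_user_group` are hypothetical; the two hosts are the representative addresses named in the embodiment, and the sample rows are invented. A production system would run the equivalent filter inside the distributed store rather than in a Python loop.

```python
from urllib.parse import urlsplit

# Hypothetical category definition: representative hosts named in the embodiment.
VIDEO_CATEGORY_HOSTS = {"api.3g.tudou.com", "kankan.1kxun.com"}

def host_of(url):
    """Return the lowercase host of a URL, tolerating entries without a scheme."""
    parts = urlsplit(url if "//" in url else "//" + url)
    return (parts.hostname or "").lower()

def select_user_group(log_rows, category_hosts):
    """Collect the IDs of users with at least one visit to a category host."""
    return {user_id for user_id, _timestamp, url in log_rows
            if host_of(url) in category_hosts}

# Invented sample rows in the (user ID, browse date and time, browse URL) format.
log_rows = [
    ("u1", "2015-01-03 20:15:00", "http://api.3g.tudou.com/v4/play"),
    ("u2", "2015-01-03 21:00:00", "https://example.com/news"),
    ("u3", "2015-01-04 09:30:00", "kankan.1kxun.com/list"),
]
user_group = select_user_group(log_rows, VIDEO_CATEGORY_HOSTS)
print(sorted(user_group))
```

Matching on the host alone is a simplification; a category could equally be defined by URL prefixes or patterns.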

The usage information with which each user in the user group linked to or browsed the URL addresses of the application category is analyzed to generate a user record. The usage information includes, for those URL addresses, the user's continuous browsing dates (Start Date & End Date or Date Duration), browsing days (Days Count), browsing hours (Hours Count), number of visits (Access Count), browsing period (Access Frequency), and number of visits per date category. A date category is the meaning a date may carry; for example, a date may be a holiday, a workday, or part of a run of consecutive holidays.
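As a rough illustration of how such a user record might be assembled, the sketch below aggregates per-user visits into the listed variables. It rests on several assumptions not stated in the source: Days Count is read as the number of distinct days with a visit, Hours Count as the number of distinct (day, hour) slots, and the date category is reduced to weekend versus workday because no holiday calendar is given. All function and field names are hypothetical, and Access Frequency is omitted because its definition is not specified.

```python
from collections import defaultdict
from datetime import datetime

def date_category(day):
    """Simplified stand-in: weekend days count as 'holiday', the rest as 'workday'."""
    return "holiday" if day.weekday() >= 5 else "workday"

def build_user_records(visits):
    """Aggregate (user_id, timestamp) visits into per-user usage information."""
    grouped = defaultdict(list)
    for user_id, ts in visits:
        grouped[user_id].append(datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"))
    records = {}
    for user_id, stamps in grouped.items():
        days = {s.date() for s in stamps}
        hours = {(s.date(), s.hour) for s in stamps}
        per_category = defaultdict(int)
        for s in stamps:
            per_category[date_category(s.date())] += 1
        records[user_id] = {
            "start_date": min(days),
            "end_date": max(days),
            "days_count": len(days),
            "hours_count": len(hours),
            "access_count": len(stamps),
            "access_by_date_category": dict(per_category),
        }
    return records

# Invented visits to category URLs.
visits = [
    ("u1", "2015-01-03 20:15:00"),  # Saturday
    ("u1", "2015-01-03 20:45:00"),  # same day, same hour
    ("u1", "2015-01-05 08:10:00"),  # Monday
    ("u3", "2015-01-04 09:30:00"),  # Sunday
]
records = build_user_records(visits)
print(records["u1"])
```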

Machine training of a variable analysis is performed for the application category, using URL addresses that are known positive or negative examples in the video application category, to obtain a weight value corresponding to the application category. The training takes the usage information used to generate the user record (such as continuous browsing dates, browsing period, and number of visits per date category) as input variables, and analyzes and trains on them with a neural network or logistic regression model, among others, to produce the weight value.
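The weight training can be sketched with a plain logistic regression, which is equivalent to a single-layer network with a sigmoid output. This is only an illustrative stand-in under stated assumptions: the source does not give the training procedure, so batch gradient descent, the learning rate, the epoch count, the feature scaling, and the toy data are all invented here, as are the names `sigmoid` and `train_weights`.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_weights(samples, labels, lr=0.1, epochs=2000):
    """Fit logistic regression by batch gradient descent; return (weights, bias)."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, labels):
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for j, v in enumerate(x):
                grad_w[j] += error * v
            grad_b += error
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(samples)
    return weights, bias

# Invented toy data: [days_count, hours_count, access_count], each scaled to 0..1.
samples = [[0.9, 0.8, 0.9], [0.8, 0.9, 0.7], [0.7, 0.6, 0.8],
           [0.1, 0.2, 0.1], [0.2, 0.1, 0.3], [0.3, 0.2, 0.2]]
labels = [1, 1, 1, 0, 0, 0]  # 1 = positive example, 0 = negative example
weights, bias = train_weights(samples, labels)
print(weights, bias)
```

The learned per-variable weights play the role of the weight value for the category; in practice an established statistics or machine-learning library would replace the hand-written loop.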

A score is calculated for each user record according to the weight value, and the scores are sorted. At least one high-usage user is then obtained from the user group according to the sorted scores; the at least one user is selected from the users whose scores, sorted in descending order, fall within a top percentage (Top N).
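A minimal sketch of the scoring and Top N selection follows, assuming the score is a weighted sum of the usage variables. The source does not state the exact scoring formula, so the weighted sum, the weight values, the records, and the names `score` and `top_percent` are illustrative assumptions.

```python
import math

def score(record, weights):
    """Weighted sum of a user's usage variables."""
    return sum(weights[name] * record[name] for name in weights)

def top_percent(records, weights, percent):
    """Return (user_id, score) pairs for the top `percent` of users by score."""
    ranked = sorted(((uid, score(rec, weights)) for uid, rec in records.items()),
                    key=lambda pair: pair[1], reverse=True)
    keep = max(1, math.ceil(len(ranked) * percent / 100))
    return ranked[:keep]

# Invented weights and user records.
weights = {"days_count": 2.0, "hours_count": 1.0, "access_count": 0.5}
records = {
    "u1": {"days_count": 5, "hours_count": 9, "access_count": 20},
    "u2": {"days_count": 1, "hours_count": 1, "access_count": 2},
    "u3": {"days_count": 3, "hours_count": 4, "access_count": 6},
    "u4": {"days_count": 2, "hours_count": 2, "access_count": 2},
    "u5": {"days_count": 7, "hours_count": 12, "access_count": 30},
}
heavy_users = top_percent(records, weights, percent=40)
print(heavy_users)  # [('u5', 41.0), ('u1', 29.0)]
```

Rounding the cut-off up with `math.ceil` and keeping at least one user is a design choice made for this sketch; ties at the threshold are broken by sort order.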

Profile analysis is performed on the at least one user to generate a user profile corresponding to the application category. The profile analysis statistically analyzes attributes assigned in advance to the application category to produce at least one significant attribute of the at least one user, and further determines the correlation of the at least one significant attribute with the at least one user, so that the user profile is described by the at least one significant attribute. The attributes assigned in advance may include general user attributes such as gender, age, and education level. The operator of the method may also define the attributes to be investigated, that is, attributes describing the noteworthy representative behaviors of high-usage users. For example, if the purpose of using the invention is to find potential high-usage video users, attributes may be defined for whether the user is a paying user, the subscribed speed tier, the recent average bill amount, data traffic used, transmission periods, number of roaming base stations, and so on.

The statistical analysis used in the profile analysis may include a chi-square test, which analyzes the categorical variables in the cross-analysis and outputs those with a significantly large χ² (chi-square) value as the at least one significant attribute. It may also include an analysis of variance (ANOVA), which analyzes the continuous variables in the cross-analysis and outputs those with a significantly large F-test value as the at least one significant attribute.
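The two statistics can be computed directly from their textbook definitions, as in the sketch below. The counts and ages are invented for illustration and are not the patent's data; the function names are hypothetical. The source does not state a significance threshold, so the comparison against 3.841 (the standard 5% critical value of χ² with one degree of freedom) is an assumption of this example.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

def anova_f(groups):
    """One-way ANOVA F statistic for a list of numeric groups."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Invented counts: rows = high-usage / other users, columns = female / male.
gender_table = [[60, 40], [40, 60]]
# Invented ages: first group = high-usage users, second group = other users.
ages = [[22, 25, 31, 28], [44, 39, 52, 47]]

chi2_value = chi_square(gender_table)  # 8.0 for these counts
f_value = anova_f(ages)
print(chi2_value, chi2_value > 3.841, f_value)
```

Categorical attributes such as gender go through the chi-square statistic and continuous attributes such as age go through the F statistic, mirroring the split described in the text.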

The profile analysis may use a logistic regression model to determine whether the correlation between the at least one significant attribute and the at least one user is positive or negative, in order to find how the significant attribute affects the high-usage-user variable. For example, finding a positive or negative correlation between the age of users subscribing to a service and the high-usage user group makes it possible to describe whether high-usage users tend to be middle-aged or young. The model used to determine the sign of the correlation may be any existing or available model, and is not limited to logistic regression.

In addition, the steps above may be executed again with another application category selected, to generate user profiles for other application categories.

With the method of the invention, the operator can select, through the profile analysis server, an application category of predefined URLs (Uniform Resource Locators) and analyze the raw browsing log to build the profile of particular users, and can further tell whether higher-usage users differ from other users on particular attributes to a statistically significant degree.

S100~S300‧‧‧Steps

110‧‧‧Database of the raw browsing log

120‧‧‧User records for the video application category

210‧‧‧Positive and negative example data

S220~S230‧‧‧Steps

240‧‧‧High-usage user list

S310~S340‧‧‧Steps

350‧‧‧Significant attributes

360‧‧‧User profile

FIG. 1 is a flowchart of the main steps of the invention; FIG. 2 is a schematic diagram of the detailed steps of the browsing-log URL analysis; FIG. 3 is a schematic diagram of the detailed steps of the score calculation and ranking; FIG. 4 is a schematic diagram of a first example; FIG. 5 is a schematic diagram of the detailed steps of the profile analysis; FIG. 6 is a schematic diagram of a second example; FIG. 7 is a schematic diagram of a third example; FIG. 8 is a schematic diagram of a fourth example; and FIG. 9 is a schematic diagram of a fifth example.

The invention is further described below by way of an embodiment, with reference to the drawings.

First, refer to FIG. 1, a flowchart of the main steps of the invention. It comprises three steps in order: S100, browsing-log URL analysis; S200, score calculation and ranking; and S300, user profile description. As described in the summary, step S100 first selects from the raw browsing log the user group that matches a particular application category, together with that group's usage information, such as continuous browsing dates, browsing time, and browsing period. Step S200 then calculates and ranks scores for the users in the group according to the weight values of that usage information, to find the high-usage users. Step S300 performs profile analysis to describe the profile of users matching the application category; this profile can be used to plan marketing or campaigns aimed at users of that category.

In this embodiment, the raw browsing log being analyzed contains the network connection records of a plurality of users. It covers January to February of 2015, eight weeks in total.

The profile analysis server of the method may use Hadoop Hive to filter a user group out of the raw browsing log according to a predefined video application category. In this embodiment, the video application category contains URL addresses for video content predefined by the operator, and the method uses them to match the connection records in the raw browsing log that fit this application category. In this embodiment, about 500 GB of raw data is collected per day.

Refer to FIG. 2, a schematic diagram of the detailed steps of step S100, browsing-log URL analysis. In the embodiment above, the representative URL addresses of video applications in the predefined video application category are extracted according to the operator's selection, for example api.3g.tudou.com of Tudou (土豆網) and kankan.1kxun.com of Qianxun (千尋). The server filters out of the raw browsing log database 110 the users who have browsed these addresses, and analyzes, as input variables, their usage information: continuous browsing dates (Start Date & End Date or Date Duration), browsing days (Days Count), browsing hours (Hours Count), number of visits (Access Count), browsing period (Access Frequency), and number of visits per date category. This produces the user records 120 for the video application category.

Next, refer to FIG. 3, a schematic diagram of the detailed steps of step S200, score calculation and ranking. Note that before this step is performed, the server has already executed step S220, weight analysis, for a plurality of application categories (including the video application category), producing a weight value for each. For the video application category, the weight analysis feeds the positive and negative training examples 210 for the video category into a neural network model for single-layer model training. The resulting model contains a weight for each of the input variables described above, and these weights are the weight value of the video application category.

The server then combines the user records 120 of the video application category with the weight values of the category's input variables and executes step S230, user score calculation. The calculation yields a score in the video application category for each user who browsed the addresses. The scores are sorted in descending order, and the high-usage user list 240 is formed from the users whose usage falls in the top N percent, where N is chosen by the operator. Refer to the first example in FIG. 4. In this embodiment, users view video content over either the mobile network or the broadband network, so the operator may take as positive samples the mobile-network users in the top 20% of total scores (13 points or more) and the broadband users in the top 15% (32 points or more). Summary statistics of the usage distribution, such as the mean and median, can also be consulted: for mobile-network users the mean is 9.0 points and the median 3 points, and for broadband users the mean is 14.4 points and the median 10 points.

Refer to FIG. 5, a schematic diagram of the detailed steps of step S300, profile analysis. The profile analysis takes the high-usage user list 240 as input and performs step S310, which sets the plurality of attributes to be analyzed for the video application category. These include general user attributes such as gender, age, and education level. The operator may also set attributes as needed. For example, to find potential high-usage video users, attributes may be set for whether the user is a paying user, the subscribed speed tier, the recent average bill amount, download and upload data traffic, main transmission periods, and the number of base stations used while roaming.

After the attributes to be analyzed have been set for the high-usage user list 240 in step S310, the server performs cross-analysis. One part is the chi-square test of step S320, which analyzes categorical attribute variables; an attribute with a markedly large χ² value can be selected as a significant attribute 350 of the users. For example, as shown in the second example in FIG. 6, among high-usage users who frequently watch popular videos, women are the majority, numbering 1.24 times the men, whereas among the other users the ratio of women to men is only 0.87. Likewise in step S320, the second attribute selected as a significant attribute 350 concerns the users' geographic distribution. Looking at the distribution of counties and cities where high-usage users of video applications live, a higher proportion of mobile users living in the northern metropolitan area and in the central and southern special municipalities are paying high-usage users of video applications; the proportion of paying users in Taipei, New Taipei, Taoyuan, Hsinchu, Taichung, and Tainan is above the average in each case.

The server also executes the analysis of variance (ANOVA) of step S330 to analyze continuous attribute variables, selecting user attributes with a markedly large F-test value as significant attributes 350. For example, in this embodiment the operator wants to know whether the user's age has a significant effect on whether the user is a high-usage mobile user. In step S340, which analyzes whether a significant attribute is positively or negatively correlated with the users, the logistic regression algorithm yields a regression coefficient of -0.006774. The negative value indicates that user age is negatively related to membership in the high-usage group. Looking further at the age distribution, shown in the third example in FIG. 7, the average age of high-usage users of the video application category (37.9) is lower than that of other users (45.1), and high-usage users show the highest proportion in the 18-to-24 age range. This is the third significant attribute 350.
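To make the sign and size of that coefficient concrete, the following sketch converts it into an odds ratio. This reading assumes, beyond what the source states, that the coefficient is a log-odds coefficient on age measured in years with no other rescaling; the function names are hypothetical.

```python
import math

AGE_COEFFICIENT = -0.006774  # regression coefficient reported in the embodiment

def direction(coefficient):
    """Classify the sign of a regression coefficient."""
    if coefficient > 0:
        return "positive"
    if coefficient < 0:
        return "negative"
    return "none"

def odds_ratio(coefficient, delta):
    """Multiplicative change in the odds when the attribute rises by `delta`."""
    return math.exp(coefficient * delta)

per_year = odds_ratio(AGE_COEFFICIENT, 1)     # about 0.993
per_decade = odds_ratio(AGE_COEFFICIENT, 10)  # about 0.93
print(direction(AGE_COEFFICIENT), per_year, per_decade)
```

Under that assumption, each additional year of age would lower the odds of being a high-usage user by a little under 1%, and ten years by roughly 6.5%, which is consistent in direction with the younger average age reported for high-usage users.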

Likewise, the analysis of variance of step S330 yields a fourth significant attribute 350, concerning user mobility. As shown in the fourth example in FIG. 8, statistics such as travel distance and calling patterns indicate that high-usage users tend to move around a lot and to have larger social and living circles. On weekdays, high-usage users travel 151 km farther than other users; on weekends, 55 km farther. This suggests that high-usage users may be mostly commuters who watch videos to pass the time on trains, subways, or buses. High-usage users also show higher values for the number of counties and cities they call from, the number of base stations used for calls, total call minutes, total number of call recipients, and off-network call recipients, so they are clearly a more socially active group.

In this embodiment, the analysis of variance of step S330 also yields a fifth significant attribute 350, concerning spending. High-usage users are willing to spend more on handsets, 1.5 times the average of other users, and their average handset replacement cycle is shorter, 0.8 times the average of other users. As for subscribed plans, 13.83% of unlimited-data ("all-you-can-eat") plan users are high-usage users, far above the 5.74% among users without such a plan. Over the most recent three months, the average bill of high-usage users is 1.7 times that of other users, and data services and value-added services also make up a larger share of their bills.

In this embodiment, the chi-square test of step S320 also yields a sixth significant attribute 350, concerning the user's mobile service plan: among high-usage broadband users, the combination of the company's services that they have purchased is observed. Users who already lease a mobile number from a given carrier can be flagged so that the carrier's customer service can see this attribute immediately when contacting them and recommend the carrier's discounted video offers when the user buys a handset or signs a new contract. For users who have not leased a mobile number from the carrier, a number-portability service bundled with a video plan can be prepared, to attract high-usage users to buy more of the carrier's services and products.

In this embodiment, the chi-square test of step S320 also yields a seventh significant attribute 350, concerning the user's Internet access speed. As shown in the fifth example in FIG. 9, a higher proportion of users who subscribe to high-speed access are high-usage users, and their downlink data volume is also markedly higher, 1.5 times that of other users. In addition, paying high-usage video users have a higher average data volume per connection than non-paying high-usage video users, 104% higher downlink and 68% higher uplink. Their average connection time is also longer, and their number of connections lower.

Finally, the analysis of variance of step S330 yields an eighth significant attribute 350, the geographic distribution of broadband video users. High-usage video users are concentrated in northern Taiwan, including Keelung, Taipei, New Taipei, Taoyuan, and Hsinchu, which together account for 44.59% of all users. It can be summarized that roughly one in five high-usage video viewers there is a paying user (21% to 26.97% of users in the region).

Thus, in this embodiment, the method of the invention finds eight significant attributes 350 for the high-usage users of the video application category, and these eight attributes can be used to describe the user profile 360 of those users.

The invention is thus a method by which the operator can run a series of association analyses on the records of URLs browsed by users, according to the application category to be analyzed, and thereby depict a user profile. The method can efficiently uncover potential customers and so achieve the benefits of precision marketing.

Compared with the prior art, the invention has the following advantages:

1. It provides a method of finding the user profile of a particular group through operator-defined URL addresses.

2. It proposes a user score calculation method that includes a weight analysis. The weight analysis can train a single-layer model for each application category with a neural network to produce a weight for each input variable, giving the score calculation better weight settings, so that the score calculation can be applied across multiple application domains while maintaining good user discrimination. Logistic regression or another algorithm may be used in place of the neural network.

3. It provides a method of finding the user profile of a specific group. Hundreds of predefined user attributes, or attributes defined by the operator, can be set and cross-analyzed with the chi-square test and analysis of variance to find the attributes on which high-usage users differ statistically significantly from other users, so that a suitable marketing plan can be designed from the resulting profile.

4. It uses a logistic regression algorithm to determine whether each significant attribute is positively or negatively correlated with high-usage users.

5. Combining current big data computing technology with the method of the present invention solves the problem that a traditional single machine has difficulty processing a large number of users or data accumulated over a long period, and provides a better user profile analysis method than the prior art.
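
The weight value analysis and score calculation described in advantage 2 can be sketched as follows. This is a minimal illustration, not the implementation of the invention. It assumes that each user record is a list of three usage features (number, times, period) already scaled to the range 0 to 1, that the training labels mark records from known positive (1) or negative (0) URL addresses, and that the score is a weighted sum of the features. The names train_weights, score, and top_percent, as well as all sample data, are hypothetical. Logistic regression fitted by gradient descent stands in here for the single-layer model.

```python
import math


def train_weights(records, labels, epochs=500, lr=0.1):
    """Fit logistic-regression weights by batch gradient descent."""
    n_features = len(records[0])
    weights = [0.0] * n_features
    bias = 0.0
    m = len(records)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(records, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability
            err = p - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias


def score(record, weights):
    """Weighted sum of the usage features (assumed score formula)."""
    return sum(w * x for w, x in zip(weights, record))


def top_percent(user_records, weights, percent):
    """Return the user IDs in the top `percent` of scores, highest first."""
    ranked = sorted(user_records,
                    key=lambda uid: score(user_records[uid], weights),
                    reverse=True)
    keep = max(1, math.ceil(len(ranked) * percent / 100))
    return ranked[:keep]


# Illustrative training data: two positive and two negative records.
train_x = [[0.9, 0.8, 0.7], [0.8, 0.9, 0.9], [0.1, 0.2, 0.1], [0.2, 0.1, 0.3]]
train_y = [1, 1, 0, 0]
weights, bias = train_weights(train_x, train_y)

users = {"u1": [0.9, 0.8, 0.7], "u2": [0.1, 0.2, 0.1],
         "u3": [0.8, 0.9, 0.9], "u4": [0.2, 0.1, 0.3]}
high_usage = top_percent(users, weights, 50)
```

Sorting by score and keeping a top percentage mirrors the selection of high-usage users. The percentage threshold is a parameter that the operator would choose for each application category.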

In summary, the present invention is innovative in its technical concept and provides multiple effects not achieved by the prior art. It fully satisfies the statutory requirements of novelty and inventive step for an invention patent, and this application is therefore filed in accordance with the law. The Office is respectfully requested to grant this invention patent application so as to encourage invention.

Claims

A method of analyzing a URL to generate a user profile, the method comprising the steps of: selecting a user group from an original browsing log according to an application category, wherein the application category is a predefined class of URL (Uniform Resource Locator) addresses having homogeneous or similar content; analyzing usage information, such as the number, times, or period with which the user group links to or browses the URL addresses of the application category, to generate a user record; performing machine training on the usage information corresponding to known positive or negative URL addresses to obtain a weight value corresponding to the application category; calculating a score for the user record according to the weight value and sorting the scores, so as to obtain at least one high-usage user from the user group according to the sorted scores; and performing profile analysis on the at least one user to generate a user profile corresponding to the application category, wherein the profile analysis sets a plurality of attributes to be analyzed for the application category, statistically analyzes the attributes to produce at least one significant attribute of the at least one user, and further analyzes the correlation of the at least one significant attribute with respect to the at least one user, so that the at least one significant attribute describes the user profile.

The method of claim 1, wherein the weight value is produced by performing variable analysis training on the usage information with a neural network or a logistic regression model.

The method of claim 1, wherein the at least one user is selected as a percentage of the users, counted from the highest score of the user records downward.

The method of claim 1, wherein the statistical analysis used in the profile analysis comprises at least a chi-square test process, the chi-square test process being used to analyze categorical variables in the cross-analysis and to output the attributes having significantly larger χ² values as the at least one significant attribute.

The method of claim 1, wherein the statistical analysis used in the profile analysis comprises at least an analysis of variance (ANOVA) process, the ANOVA process being used to analyze continuous variables in the cross-analysis and to output the attributes having significantly larger F-test values as the at least one significant attribute.

The method of claim 1, wherein the profile analysis uses a logistic regression model to determine whether the at least one significant attribute is positively or negatively correlated with the at least one user.