TWI714213B

TWI714213B - User type prediction system and method thereof

Info

Publication number: TWI714213B
Application number: TW108128909A
Authority: TW
Inventors: 陳俊宏; 林宜佳; 楊少夫; 江易倫
Original assignee: 東方線上股份有限公司
Priority date: 2019-08-14
Filing date: 2019-08-14
Publication date: 2020-12-21
Also published as: TW202107301A

Abstract

A user type prediction system includes a data transmission module, a data classification module, a user type tag module, and a user type prediction module. The data transmission module is configured to receive first browsing domain data of complete data users and corresponding webpage content, and receive second browsing domain data of a partial data user. The data classification module is configured to classify the first browsing domain data and the corresponding webpage content, and generate classification results. The user type tag module is configured to generate a type tag according to the first browsing domain data and the classification results. The user type prediction module is configured to compare the second browsing domain data with the first browsing domain data to generate a first comparison result, and label the partial data user according to the first comparison result to predict the user type of the partial data user.

Description

User type prediction system and method

The present invention relates to a user type prediction system and method, and in particular to a system and method for predicting the lifestyle type of users for whom only partial user data is available.

In an era in which big data is widely used to analyze consumer behavior, telecom operators have begun analyzing the content of the web pages their users browse in order to understand each user's type and to offer further services, such as pushed advertising. Telecom operators judge a consumer's lifestyle preferences from the text and keywords on the pages the user actually visits, so they need a sufficient amount of web page data to gauge the strength of each keyword from how often the user encounters it. Analysis based on a limited amount of data risks overgeneralization, so a sufficient amount of web content is required, for example at least about 30 browsed web pages per month.

However, the amount of data available to telecom operators for analysis is quite limited, because some web content cannot be obtained: pages that require a personal login, URLs whose content cannot be accessed without permission (such as Facebook, Gmail...), content that cannot be crawled because of token-based security protocols, high-risk websites (such as pornography or violence) that do not comply with telecom information security regulations, and redirect or relay pages with no substantive content. In an analysis of the network behavior of 100,000 users, users who browse a sufficient amount of web page text each month account for about 10% of all users, users who have domain browsing records but insufficient monthly web page text account for about 75%, and the remaining users have no Internet-related records. With this amount of data it is difficult for telecom operators to make an accurate analysis, so this is a problem that needs to be addressed.

In view of this, one aspect of the present invention is to provide a user type prediction system that solves the problems of the prior art. The user type prediction system predicts the user type of a partial data user who uses a network service and thereby generates a plurality of second browsing domain data. The system includes a data transmission module, a data classification module, a user type tag module, and a user type prediction module. The data transmission module receives a plurality of first browsing domain data of a plurality of complete data users and a plurality of web content corresponding to the first browsing domain data, and receives the second browsing domain data of the partial data user. The data classification module is connected to the data transmission module; it classifies the first browsing domain data and the corresponding web content, and generates a plurality of classification results respectively corresponding to the first browsing domain data. The user type tag module is connected to the data classification module; it generates type tags for the complete data users according to their first browsing domain data and the corresponding classification results. The user type prediction module is connected to the user type tag module; it compares the second browsing domain data of the partial data user with the first browsing domain data of the complete data users to generate a first comparison result, and attaches at least one of the type tags to the partial data user according to the first comparison result, thereby predicting the user type of the partial data user.

The user type prediction module can compare the first browsing domain data with the second browsing domain data. When the composition of the second browsing domain data falls within the coverage of the first browsing domain data of a first complete data user among the complete data users, the user type tag module attaches the type tag corresponding to the first complete data user to the partial data user, thereby predicting the user type of the partial data user.

The user type prediction module may further include a prediction model that compares the second browsing domain data with the first browsing domain data and, according to the first comparison result, attaches at least one of the type tags to the partial data user. The prediction model can use the K-nearest neighbors (KNN) algorithm for the comparison calculation.

The K of the K-nearest neighbors (KNN) algorithm can be 1.

The prediction model can be updated by machine learning based on the first browsing domain data of the complete data users and the web content corresponding to the first browsing domain data.

The user type prediction system may further include a user database connected to the user type prediction module. The user database stores a plurality of user data of the complete data users and the partial data user. The user type prediction module compares the user data of the complete data users with the user data of the partial data user to generate a second comparison result, and corrects each type tag according to the second comparison result so as to correct the user type of the partial data user, thereby improving the accuracy of the predicted user type.

Another aspect of the present invention is to provide a user type prediction method that solves the problems of the prior art. The method predicts the user type of a partial data user who uses a network service and thereby generates a plurality of second browsing domain data. The method includes the following steps: receiving a plurality of first browsing domain data and a plurality of corresponding web content of a plurality of complete data users, and the second browsing domain data of the partial data user; classifying the first browsing domain data and the corresponding web content, and generating a plurality of classification results respectively corresponding to the first browsing domain data; generating type tags for the complete data users according to their first browsing domain data and the corresponding classification results; comparing the second browsing domain data of the partial data user with the first browsing domain data of the complete data users to generate a first comparison result; and attaching at least one of the type tags to the partial data user according to the first comparison result, thereby predicting the user type of the partial data user.

The step of predicting the user type of the partial data user further includes the following sub-step: when the first comparison result indicates that the composition of the second browsing domain data falls within the coverage of the first browsing domain data of a first complete data user among the complete data users, attaching the type tag corresponding to the first complete data user to the partial data user to predict the user type of the partial data user.

After the step of predicting the user type of the partial data user, the method further includes the following steps: comparing a plurality of user data of the complete data users with the user data of the partial data user to generate a second comparison result; and correcting the type tags according to the second comparison result so as to correct the user type of the partial data user, thereby improving the accuracy of the predicted user type.

In the step of comparing the first browsing domain data with the second browsing domain data, the K-nearest neighbors (KNN) algorithm is used for the comparison calculation.

Compared with the prior art, the user type prediction system and method of the present invention have the following advantages: 1. The prediction model is built from a small number of users, which reduces the computational load. 2. The system overcomes the limitation that existing crawling can obtain only part of the data: building on existing crawling, it analyzes the user types of the complete data users, who account for 10% of all users, and uses them to build a prediction model that predicts the user types of the partial data users, who account for 75% of all users, thereby effectively broadening applications such as pushed advertising.

1‧‧‧User type prediction system

11‧‧‧Data transmission module

12‧‧‧Data classification module

13‧‧‧User type tag module

14‧‧‧User type prediction module

141‧‧‧Prediction model

15, 21‧‧‧User database

2‧‧‧Telecom database

S1-S7‧‧‧Steps

S51‧‧‧Sub-step

FIG. 1 is a functional block diagram of a user type prediction system according to a specific embodiment of the present invention.

FIG. 2 is a functional block diagram of a user type prediction system according to another specific embodiment of the present invention.

FIG. 3 is a flowchart of the steps of a user type prediction method according to a specific embodiment of the present invention.

FIG. 4 is a flowchart of the steps of a user type prediction method according to another specific embodiment of the present invention.

FIG. 5 is a flowchart of the steps of a user type prediction method according to yet another specific embodiment of the present invention.

So that the advantages, spirit, and features of the present invention can be understood more easily and clearly, embodiments are described and discussed in detail below with reference to the accompanying drawings. These embodiments are only representative embodiments of the present invention. The invention can be implemented in many different forms and is not limited to the embodiments described in this specification. Rather, these embodiments are provided to make the disclosure of the present invention more thorough and complete.

The terms used in the various embodiments disclosed herein serve only to describe specific embodiments and are not intended to limit those embodiments. As used herein, singular forms also include plural forms unless the context clearly indicates otherwise. Unless otherwise defined, all terms used in this specification (including technical and scientific terms) have the same meanings as commonly understood by those of ordinary skill in the art to which the disclosed embodiments belong. Such terms (for example, terms defined in commonly used dictionaries) are to be interpreted as having the same meaning as their contextual meaning in the relevant technical field, and are not to be interpreted in an idealized or overly formal sense unless clearly so defined in the disclosed embodiments.

Compared with the prior art, the user type prediction system 1 and method of the present invention use the text and keywords on the pages a user actually browses as the basis for assigning user type tags. The system 1 and method first classify the complete data users, who account for 10% of all users and have a sufficient amount of crawled web content, and attach type tags to them to determine their user types. The analysis results for the complete data users are then used as seed points and compared against the partial data users, who account for 75% of all users and have insufficient data, in order to predict the user types of those partial data users.

Please refer to FIG. 1, which is a functional block diagram of a user type prediction system 1 according to a specific embodiment of the present invention. In this embodiment, the user type prediction system 1 includes a data transmission module 11, a data classification module 12, a user type tag module 13, and a user type prediction module 14. The data transmission module 11 is connected to the telecom database 2 of a telecom operator and receives, from the telecom database 2, a plurality of first browsing domain data of a plurality of complete data users and a plurality of web content corresponding to the first browsing domain data, as well as a plurality of second browsing domain data of a partial data user. The first browsing domain data and web content are generated by the complete data users, and the second browsing domain data by the partial data user, when they use the network service. The data classification module 12 is connected to the data transmission module 11; it classifies the first browsing domain data and the corresponding web content, and generates a plurality of classification results respectively corresponding to the first browsing domain data. The user type tag module 13 is connected to the data classification module 12; it generates type tags for the complete data users according to their first browsing domain data and the corresponding classification results. The user type prediction module 14 is connected to the user type tag module 13; it compares the second browsing domain data of the partial data user with the first browsing domain data of the complete data users to generate a first comparison result, and attaches at least one of the type tags to the partial data user according to the first comparison result, thereby predicting the user type of the partial data user. In practice, the data transmission module 11, the data classification module 12, the user type tag module 13, and the user type prediction module 14 can be implemented on the central processing unit or system processing chip of a computer or server.

More specifically, the data classification module 12 may include a classification dictionary. The first browsing domain data and corresponding web content generated when the complete data users use the network service provided by the telecom operator are crawled, segmented into words, and analyzed by word frequency, and the content of each web page is compared against the classification dictionary to determine its category. For example, tennis is classified as sports, Michael Jordan as sports stars, lipstick as beauty, and president as politics. The user type tag module 13 then attaches type tags to the complete data users according to the composition of their classification results. For example, a user whose classification results include makeup, feminine care, and feminine products is given the type tag "female"; other lifestyle type tags relate to age, gender, attitudes, and preferences. The user type prediction module 14 may further include a prediction model 141. The prediction model 141 is built from the first browsing domain data of the complete data users and the web content corresponding to the first browsing domain data, and is updated by machine learning. The prediction model 141 uses the complete data users as seed points and compares the first browsing domain data of the complete data users with the second browsing domain data of the partial data user to generate the first comparison result.
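The dictionary-based classification and tagging described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the dictionary entries, the tag rules, and the function names `classify_page` and `tag_user` are all hypothetical, and whitespace splitting stands in for the word segmentation that real (especially Chinese) web text would need.

```python
from collections import Counter

# Hypothetical classification dictionary: keyword -> category.
CATEGORY_DICTIONARY = {
    "tennis": "sports",
    "lipstick": "beauty",
    "president": "politics",
    "foundation": "makeup",
    "skincare": "feminine care",
}

# Hypothetical tag rules: a type tag applies when every listed category
# appears among the user's classification results.
TAG_RULES = {
    "female": {"makeup", "feminine care"},
    "sports fan": {"sports"},
}


def classify_page(text):
    """Return the category whose keywords occur most often, or None."""
    words = text.lower().split()  # stand-in for real word segmentation
    counts = Counter(
        CATEGORY_DICTIONARY[w] for w in words if w in CATEGORY_DICTIONARY
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def tag_user(pages):
    """Attach type tags based on the composition of page categories."""
    categories = {c for c in map(classify_page, pages) if c is not None}
    return sorted(
        tag for tag, required in TAG_RULES.items() if required <= categories
    )


pages = [
    "lipstick foundation foundation sale",
    "skincare routine skincare",
    "tennis open tennis final",
]
tags = tag_user(pages)
```

Here each page contributes one category, and the set of categories across a user's pages determines which tags apply.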

In practice, when the first comparison result shows that the composition of the plurality of second browsing domain data falls within the coverage of the first browsing domain data of a first complete data user among the complete data users, that is, when the composition of the partial data user's second browsing domain data is similar to the composition of the first complete data user's first browsing domain data, the user type tag module 13 attaches the type tag corresponding to the first complete data user to the partial data user, thereby predicting the user type of the partial data user.
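One way to picture this composition-based comparison is a nearest-neighbor lookup with K=1 over each user's share of visits per domain. The sketch below is an assumption-laden illustration: the source does not specify a distance measure, so Euclidean distance is assumed, and the function names and example domains are hypothetical.

```python
import math


def composition(domains):
    """Turn a list of visited domains into a {domain: share of visits} map."""
    total = len(domains)
    shares = {}
    for d in domains:
        shares[d] = shares.get(d, 0.0) + 1.0 / total
    return shares


def distance(a, b):
    """Euclidean distance between two compositions (an assumed metric)."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def predict_tags(partial_domains, complete_users):
    """Return the type tags of the single nearest complete data user (K=1).

    complete_users is a list of (visited_domains, type_tags) pairs.
    """
    target = composition(partial_domains)
    nearest = min(
        complete_users,
        key=lambda user: distance(target, composition(user[0])),
    )
    return nearest[1]


# Hypothetical seed points: complete data users with known type tags.
complete_users = [
    (["news.example", "sports.example", "sports.example"], ["sports fan"]),
    (["beauty.example", "shop.example"], ["female"]),
]
predicted = predict_tags(["sports.example", "news.example"], complete_users)
```

The partial data user inherits the tags of whichever complete data user has the most similar mix of domains.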

In addition to the prediction approach above, the prediction model 141 can assign each second browsing domain data that falls within the coverage of a first browsing domain data the same classification result as that first browsing domain data; the user type tag module 13 then attaches type tags to the partial data user according to these classification results. Finally, the prediction model 141 predicts the user type of the partial data user from the distribution of the user's type tags. Of these two prediction processes, the first compares the composition of a plurality of browsing domain data to predict the user type of the partial data user, while the second compares individual browsing domain data. Those of ordinary skill in the art can choose whichever approach suits their needs.
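The second, per-domain approach might look like the sketch below. It assumes, as a simplification, that "falling within the coverage" means an exact domain match; the function names and example records are hypothetical, and the final mapping from a category distribution to type tags is left out because the source does not specify it.

```python
from collections import Counter


def build_domain_categories(complete_records):
    """Map each domain to its most frequent category among complete data users.

    complete_records is an iterable of (domain, category) pairs.
    """
    per_domain = {}
    for domain, category in complete_records:
        per_domain.setdefault(domain, Counter())[category] += 1
    return {d: c.most_common(1)[0][0] for d, c in per_domain.items()}


def category_distribution(partial_domains, domain_categories):
    """Share of each category among the partial data user's known domains."""
    known = [
        domain_categories[d] for d in partial_domains if d in domain_categories
    ]
    counts = Counter(known)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: n / total for category, n in counts.items()}


# Hypothetical classification results from complete data users.
records = [
    ("sports.example", "sports"),
    ("sports.example", "sports"),
    ("sports.example", "news"),
    ("beauty.example", "beauty"),
]
mapping = build_domain_categories(records)
shares = category_distribution(
    ["sports.example", "sports.example", "beauty.example", "unknown.example"],
    mapping,
)
```

Domains never seen among complete data users are ignored, so the distribution reflects only the domains the model can classify.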

Several machine learning methods were tried for the prediction model 141: K-nearest neighbors (KNN), decision tree, and logistic regression. For each, the overall browsing data collected by a telecom operator in a given month was divided into a training set and a test set for comparison and evaluation. The evaluation metrics are defined as follows: Precision = number of identified users who truly meet the condition / number of users identified as meeting the condition; Recall = number of identified users who truly meet the condition / number of users in the data who truly meet the condition; Accuracy = (number of users correctly identified as meeting the condition + number of users correctly identified as not meeting the condition) / total number of users; F1-score = 2 / (1/Recall + 1/Precision) = (2 × Precision × Recall) / (Precision + Recall). Precision measures exactness: the percentage of data identified as positive that is actually positive. Recall is the percentage of actual positive data that is correctly identified. The drawback of accuracy is that it reveals nothing about the underlying distribution of the test data and therefore cannot account for class imbalance. The F1-score is the harmonic mean of precision and recall.
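The four metrics can be sketched in code as follows. This is a minimal illustration that assumes binary labels, where True means the user meets the condition (for example, "female"); the function name `evaluate` and the sample labels are hypothetical and not taken from the source.

```python
def evaluate(actual, predicted):
    """Compute precision, recall, accuracy, and F1 for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # true positives
    fp = sum(1 for a, p in pairs if not a and p)      # false positives
    fn = sum(1 for a, p in pairs if a and not p)      # false negatives
    tn = sum(1 for a, p in pairs if not a and not p)  # true negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "f1": f1,
    }


# Made-up labels for five users, for illustration only.
actual = [True, True, True, False, False]
predicted = [True, True, False, True, False]
metrics = evaluate(actual, predicted)
```

With these sample labels there are two true positives, one false positive, one false negative, and one true negative, so precision, recall, and F1 all equal 2/3 while accuracy equals 3/5.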

Parameter testing was performed with the K-nearest neighbors (KNN) algorithm using K = 1, 3, 5, 7, and 9. The data was split 7:3 into a training set and a test set, and the evaluation was repeated 10 times (CV=10), yielding the following test data.
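The test procedure can be pictured with the sketch below, which repeats a random 7:3 split ten times and averages a score for each K. Several things here are assumptions: the data is synthetic and well separated, squared Euclidean distance is used, and plain accuracy stands in for the per-class F1-scores the source reports. The function names are hypothetical.

```python
import random
from collections import Counter


def knn_predict(train, point, k):
    """Majority label among the k training points closest to `point`."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], point)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def repeated_holdout(data, ks=(1, 3, 5, 7, 9), repeats=10, seed=0):
    """Average test accuracy per K over repeated random 7:3 splits."""
    rng = random.Random(seed)
    totals = {k: 0.0 for k in ks}
    for _ in range(repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)
        cut = len(shuffled) * 7 // 10  # 70% training, 30% test
        train, test = shuffled[:cut], shuffled[cut:]
        for k in ks:
            hits = sum(
                1
                for features, label in test
                if knn_predict(train, features, k) == label
            )
            totals[k] += hits / len(test)
    return {k: total / repeats for k, total in totals.items()}


# Synthetic toy data: two well-separated clusters, for illustration only.
toy_rng = random.Random(42)
toy = [((toy_rng.uniform(-1, 1), toy_rng.uniform(-1, 1)), "A") for _ in range(20)]
toy += [
    ((10 + toy_rng.uniform(-1, 1), 10 + toy_rng.uniform(-1, 1)), "B")
    for _ in range(20)
]
scores = repeated_holdout(toy)
```

On real browsing data the scores would differ across K, which is what makes the sweep informative; the toy clusters here only demonstrate the mechanics.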

Finally, as shown in the table above, a comprehensive evaluation based on the F1-score shows that with K=1 the predictions for males and the middle-aged are worse than with K=3, 5, 7, or 9, but only slightly, while the predictions for females, the young, and the elderly are better. The metrics obtained with K=1 are therefore the most favorable.

Logistic regression was tested under the same conditions, with the data split 7:3 into a training set and a test set and the evaluation repeated 10 times (CV=10).

In terms of the F1-score, logistic regression predicts males and the middle-aged better than the nearest neighbor algorithm, but the difference between the two does not exceed 16%. For females, the young, and the elderly, however, the results of logistic regression differ from those of the nearest neighbor algorithm by more than 500%. The nearest neighbor algorithm therefore performs better than logistic regression overall. In summary, in the preferred embodiment, the prediction model 141 uses the nearest neighbor algorithm (KNN) with K=1 for the comparison calculation.

To improve the accuracy of the results predicted by the user type prediction system 1, the system may further include a user database 15. The user data of the complete data users and the partial data users stored in the user database 15 is used to further correct the user types that the prediction model 141 predicts for the partial data users. In the embodiment of FIG. 1, the user database 15 is connected to the user type prediction module 14. The user type prediction module 14 compares the user data of the complete data users with the user data of the partial data user to generate a second comparison result, and corrects each type tag according to the second comparison result, thereby correcting the user type of the partial data user. Please also refer to FIG. 2, which is a functional block diagram of a user type prediction system according to another specific embodiment of the present invention. Unlike the embodiment of FIG. 1, the user database 21 of the embodiment of FIG. 2 is located in the telecom database 2, which saves storage space in the user type prediction system. Components of FIG. 2 that are the same as those of FIG. 1 are not described again.

In detail, the user data includes the user's gender, age, occupation, education level, and so on. The prediction model 141 can further apply machine learning to the user data of the complete data users together with their first browsing domain data and web content, to find the relationship between the user data and the first browsing domain data and web content. For partial data users whose user type has already been predicted, the user type prediction system 1 uses the prediction model 141 to compute the relationship between their user data and their second browsing domain data, and corrects their user type accordingly.

Please refer to FIG. 3, which is a flowchart of the steps of a user type prediction method according to a specific embodiment of the present invention. As shown in FIG. 3, the user type prediction method includes the following steps. Step S1: receive a plurality of first browsing domain data and a plurality of corresponding web content of a plurality of complete data users, and the second browsing domain data of a partial data user. Step S2: classify the first browsing domain data and the corresponding web content, and generate a plurality of classification results respectively corresponding to the first browsing domain data. Step S3: generate type tags for the complete data users according to their first browsing domain data and the corresponding classification results. Step S4: compare the second browsing domain data of the partial data user with the first browsing domain data of the complete data users to generate a first comparison result. Step S5: attach at least one of the type tags to the partial data user according to the first comparison result, thereby predicting the user type of the partial data user. The user type prediction method can be carried out with the user type prediction system described above, so content already covered there is not repeated here.

In detail, please refer to FIG. 4, which is a flowchart of the steps of a user type prediction method according to another specific embodiment of the present invention. As shown in FIG. 4, step S5 further includes sub-step S51: when the first comparison result indicates that the composition of the second browsing domain data falls within the coverage of the first browsing domain data of a first complete data user among the complete data users, attach the type tag corresponding to the first complete data user to the partial data user to predict the user type of the partial data user. The embodiment of FIG. 3 uses the first prediction approach described above, taking the composition of the plurality of second browsing domain data as the basis for comparison against the composition of the plurality of first browsing domain data. When the composition of the second browsing domain data is found to fall within the coverage of a first complete data user among the complete data users, the type tag of the partial data user can be considered the same as that of the first complete data user. The preferred embodiment of the present invention uses the K-nearest neighbors (KNN) algorithm with K=1 to define the coverage of the first browsing domain data. However, those of ordinary skill in the art may also perform the comparison with the second prediction approach described above; the invention is not limited in this respect.

To improve the accuracy of the user type prediction method of the present invention, the method further includes correction steps. Please refer to FIG. 5, which is a flowchart of the steps of a user type prediction method according to yet another specific embodiment of the present invention. As shown in FIG. 5, the method further includes, after step S5, step S6: compare a plurality of user data of the complete data users with the user data of the partial data user to generate a second comparison result; and step S7: revise the type tag according to the second comparison result so as to correct the user type of the partial data user, thereby improving the accuracy of the prediction of the user type of the partial data user. In detail, the user type prediction method of the present invention first compares the browsing domain data of the users to predict the user type of the partial data user, and then applies a further correction based on the user data to ensure the accuracy of the prediction.
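One possible reading of steps S6 and S7 is sketched below. The specification does not say which user data fields are compared or how the tag is revised, so the exact-match rule, the `min_support` threshold, the profile fields `age_group` and `plan`, and the function `correct_tag` are all hypothetical assumptions.

```python
from collections import Counter

def correct_tag(predicted, partial_profile, complete_users, min_support=2):
    """Steps S6 and S7 (sketch): revise a predicted tag using user data."""
    # S6: second comparison, here an exact match on every field of the partial profile.
    similar = [
        u for u in complete_users
        if all(u["profile"].get(k) == v for k, v in partial_profile.items())
    ]
    tags = Counter(u["tag"] for u in similar)
    # S7: keep the prediction when too few similar users exist or one of them shares it.
    if len(similar) < min_support or predicted in tags:
        return predicted
    # Otherwise replace it with the most common tag among similar users.
    return tags.most_common(1)[0][0]

# Hypothetical complete data users with user data and type tags.
complete_users = [
    {"profile": {"age_group": "20-29", "plan": "prepaid"}, "tag": "gamer"},
    {"profile": {"age_group": "20-29", "plan": "prepaid"}, "tag": "gamer"},
    {"profile": {"age_group": "50-59", "plan": "postpaid"}, "tag": "investor"},
]
corrected = correct_tag(
    "investor", {"age_group": "20-29", "plan": "prepaid"}, complete_users
)
```

The threshold keeps the browsing-based prediction in place unless enough complete data users with matching user data disagree with it, so the correction refines the earlier result rather than overriding it on thin evidence.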

Compared with the prior art, the user type prediction system 1 and method of the present invention build the prediction model 141 from a small number of users, which reduces the computational load. Furthermore, the user type prediction system 1 and method overcome the limitation that existing web crawling can obtain only partial data: building on existing crawling, the user type prediction system 1 analyzes the user types of the complete data users, who account for 10% of all users, and uses them to build the prediction model 141, which predicts the user types of the partial data users, who account for 75% of all users, so that applications such as advertisement delivery can be effectively expanded. The user type prediction system 1 thus provides predictive filling of missing data for big data analysis, increasing both the amount of data that can be computed and its accuracy. In addition, the user type prediction system 1 and method further include a correction function that uses user data to further revise the user type, thereby ensuring the accuracy of the prediction.

The above detailed description of specific embodiments is intended to describe the features and spirit of the present invention more clearly, and is not intended to limit the scope of the present invention to the specific embodiments disclosed. On the contrary, the intention is to cover various changes and equivalent arrangements within the scope of the claims of the present invention.

1‧‧‧User type prediction system

11‧‧‧Data transmission module

12‧‧‧Data classification module

13‧‧‧User type tag module

14‧‧‧User type prediction module

141‧‧‧Prediction model

15‧‧‧User database

Claims

A user type prediction system for predicting a user type of a partial data user who uses a network service, the partial data user generating a plurality of second browsing domain data by using the network service, the user type prediction system comprising: a data transmission module for receiving a plurality of first browsing domain data of a plurality of complete data users and a plurality of web page contents corresponding to the first browsing domain data, and for receiving the second browsing domain data of the partial data user; a data classification module connected to the data transmission module, the data classification module being used to classify the first browsing domain data and the corresponding web page contents, and to generate a plurality of classification results respectively corresponding to the first domain data; a user type tag module connected to the data classification module, the user type tag module being used to generate, according to the first browsing domain data of each complete data user and the corresponding classification results, a type tag corresponding to each complete data user; and a user type prediction module connected to the user type tag module, the user type prediction module being used to compare the second browsing domain data of the partial data user with the first browsing domain data of the complete data users to generate a first comparison result, and to attach at least one of the type tags to the partial data user according to the first comparison result, so as to predict the user type of the partial data user.

The user type prediction system of claim 1, wherein the user type prediction module is used to compare the first browsing domain data with the second browsing domain data, and when the composition of the second browsing domain data falls within the coverage of the first browsing domain data of a first complete data user among the complete data users, the user type tag module attaches the type tag corresponding to the first complete data user to the partial data user, so as to predict the user type of the partial data user.

The user type prediction system of claim 1, wherein the user type prediction module further includes a prediction model for comparing the second browsing domain data with the first browsing domain data to generate the first comparison result and for attaching at least one of the type tags to the partial data user according to the first comparison result, and wherein the prediction model performs the comparison using the K-nearest neighbor algorithm (KNN).

The user type prediction system of claim 3, wherein K of the K-nearest neighbor algorithm (KNN) is 1.

The user type prediction system of claim 3, wherein the prediction model is updated by machine learning based on the first browsing domain data of the complete data users and the web page contents corresponding to the first browsing domain data.

The user type prediction system of claim 1, further comprising a user database connected to the user type prediction module, the user database being used to store a plurality of user data of the complete data users and of the partial data user, wherein the user type prediction module compares the user data of the partial data user with the user data of the complete data users to generate a second comparison result, and revises each of the type tags according to the second comparison result so as to correct the user type of the partial data user, thereby improving the accuracy of the prediction of the user type of the partial data user.

A user type prediction method for predicting a user type of a partial data user who uses a network service, the partial data user generating a plurality of second browsing domain data by using the network service, the user type prediction method comprising the following steps: receiving a plurality of first browsing domain data and a corresponding plurality of web page contents of a plurality of complete data users, and the second browsing domain data of the partial data user; classifying the first browsing domain data and the corresponding web page contents, and generating a plurality of classification results respectively corresponding to the first domain data; generating, according to the first browsing domain data of each complete data user and the corresponding classification results, a type tag corresponding to each complete data user; comparing the second browsing domain data of the partial data user with the first browsing domain data of the complete data users to generate a first comparison result; and attaching at least one of the type tags to the partial data user according to the first comparison result, so as to predict the user type of the partial data user.

The user type prediction method of claim 7, wherein the step of predicting the user type of the partial data user further includes the following sub-step: when the first comparison result indicates that the composition of the second browsing domain data falls within the coverage of the first browsing domain data of a first complete data user among the complete data users, attaching the type tag corresponding to the first complete data user to the partial data user, so as to predict the user type of the partial data user.

The user type prediction method of claim 7, further comprising, after the step of predicting the user type of the partial data user, the following steps: comparing a plurality of user data of the complete data users with the user data of the partial data user to generate a second comparison result; and revising each of the type tags according to the second comparison result so as to correct the user type of the partial data user, thereby improving the accuracy of the prediction of the user type of the partial data user.

The user type prediction method of claim 8, wherein in the step of comparing the first browsing domain data with the second browsing domain data, the comparison is performed using the K-nearest neighbor algorithm (KNN).