TW201822030A - Website classification method and computer program product thereof capable of classifying the target website into various categories according to the relevancy - Google Patents


Info

Publication number
TW201822030A
Authority
TW
Taiwan
Prior art keywords
url
category
keywords
website
classification
Prior art date
Application number
TW105140768A
Other languages
Chinese (zh)
Other versions
TWI667581B (en)
Inventor
楊富丞
呂栢頤
Original Assignee
中華電信股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中華電信股份有限公司
Priority to TW105140768A
Publication of TW201822030A
Application granted
Publication of TWI667581B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a website classification method and a computer program product thereof. The method comprises: when website description content can be obtained from a target website, capturing one or more pieces of website description information; and analyzing the relevancy between each piece of website description information and one or more category keywords of the categories to be classified, so as to classify the target website into a category according to the relevancy. These operations enable automatic website classification, replacing the manual classification used in the prior art.

Description

網址分類方法及其電腦程式產品    URL classification method and computer program product   

The invention relates to a URL classification method and a computer program product thereof, and in particular to a URL classification method and computer program product that classify URLs automatically, without manual effort.

In many network management applications, such as filtering spam or pornographic web pages, network service providers must manually classify a large number of websites. As the number of websites keeps growing, this conventional approach becomes extremely difficult to carry out.

In addition, conventional classification schemes usually require complete web pages and categories during the training phase, together with one-to-one manual labels, and every item must be relabeled whenever the categories change. The cost of URL classification has therefore remained high.

綜上所述,如何提供一種可解決前述問題之方案乃本領域亟需解決之技術問題。 In summary, how to provide a solution that can solve the foregoing problems is a technical problem that needs to be solved in the art.

為解決前揭之問題,本發明之目的係提供一種用於網址分類之技術方案。 In order to solve the problems disclosed previously, the object of the present invention is to provide a technical solution for URL classification.

To achieve the above object, the present invention proposes a URL classification method. When URL description content can be obtained for a target URL, the method extracts one or more pieces of URL description information and analyzes the relevance between each piece of URL description information and one or more category keywords of the categories to be classified, so as to classify the target URL into a category according to the relevance.

為達上述目的,本發明提出一種網址分類電腦程式產品,當電腦裝置載入並執行電腦程式產品,可完成前述方法所述之步驟。 To achieve the above object, the present invention provides a computer program product for categorizing web addresses. When a computer device loads and executes the computer program product, the steps described in the foregoing method can be completed.

In summary, the URL classification method and computer program product of the present invention analyze the relevance between URL description information and category keywords and automatically classify the target URL into a specific category according to that relevance, thereby reducing the cost of manually labeling URLs with categories one by one.

S1000~S1200‧‧‧步驟 S1000 ~ S1200‧‧‧step

圖1為本案實施範例之建立分類模式流程圖。 FIG. 1 is a flowchart of establishing a classification mode according to an implementation example of the present case.

圖2為本案實施範例之網址分類模型建立流程圖。 FIG. 2 is a flowchart of establishing a website classification model according to an embodiment of the present invention.

圖3為網址描述擷取與前處理之步驟流程圖。 Figure 3 is a flowchart of the steps of fetching and preprocessing the URL description.

圖4為本案網址分類模型建立與執行之步驟流程圖。 FIG. 4 is a flowchart of steps for establishing and implementing a website classification model in this case.

圖5為本案判斷新網址類別並產生新網址分類結果之步驟流程圖。 FIG. 5 is a flowchart of steps for judging a new URL category and generating a classification result of the new URL for this case.

以下將描述具體之實施例以說明本發明之實施態樣,惟其並非用以限制本發明所欲保護之範疇。 The following describes specific embodiments to illustrate the implementation of the present invention, but it is not intended to limit the scope of the present invention.

In a first embodiment, the present invention provides a URL classification method. When URL description content can be obtained for a target URL, the method extracts one or more pieces of URL description information and analyzes the relevance between each piece of URL description information and one or more category keywords of the categories to be classified, so as to classify the target URL into a category according to the relevance.

In another embodiment, when URL description content cannot be obtained for the target URL, the method extracts one or more URL keywords from the target URL and compares them with one or more training-set URL keywords, so as to classify the target URL into the category corresponding to the relevant training-set URL keywords.

In another embodiment, the method analyzes the relevance according to the number of matching strings between the URL keywords and the training-set URL keywords. In another embodiment, the method analyzes the relevance according to the URL level at which the matching strings are located. In another embodiment, the method computes weights from both the number of matching strings and the URL levels at which they are located in order to analyze the relevance.

In another embodiment, the method analyzes the probability distribution between each piece of URL description information and the category keywords to calculate the relevance. In another embodiment, the method analyzes the probability distribution based on a latent Dirichlet allocation model. In another embodiment, the latent Dirichlet allocation model uses Gibbs sampling to estimate the relevance. In another embodiment, the method further translates the keywords of the target URL before classification.

本發明於第二實施例更提供一種網址分類電腦程式產品,當電腦裝置載入並執行電腦程式產品,可完成前述方法所述之步驟。 The second embodiment of the present invention further provides a computer program product for categorizing web addresses. When a computer device loads and executes the computer program product, the steps described in the foregoing method can be completed.

以下本發明茲以第一實施例之網址分類方法進行範例說明,惟第二實施例之網址分類電腦程式產品亦可達到相同或相似之技術功效。 In the following, the present invention is exemplified by the URL classification method of the first embodiment. However, the computer program product of the URL classification of the second embodiment can also achieve the same or similar technical effects.

請參閱圖1,其為本案實施範例之建立分類模式流程圖。其步驟說明如下: Please refer to FIG. 1, which is a flowchart of establishing a classification mode according to an implementation example of this case. The steps are explained as follows:

S1000:建立分類模型並輸出其模型訓練結果。 S1000: Establish a classification model and output its model training results.

S1200:判斷新網址類別,並產生新網址分類結果。 S1200: Determine the new URL category and generate a classification result of the new URL.

請參閱圖2,為本案網址分類模型建立之流程圖,包括下列步驟: Please refer to Figure 2 for the flowchart of establishing the URL classification model of this case, including the following steps:

S1110:建立網址訓練集。 S1110: Create a training set of URLs.

S1120:網址描述擷取與前處理。 S1120: URL description retrieval and pre-processing.

S1130:產生網址描述集。 S1130: Generate a URL description set.

S1140:建立類別知識。 S1140: Establish category knowledge.

S1150:網址分類模型建立與執行等步驟。 S1150: Establish and execute the URL classification model.

Steps S1110 to S1130 form the data preparation stage. First, S1110 builds the URL training set by collecting a group of URLs whose descriptions can be retrieved. Next, S1120 retrieves the URL description content and preprocesses it; FIG. 3 shows the detailed steps of S1120. S1121 retrieves the URL description of each URL in the training set. The description can be obtained in different ways, for example by using a web crawler to fetch the content at the URL. S1122 uses natural language processing to segment the description into words and lemmatize them. S1123 checks whether each lemmatized word is a stop word (a word generally considered unimportant or lacking discriminative power) so that it can be dropped. Finally, S1124 removes words of certain parts of speech and keeps only the parts of speech of interest. The words that remain after S1122 to S1124 are called keywords. S1130 collects all URL description keywords and combines them into a single file.
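The preprocessing steps S1122 to S1124 can be sketched as follows. This is a minimal illustration only: the stop-word list, the lemma table, the part-of-speech lookup, and the function name `extract_keywords` are all hypothetical stand-ins for whatever natural language toolkit an implementation would actually use.

```python
import re

# Hypothetical resources; a real system would obtain these from an NLP toolkit.
STOP_WORDS = {"the", "a", "of", "and", "is", "to"}
LEMMAS = {"videos": "video", "watching": "watch", "shared": "share"}
POS = {"video": "NOUN", "watch": "VERB", "share": "VERB", "quickly": "ADV"}
KEEP_POS = {"NOUN", "VERB"}  # parts of speech of interest (assumed)


def extract_keywords(description: str) -> list[str]:
    """Turn a raw URL description into a keyword list (S1122 to S1124)."""
    tokens = re.findall(r"[a-z0-9]+", description.lower())  # S1122: segment into words
    lemmas = [LEMMAS.get(t, t) for t in tokens]              # S1122: lemmatize
    content = [t for t in lemmas if t not in STOP_WORDS]     # S1123: drop stop words
    return [t for t in content if POS.get(t) in KEEP_POS]    # S1124: keep chosen parts of speech


print(extract_keywords("Watching the shared videos quickly"))
```

Words missing from the toy part-of-speech table are dropped, which keeps the example short; a real tagger would label every token.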

In S1140, the category knowledge can supply the categories of interest together with several keywords for each category, known URL classification knowledge, and a library of known similar URLs. This step supports building and executing the URL classification model in S1150, so the approach can be regarded as semi-supervised.

Once the URL description set from S1130 and the category knowledge from S1140 are available, both are used as model input and S1150 is executed to build the URL classification model. The modeling mechanism is based on latent Dirichlet allocation (LDA). LDA is a generative probabilistic model for identifying topics hidden in a large document collection. It assumes that a document is a bag of words with no ordering among them, that a document may contain several topics, and that each word in the document is generated by one of those topics. The original LDA is an unsupervised learning algorithm: it needs no manual labels, only the document collection and the number of topics. The present application applies the same idea to find the categories hidden in a large set of URL descriptions: a document corresponds to a URL, the document collection to the URL description set, and a topic to a category. The model ultimately yields two matrices, the URL-category matrix θ and the category-keyword matrix φ, both of which can be obtained by probabilistic inference.

The inference method used for LDA here is Gibbs sampling. During execution it repeatedly estimates the probability that each keyword belongs to each category and, based on that distribution, picks one category and assigns it to the keyword, forming a keyword-to-category correspondence. This whole procedure is carried out by S1150. Let $K = \{k_1, k_2, \ldots, k_n\}$ be the set of all categories, $W = \{w_1, w_2, \ldots, w_v\}$ the set of all keywords, and $D = \{d_1, d_2, \ldots, d_m\}$ the set of all URL descriptions. To simplify notation, $k$ denotes some category $k_j$, $1 \le j \le n$; $w$ denotes some keyword $w_i$, $1 \le i \le v$; and $d$ denotes some URL description $d_f$, $1 \le f \le m$. As shown in equation (1), the probability $p(w \rightarrow k)$ that keyword $w$ belongs to category $k$ is estimated jointly from two terms: $p(w \mid k)$ is the probability of keyword $w$ in category $k$, and $p(k \mid d_w)$ is the proportion of keywords assigned to category $k$ among all keywords of $d_w$, the URL description that contains $w$.
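Equation (1) itself is not reproduced in this text. One plausible reading, assuming the two terms are multiplied as in standard collapsed Gibbs sampling for LDA, is sketched below. The count symbols $n_{k,w}$ (times $w$ is currently assigned to $k$), $n_k$ (keywords currently assigned to $k$), and $n_{d_w,k}$ (keywords of $d_w$ assigned to $k$) are notation introduced here for illustration, and any smoothing terms the original may use are omitted.

```latex
p(w \rightarrow k) \;\propto\; p(w \mid k)\, p(k \mid d_w),
\qquad
p(w \mid k) \approx \frac{n_{k,w}}{n_k},
\qquad
p(k \mid d_w) = \frac{n_{d_w,k}}{|d_w|}
```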

In S1150, one round is complete when every keyword has been linked to a category through formula (1). Training stops when the number of rounds reaches a preset threshold. From the final keyword-to-category correspondences, the URL-category matrix θ and the category-keyword matrix φ are computed: θ describes the category probability distribution of each URL, and φ describes the probability distribution of each keyword over the categories.
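As a rough sketch of how θ and φ could be derived once every keyword has a category, the snippet below simply counts assignments and normalizes them. The function name `build_matrices` and the dictionary layout are assumptions, and φ is normalized per category here (an estimate of p(w | k)); the source does not spell out the exact normalization.

```python
from collections import Counter


def build_matrices(docs, assignments, categories):
    """docs: {url: [keyword, ...]}; assignments: {url: [category per keyword]}.

    Returns (theta, phi), where theta[url][k] is the share of the URL's
    keywords assigned to category k, and phi[k][w] is the share of category
    k's assigned keywords that are w.
    """
    theta = {}
    cat_word = {k: Counter() for k in categories}
    for url, words in docs.items():
        cats = assignments[url]
        counts = Counter(cats)
        n = len(words) or 1  # guard against empty descriptions
        theta[url] = {k: counts[k] / n for k in categories}
        for w, k in zip(words, cats):
            cat_word[k][w] += 1
    phi = {}
    for k, counter in cat_word.items():
        total = sum(counter.values())
        phi[k] = {w: c / total for w, c in counter.items()} if total else {}
    return theta, phi


docs = {"a.example": ["video", "music", "cart"]}
assignments = {"a.example": ["audiovisual", "audiovisual", "e-commerce"]}
theta, phi = build_matrices(docs, assignments, ["social", "audiovisual", "e-commerce"])
print(theta["a.example"])
print(phi["audiovisual"])
```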

The category knowledge of S1140 should start with the categories of interest and their keywords (known category keywords); a category of interest can consist of several keywords. This step can also use other known knowledge to further improve the classification model, including URL classification knowledge and a similar-URL library. Known URL classification knowledge describes the correspondence between a URL and its category; for example, www.youtube.com.tw is considered to belong to the audiovisual entertainment category. The known similar-URL library records the correspondence between a URL and other similar URLs; for example, the URLs similar to www.youtube.com include www.vimeo.com and www.dailymotion.com.

When building and executing the URL classification model in S1150, the present application uses the correspondence between categories of interest and their keywords from the S1140 category knowledge to replace the probabilistic assignment of equation (1) and produce the correspondence directly, and uses the known URL categories and the known similar-URL library to adjust equation (1). The details of S1150 are shown in FIG. 4. S1150A iterates over each keyword w in the training set. In S1150B, if w is already contained in the keyword set of a category k, S1151 assigns w directly to category k. Otherwise, if both known URL category knowledge and known similar-URL library knowledge are available (S1150C), S1152 uses formula (2) to determine the category of w, and the same steps are repeated for the next keyword in the training set. Here [symbol missing] denotes the known URL classification knowledge and [symbol missing] denotes the known similar-URL library knowledge (S1150D), where [symbol missing] represents a known probability. If no category knowledge is available, S1153 uses formula (1) to determine the category of w and the same steps are repeated for the next keyword. If only known URL classification knowledge is available, S1154 uses formula (3) to determine the category of w and the same steps are repeated for the next keyword. If only the known similar-URL library is available, S1155 uses formula (4) to determine the category of w and the same steps are repeated for the next keyword.
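The branching in FIG. 4 can be sketched as a small dispatch function. Formulas (2) to (4) are not reproduced in this text, so the sketch only reports which rule applies and does not evaluate any formula. The function name `choose_rule`, the data layout, and the sample data are hypothetical.

```python
def choose_rule(word, url, seed_keywords, known_url_categories, similar_urls):
    """Decide how the category of `word` (found in the description of `url`) is set.

    Returns ("assign", category) for a known category keyword, otherwise
    ("formula", n) with n in {1, 2, 3, 4} as described for S1151 to S1155.
    """
    for category, words in seed_keywords.items():
        if word in words:
            return ("assign", category)       # S1151: known category keyword
    has_classification = url in known_url_categories
    has_similar = url in similar_urls
    if has_classification and has_similar:
        return ("formula", 2)                  # S1152
    if has_classification:
        return ("formula", 3)                  # S1154
    if has_similar:
        return ("formula", 4)                  # S1155
    return ("formula", 1)                      # S1153: plain Gibbs update


# Hypothetical category knowledge for illustration only.
seeds = {"audiovisual": {"video", "music"}, "e-commerce": {"cart"}}
known = {"www.youtube.com.tw": {"audiovisual": 1.0}}
similar = {"www.youtube.com": ["www.vimeo.com", "www.dailymotion.com"]}

print(choose_rule("video", "www.example.com", seeds, known, similar))
print(choose_rule("channel", "www.youtube.com", seeds, known, similar))
```

If a word were listed under more than one category, this sketch would return the first match; the source does not say how such a conflict should be resolved.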

In summary, the URL classification model takes the URL description set as input, together with known category keywords, known URL category knowledge, and a known similar-URL library that help improve the model, and its final output is the classification model, namely θ and φ. Within the S1140 category knowledge, the known categories of interest and their keywords directly determine the keyword-to-category correspondence in Gibbs sampling, so that such a keyword is never assigned to an unrelated category. Once this correspondence is formed, it also influences the category assignment of the other keywords in the same URL description, which in turn affects the category distribution of other URL descriptions that share those keywords. Even when a keyword is not a known category keyword, known URL classification knowledge and the similar-URL library help assign it to a more suitable category. These design refinements improve the performance of the classification model. In addition, θ records the probability that each URL belongs to each category, removing the earlier restriction of one category per URL.

Please refer to FIG. 5, a flowchart of how S1200 determines the category of a new URL. To classify a new URL, the present application provides two mechanisms based on the model training results obtained in S1100: an online classification mechanism and an offline classification mechanism. If a URL description can be retrieved for the new URL (S1200A, S1200B), the online mechanism is used; otherwise, the offline mechanism is used.

When a new URL arrives, S1210 preprocesses its URL description as in S1120 and extracts keywords, keeping only the more meaningful words as the keyword set d. Next, S1220 translates every keyword in d into the corresponding language and combines the translations with the original keywords to form an augmented URL description d'. Finally, S1230 computes, from the keywords in d', the probability of belonging to each category, as shown in equation (5), where |d'| is the number of keywords in d', E(w_g) is the weight of the g-th keyword w_g with a value in [0, 1], and p(w_g | k) is obtained from the category-keyword matrix φ. The weight calculation depends on the usage scenario, and any weighting variant that achieves the same effect should be regarded as an equivalent implementation of this method. In other words, formula (5) reflects that if most keywords in d' have a high probability of appearing in the social category, then d' also has a relatively high probability of belonging to the social category. Note that because every URL or keyword in the model obtained from S1100 can belong to several categories, a new URL can likewise be assigned to several categories. If the user wants each URL to fall into exactly one category, the category with the highest probability can simply be selected.
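Equation (5) is not reproduced in this text. Assuming it is the weighted average suggested by the surrounding description, that is, the sum of E(w_g) · p(w_g | k) over the keywords of d' divided by |d'|, the online scoring step could look like the sketch below. The function name, the default weight of 1.0, and the sample probabilities are illustrative assumptions.

```python
def score_categories(keywords, phi, weights=None):
    """Score each category for an augmented URL description.

    keywords: list of keywords in d' (original plus translated).
    phi: {category: {keyword: p(keyword | category)}}.
    weights: optional {keyword: E(w)} in [0, 1]; missing keywords weigh 1.0.
    """
    weights = weights or {}
    n = len(keywords)
    scores = {}
    for category, word_probs in phi.items():
        total = sum(weights.get(w, 1.0) * word_probs.get(w, 0.0) for w in keywords)
        scores[category] = total / n if n else 0.0
    return scores


# Hypothetical category-keyword probabilities.
phi = {
    "social": {"friend": 0.5},
    "e-commerce": {"shop": 0.4, "buy": 0.2},
}
scores = score_categories(["shop", "buy", "friend"], phi)
best = max(scores, key=scores.get)  # single-label choice: highest score
print(scores, best)
```

Keeping the full score dictionary preserves the multi-category result; taking the maximum corresponds to the one-category-per-URL option mentioned above.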

The offline classification mechanism is used when URL categories must be identified in real time or when dynamically retrieving the URL description is not permitted. Its underlying idea is that two similar URLs should belong to similar categories. The offline mechanism therefore compares the new URL against every training URL in the training set to find the most similar one and assigns that URL's category distribution to the new URL. In the present application, every substring of a URL delimited by "." or "/" is treated as a string, and the similarity of two URLs is defined as the number of strings in their longest match. The longest sequence of matching strings in order is the longest common subsequence, and the number of strings in it is its length. For example, as shown in Table 1, for the new URL mod.cht.com.tw/channelrw/01.php the similarities to s1, s2, s3, and s4 are 6, 4, 3, and 6, respectively. In addition, each segment of a URL delimited by "/" is defined as a level; for example, mod.cht.com.tw/channelrw/01.php has three levels: mod.cht.com.tw, channelrw, and 01.php. Strings at different levels of a URL differ in importance; in general, strings in earlier levels are more representative and distinctive. The present application therefore adjusts the similarity with a weighting scheme based on the level of each matching string: the weight of a keyword w is 1 / (level of w). As shown in Table 2, the similarities between the new URL and s1, s2, s3, and s4 then become 4.83, 4, 3, and 4.67. Any weighting variant that achieves the same effect should be regarded as an equivalent implementation of this method.
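The similarity measure can be sketched with a standard longest-common-subsequence computation over URL tokens. Tables 1 and 2 are not reproduced in this text, so the two training URLs below are hypothetical, chosen only so that the results agree with the quoted values for s1 and s4. The sketch also assumes scheme-less URLs (no "http://") and takes each token's level from the new URL; the source does not say which URL's level is used.

```python
def tokenize(url):
    """Split a URL into (token, level) pairs; levels are '/'-separated and 1-based."""
    pairs = []
    for level, segment in enumerate(url.split("/"), start=1):
        pairs.extend((tok, level) for tok in segment.split(".") if tok)
    return pairs


def lcs_matches(a, b):
    """Return the indices into `a` of one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]  # dp[i][j] = LCS length of a[i:], b[j:]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    matched, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            matched.append(i)
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return matched


def similarity(new_url, train_url, weighted=False):
    """Number of LCS tokens, or the sum of 1/level over them when weighted."""
    new_pairs = tokenize(new_url)
    train_tokens = [tok for tok, _ in tokenize(train_url)]
    idx = lcs_matches([tok for tok, _ in new_pairs], train_tokens)
    if not weighted:
        return float(len(idx))
    return sum(1.0 / new_pairs[i][1] for i in idx)


new = "mod.cht.com.tw/channelrw/01.php"
s1 = "mod.cht.com.tw/channelrw/02.php"   # hypothetical training URL
s4 = "mod.cht.com.tw/vod/01.php"         # hypothetical training URL
print(similarity(new, s1), round(similarity(new, s1, weighted=True), 2))
print(similarity(new, s4), round(similarity(new, s4, weighted=True), 2))
```

With these inputs the unweighted similarity is 6 in both cases, while the weighted values are 4 + 1/2 + 1/3 and 4 + 1/3 + 1/3, which shows how a match in an earlier level counts for more.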

In summary, for a new URL, S1240 computes its similarity to each URL in the training set from the length of their longest common subsequence and the levels of the matching strings. S1250 then finds the URL with the highest similarity and assigns that URL's category distribution to the new URL. In the Table 2 example, s1 is most similar to the new URL, so the new URL receives the same category probability distribution as s1. When more than one URL shares the highest similarity, the category distributions of those URLs are averaged and the result is assigned to the new URL.
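The selection step S1250, including the averaging of ties, might be sketched as follows. The function name and the sample distributions are hypothetical; only the similarity values come from the Table 2 figures quoted in the text.

```python
def assign_distribution(similarities, theta):
    """Pick the most similar training URL(s) and return their mean distribution.

    similarities: {train_url: similarity to the new URL}.
    theta: {train_url: {category: probability}}.
    """
    best = max(similarities.values())
    winners = [url for url, s in similarities.items() if s == best]
    categories = theta[winners[0]].keys()
    return {k: sum(theta[u][k] for u in winners) / len(winners) for k in categories}


# Similarities as quoted for Table 2; the distributions are made up.
sims = {"s1": 4.83, "s2": 4.0, "s3": 3.0, "s4": 4.67}
theta = {
    "s1": {"social": 0.1, "audiovisual": 0.8, "e-commerce": 0.1},
    "s2": {"social": 0.3, "audiovisual": 0.3, "e-commerce": 0.4},
    "s3": {"social": 0.6, "audiovisual": 0.2, "e-commerce": 0.2},
    "s4": {"social": 0.2, "audiovisual": 0.7, "e-commerce": 0.1},
}
print(assign_distribution(sims, theta))
```

Ties are detected by exact equality here, which is adequate for a sketch; an implementation working with floating-point weights may prefer a small tolerance.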

In S1110, a set of training URLs is first established, as shown in the first column of Table 3; only three representative entries are listed here. After S1120 retrieves and preprocesses the URL descriptions, the URL description set of S1130 is obtained, as shown in the second column of Table 3.

Tables 4, 5, and 6 are examples of the S1140 category knowledge. Table 4 gives example known category keywords in three categories: social, audiovisual, and e-commerce; the keywords should be chosen according to the application and its goals. Table 5 is an example of known URL classification knowledge, assumed to come from several people labeling the category of each URL, with the category probability computed as {number of people who chose category k} / {number of people who gave feedback}. Table 6 is an example of a known similar-URL library, assumed to be obtained from Google search results.
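The vote-based category probability used for the known URL classification knowledge is a simple ratio. A minimal sketch, with a hypothetical function name and made-up votes:

```python
from collections import Counter


def category_probabilities(votes):
    """votes: one category label per person who gave feedback for a URL."""
    counts = Counter(votes)
    return {k: c / len(votes) for k, c in counts.items()}


# Four hypothetical annotators label the same URL.
print(category_probabilities(["audiovisual", "audiovisual", "social", "audiovisual"]))
```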

Next, S1150 builds the model. Model building repeatedly adjusts the correspondence between each keyword and a category according to the given conditions. As shown in FIG. 4, the process first checks whether the keyword is already included in the keywords of some category; if so, the category-keyword correspondence is produced directly. If not, formula (1), (2), (3), or (4) is chosen according to the available known knowledge. When the keyword is not a known category keyword and the URL whose description contains it has neither known classification knowledge nor a similar-URL library entry, formula (1) determines its category. When the keyword is not a known category keyword and its URL has both known classification knowledge and a known similar-URL library entry, formula (2) is used. When the keyword is not a known category keyword and its URL has only known URL classification knowledge, formula (3) is used. When the keyword is not a known category keyword and its URL has only a known similar-URL library entry, formula (4) is used. The category-keyword matrix φ is one of the results of S1150.

Table 7 uses the resulting φ to find representative keywords for each category. How representative a keyword is of a category is determined by its probability within that category: the higher the probability, the better it represents the category. The table lists only the five highest-probability keywords per category. Table 8 is the second result of the classification model, the URL-category matrix θ, which lists the category probability distribution of each URL. Note that this distribution differs from the known URL classification knowledge: it is computed from the keyword-to-category correspondences within the URL description. Under the model's assumptions every URL has some probability of belonging to every category, but probabilities that are too low are shown as 0 here.

Finally, the model results are used in S1200 to identify the category of a new URL. Suppose the category of the new URL shopping.friday.tw is to be identified. S1210 first retrieves its URL description, shown in the first column of Table 9 (an example only; URL descriptions vary with the situation). In S1220, word segmentation and removal of particular parts of speech yield the keywords shown in the second column of Table 9. S1230 uses automatic translation to translate the keywords; the target language depends on the situation, and English is used here as an example. The original keywords and the translated keywords together form the augmented URL description, shown in the third column of Table 9. Suppose the weight of keyword w is computed as {number of occurrences of w in d_w × [factor missing]}, as in the third column of each category in Table 7. For each keyword in the augmented URL description, its probability in each category is looked up in Table 7, and the weighted average of formula (5) gives the probability of the URL for each category. For example, applying formula (5) to the augmented description of shopping.friday.tw might yield the distribution {social = 0.05, audiovisual = 0.05, e-commerce = 0.9}, because its keywords are more likely to appear in the e-commerce category. Now suppose the new URL www.momoshop.com.tw/category/food has no URL description, or none can be retrieved. S1240 then computes its similarity to every training URL in Table 8, obtaining 0, 0, and 1.

因此,S1250則將www.momoshop.com.tw的類別機率指定給此新網址。最終,此新網址類別機率則為{社交=0.01,影音=0.01,電子商務=0.98}。如果每個網址只分成一類,其分類結果可設定為擁有機率最大的類別,即電子商務類。 Therefore, S1250 assigned the category probability of www.momoshop.com.tw to this new URL. In the end, the probability of this new URL category is {social = 0.01, video = 0.01, e-commerce = 0.98}. If each URL is divided into only one category, the classification result can be set to the category with the highest probability of ownership, namely, the e-commerce category.

The above detailed description specifically describes one feasible embodiment of the present invention. This embodiment is not intended to limit the patent scope of the present invention; any equivalent implementation or modification that does not depart from the technical spirit of the present invention shall fall within the patent scope of this case.

Claims (10)

1. A URL classification method, comprising: when URL description content can be obtained for a target URL, retrieving one or more pieces of URL description information, and analyzing the relevance between each piece of URL description information and one or more category keywords of a category to be classified, so as to classify the target URL into the category according to the relevance.
2. The URL classification method of claim 1, further comprising: when the URL description content cannot be obtained for the target URL, extracting one or more URL keywords from the target URL and comparing the URL keywords with one or more training-set URL keywords, so as to classify the target URL into the category corresponding to the relevant training-set URL keywords.
3. The URL classification method of claim 2, wherein the relevance is analyzed according to the number of matching strings between the URL keywords and the training-set URL keywords.
4. The URL classification method of claim 3, wherein the relevance is analyzed according to the URL level at which each matching string is located.
5. The URL classification method of claim 4, wherein a weight is computed according to the number of matching strings and the URL level at which they are located, so as to analyze the relevance.
6. The URL classification method of claim 1, wherein the probability distribution between each piece of URL description information and the category keywords is analyzed to compute the relevance.
7. The URL classification method of claim 6, wherein the probability distribution is analyzed based on a latent Dirichlet allocation model.
8. The URL classification method of claim 7, wherein the latent Dirichlet allocation model uses Gibbs sampling to estimate the relevance.
9. The URL classification method of claim 1, further comprising translating the keywords of the target URL to perform the classification.
10. A URL classification computer program product, wherein the method of any one of claims 1 to 9 is performed when a computer device loads and executes the computer program product.
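The URL-based fallback of claims 2 to 5 (S1240 and S1250 in the description) could look roughly like the Python sketch below. The source does not give the exact similarity formula, so the level weights, the segment-by-segment matching rule, and the normalization are assumptions chosen to reproduce the 0, 0, 1 similarities of the worked example. The two training hosts ending in `.example` and their probabilities are hypothetical placeholders; only www.momoshop.com.tw and its probabilities come from the text.

```python
from urllib.parse import urlsplit


def url_levels(url):
    """Split a URL into levels: host first, then each path segment."""
    parts = urlsplit(url if "//" in url else "//" + url)
    return [parts.netloc.lower()] + [s for s in parts.path.split("/") if s]


def similarity(new_url, train_url, level_weights=(1.0, 0.5, 0.25)):
    """Level-weighted share of the training URL's levels matched by the new URL.

    Assumption: a match at a shallower level (the host) counts more than a
    match at a deeper path level.
    """
    new_levels = url_levels(new_url)
    train_levels = url_levels(train_url)
    matched = total = 0.0
    for i, segment in enumerate(train_levels):
        w = level_weights[min(i, len(level_weights) - 1)]
        total += w
        if i < len(new_levels) and new_levels[i] == segment:
            matched += w
    return matched / total


def classify_by_url(new_url, training):
    """Copy the category probabilities of the most similar training URL."""
    best = max(training, key=lambda t: similarity(new_url, t))
    if similarity(new_url, best) == 0.0:
        return None  # nothing in the training set resembles this URL
    return training[best]


TRAINING = {
    "social.example": {"social": 0.98, "video": 0.01, "e-commerce": 0.01},  # hypothetical
    "video.example": {"social": 0.01, "video": 0.98, "e-commerce": 0.01},   # hypothetical
    "www.momoshop.com.tw": {"social": 0.01, "video": 0.01, "e-commerce": 0.98},
}
NEW_URL = "www.momoshop.com.tw/category/food"

print([similarity(NEW_URL, t) for t in TRAINING])
print(classify_by_url(NEW_URL, TRAINING))
```

Normalizing by the training URL's own levels means a training entry that is only a host name scores 1 whenever the hosts agree, which matches the example in the description.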
TW105140768A 2016-12-09 2016-12-09 URL classification method and computer program product TWI667581B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW105140768A TWI667581B (en) 2016-12-09 2016-12-09 URL classification method and computer program product

Publications (2)

Publication Number Publication Date
TW201822030A true TW201822030A (en) 2018-06-16
TWI667581B TWI667581B (en) 2019-08-01

Family

ID=63258388

Family Applications (1)

Application Number Title Priority Date Filing Date
TW105140768A TWI667581B (en) 2016-12-09 2016-12-09 URL classification method and computer program product

Country Status (1)

Country Link
TW (1) TWI667581B (en)

Also Published As

Publication number Publication date
TWI667581B (en) 2019-08-01
