TWI484431B

TWI484431B - Multi-source heterogeneous network data community analysis method

Info

Publication number: TWI484431B
Application number: TW102133784A
Authority: TW
Original assignee: Chunghwa Telecom Co Ltd
Priority date: 2013-09-18
Filing date: 2013-09-18
Publication date: 2015-05-11
Also published as: TW201513010A

Description

Multi-source heterogeneous network data community analysis method

The present invention relates to a multi-source heterogeneous network data community analysis method that identifies users across different networks by means of multiple feature attributes, producing an attribution probability (the probability that accounts on different networks belong to the same user), and that analyzes the data to find users with cross-network influence.

Social networking applications are a new type of application that has grown rapidly in recent years, and the market is still in its early stages. Social networks are not merely a new business model, however; they also replace the exchanges that traditional social activities required. More industries can therefore be expected to invest in developing social networks and in social network analysis applications.

US patent US20120284384 A1, "Computer processing method and system for Network data", describes generating sub-communities by filtering relationships and then applying any clustering algorithm. Its data source, however, is limited to a single network, so it cannot analyze multiple data sources. In addition, the value used in its relationship filtering is only the number of links a user has within a single network, so it likewise cannot filter in a multi-network environment.

The primary object of the present invention is to provide a multi-source heterogeneous network data community analysis method that determines, across multi-source heterogeneous networks, the probability that users in different networks are the same user.

A secondary object of the present invention is to provide a multi-source heterogeneous network data community analysis method that identifies mashup community clusters whose members interact frequently and closely, and that finds users with cross-network influence.

To achieve the above objects, the multi-source heterogeneous network data community analysis method of the present invention obtains users' network behaviors and feature attributes from heterogeneous network data and compares them to determine whether users in different networks are the same user, producing an attribution probability on which attribution is based. Users confirmed to be the same user are joined by relationship connections, and candidate users are joined by candidate relationship connections. Each user relationship connection within each individual network is normalized and then compared with the relationship connections to obtain a relationship weight. A social network clustering algorithm then groups the users into a plurality of mashup community clusters, and within each cluster, users with a high probability of being the same user are merged according to the attribution probability. Next, for each relationship connection in a mashup community cluster, the number of heterogeneous networks in which the connection appears is divided by the total number of heterogeneous networks to produce a heterogeneous weight. Finally, an evaluation metric from social network analysis is used to calculate influence within the community and thereby find its important influential users.

100‧‧‧Calculate multi-source attribution probability

200‧‧‧Build virtual concept network

300‧‧‧Calculate relationship weights

400‧‧‧Filter relationships

500‧‧‧Generate mashup community clusters

600‧‧‧Merge heterogeneous network relationships

700‧‧‧Calculate heterogeneous weights

800‧‧‧Analyze user influence

101‧‧‧Obtain each user, together with identifying network behaviors and feature attributes, from a plurality of heterogeneous network data sets, and assign each feature attribute a weight

102‧‧‧Compare the obtained feature attributes to find users whose feature attributes are identical, and merge the uniquely identified users

103‧‧‧Excluding the uniquely identified users, treat the users not yet identified as candidate attribution pairs, and derive an attribution probability from the feature attributes they share or from similar network behaviors

104‧‧‧Do not merge when the attribution probability is below the attribution threshold; merge when it is above the threshold

Fig. 1 is a simplified flow chart of the present invention; Fig. 2 is a flow chart of calculating the multi-source attribution probability according to the present invention.

Referring to Fig. 1, the multi-source heterogeneous network data community analysis method of the present invention is an analysis method applied to a media program. Its flow comprises: step 1, calculate multi-source attribution probability 100; step 2, build virtual concept network 200; step 3, calculate relationship weights 300; step 4, filter relationships 400; step 5, generate mashup community clusters 500; step 6, merge heterogeneous network relationships 600; step 7, calculate heterogeneous weights 700; and step 8, analyze user influence 800.

Referring to Fig. 2, step 1, calculate multi-source attribution probability 100, addresses multi-source heterogeneous network data in which the same user may log in to the heterogeneous networks with the same or different accounts. It comprises the following sub-steps.

Step 1-1: obtain each user, together with identifying network behaviors and feature attributes, from a plurality of heterogeneous network data sets, and assign each feature attribute a weight 101. The feature attributes and network behaviors of each user in the heterogeneous networks are used to determine the probability that users in different networks are the same user. Each feature attribute has a weight. A feature attribute is any attribute that describes a user, such as name, ID number, e-mail address, mobile phone number, place of residence, workplace, school attended or interests, and can be obtained from the information users provide on each website. Network behavior refers to, for example, overlapping mutual friends or identical posts made at the same time and place; it can be collected with techniques such as web crawling.

Step 1-2: compare the obtained feature attributes to find users whose feature attributes are identical, and merge the uniquely identified users 102.

Step 1-3: excluding the uniquely identified users, treat the users not yet identified as candidate attribution pairs, and derive an attribution probability from the feature attributes they share or from similar network behaviors 103.

Step 1-4: set an attribution threshold that decides whether a merge is performed. The attribution probability lies between 0.0 and 1.0, and the attribution threshold between 0.5 and 1.0. Users whose attribution probability is below the threshold are not merged; users whose attribution probability is above it are merged 104. An attribution probability table is then generated, recording the attribution probabilities of users across the heterogeneous networks.

Steps 1-2 and 1-3 produce the multi-source attribution probability through direct identification and indirect probabilistic identification. Direct identification merges users who can be uniquely identified by a feature attribute common to the heterogeneous networks; the attribution probability of such users is 1, and the identifying feature attribute may be a single attribute or a composite attribute. Indirect identification excludes the users already identified, groups the remaining users into candidate attribution pairs, and derives an attribution probability from the proportion of feature attributes they share or of similar network behaviors. The degree of similarity is calculated for each feature attribute: an attribute that matches exactly has a match probability of 1, which is multiplied by that attribute's weight, while a partial match yields a partial-match probability, so that graded probability values are obtained. Finally, the probability values of all attributes are summed and averaged to give the attribution probability of each candidate pair.
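The direct and indirect identification rules can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name, the attribute names and the individual weights are hypothetical, and the sketch assumes that a uniquely identifying attribute is one whose weight equals 1.

```python
from typing import Mapping


def attribution_probability(
    weights: Mapping[str, float],
    similarity: Mapping[str, float],
) -> float:
    """Average of weight * similarity over the attributes two networks share.

    weights:    attribute name -> attribute weight (1.0 for uniquely
                identifying attributes, below 1.0 otherwise).
    similarity: attribute name -> match degree in [0, 1]; 1.0 is an exact
                match, a missing key counts as no match.
    """
    if not weights:
        raise ValueError("at least one shared attribute is required")
    # Direct identification: exact match on a uniquely identifying attribute.
    for name, weight in weights.items():
        if weight == 1.0 and similarity.get(name, 0.0) == 1.0:
            return 1.0
    # Indirect identification: weighted matches averaged over shared attributes.
    total = sum(w * similarity.get(name, 0.0) for name, w in weights.items())
    return total / len(weights)


# Hypothetical case: four shared attributes, only "residence" matches exactly.
weights = {"name": 0.9, "residence": 0.9, "workplace": 0.8, "school": 0.8}
print(round(attribution_probability(weights, {"residence": 1.0}), 3))  # 0.225
```

A partial match can be expressed by passing a similarity between 0 and 1 for that attribute; how such a similarity is measured is left open by the description.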

Step 2, build virtual concept network 200, uses the attribution probabilities derived in calculate multi-source attribution probability 100 to convert the connections of the multiple social networks into an equivalent virtual concept. An attribution probability of 1 denotes a user confirmed with complete certainty; a confirmed user's relationships in the original networks must be relabeled with a unified attribution code, and relationship connections are established. An attribution probability other than 1 denotes a candidate user, who may keep the original code but for whom candidate relationship connections must be established: the relationships a user has with other users in the original network are transferred to the other candidate user so as to establish heterogeneous network connections. The result is a relationship structure table recording the connections between all users.

Step 3, calculate relationship weights 300, determines the strength of the relationships between users, which depends on how much the users interact within a network. With multiple data sources, the size of each network must be taken into account so that the influence of a large network does not excessively dilute that of a small network and the influence of a small network is not excessively amplified. Each user relationship is therefore normalized as follows. First, the relationship strength of every user relationship connection within each individual network is divided by the maximum relationship strength in that network, so that every strength lies between 0 and 1. The result is then compared with the relationship connections obtained in build virtual concept network 200. For a transferred relationship connection, the relationship strength is multiplied by the attribution probability to obtain the relationship weight; for a connection duplicated across networks, the strengths of the identical merged connections are summed and averaged to obtain the relationship weight. All relationship weights are calculated in turn, yielding a relationship weight table that records the relationship weight between each pair of users and the networks involved.
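The per-network normalization can be illustrated with a short Python sketch. The function name and the raw strengths are hypothetical; the only rule taken from the description is that each strength is divided by the maximum strength within its own network.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]


def normalize_network(strengths: Dict[Edge, float]) -> Dict[Edge, float]:
    """Divide every strength by the network's maximum so values fall in (0, 1]."""
    if not strengths:
        return {}
    peak = max(strengths.values())
    if peak <= 0:
        raise ValueError("relationship strengths must be positive")
    return {edge: value / peak for edge, value in strengths.items()}


# Hypothetical raw interaction counts for two networks of very different size.
big = {("A", "B"): 500.0, ("A", "C"): 250.0}
small = {("M", "K"): 4.0, ("M", "J"): 1.0}
print(normalize_network(big))    # {('A', 'B'): 1.0, ('A', 'C'): 0.5}
print(normalize_network(small))  # {('M', 'K'): 1.0, ('M', 'J'): 0.25}
```

After normalization the strongest tie in the small network carries the same strength as the strongest tie in the large one, which is the balancing effect the step is meant to achieve.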

Step 4, filter relationships 400, operates on the relationship weight table produced by calculate relationship weights 300, in which every relationship weight lies between 0.0 and 1.0. A relationship threshold filtering method sets a relationship threshold, between 0.5 and 1.0, that decides whether a relationship connection is kept: connections whose weight is below the threshold are deleted, and connections whose weight is above it are kept. The threshold may be given directly as an absolute value, or all relationship weights may be sorted in descending order and the top percentage taken, such that the resulting threshold lies between 0.5 and 1.0.
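Both filtering variants, the absolute threshold and the top percentage of a descending sort, might look like the following Python sketch. The function names and sample weights are hypothetical, and keeping a connection whose weight exactly equals the threshold is an assumption, since the description only covers weights strictly below or above it.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]


def filter_absolute(weights: Dict[Edge, float], threshold: float = 0.5) -> Dict[Edge, float]:
    """Keep connections whose weight reaches a fixed threshold (ties kept: assumption)."""
    return {edge: w for edge, w in weights.items() if w >= threshold}


def filter_top_fraction(weights: Dict[Edge, float], fraction: float) -> Dict[Edge, float]:
    """Sort weights in descending order and keep the top `fraction` of connections."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    keep = int(len(ranked) * fraction)
    return dict(ranked[:keep])


# Hypothetical relationship weights.
weights = {("A", "B"): 0.8, ("I", "Y"): 0.48, ("A", "K"): 0.9, ("L", "X"): 0.3}
print(filter_absolute(weights))           # {('A', 'B'): 0.8, ('A', 'K'): 0.9}
print(filter_top_fraction(weights, 0.5))  # {('A', 'K'): 0.9, ('A', 'B'): 0.8}
```

The percentage variant does not itself guarantee that the implied cut-off lies between 0.5 and 1.0; a caller would have to check the smallest kept weight against that range.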

Step 5, generate mashup community clusters 500: the relationship connections with weight above 0.5 selected by filter relationships 400 may span multiple networks. A social network clustering algorithm then groups them to find mashup community clusters whose members interact frequently and closely. Those skilled in the art may use social network clustering algorithms such as n-clique, n-clan and k-core.
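Of the algorithms named, k-core is the simplest to sketch: nodes with fewer than k neighbours are removed repeatedly until none remain. The Python below is a plain illustration with a hypothetical graph; it treats connections as undirected, which is an assumption, and returns only the surviving nodes without splitting them into connected components.

```python
from typing import Dict, Iterable, Set, Tuple


def k_core(edges: Iterable[Tuple[str, str]], k: int) -> Set[str]:
    """Return the nodes of the k-core of an undirected graph."""
    neighbours: Dict[str, Set[str]] = {}
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbours.setdefault(u, set()).add(v)
            neighbours.setdefault(v, set()).add(u)
    removed = True
    while removed:
        removed = False
        # Nodes that currently have fewer than k neighbours are peeled off.
        for node in [n for n, adj in neighbours.items() if len(adj) < k]:
            for other in neighbours.pop(node):
                neighbours[other].discard(node)
            removed = True
    return set(neighbours)


# Hypothetical graph: a triangle A-B-C with a pendant node D attached to C.
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
print(sorted(k_core(edges, 2)))  # ['A', 'B', 'C']
```

With k equal to 2, node D is dropped because it has a single neighbour, while every member of the triangle keeps two.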

Step 6, merge heterogeneous network relationships 600: a mashup community cluster produced by generate mashup community clusters 500 may contain the same user who has merely established relationships in different networks. To avoid a meaningless mashup community cluster whose members are all the same user, users within the same mashup community cluster who may be the same user are merged according to the attribution probability table.

Step 7, calculate heterogeneous weights 700: after the merged community clusters are obtained, the users and connections within each mashup community cluster may come from a mixture of networks, and influence across networks is smaller than influence within a single network. The weight of every relationship in a mashup community cluster must therefore be recalculated: for each relationship connection in the cluster, the number of heterogeneous networks in which it appears is divided by the total number of heterogeneous networks to give its heterogeneous weight. The result is a heterogeneous weight table recording the user relationships and heterogeneous weights within each mashup community cluster.
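The heterogeneous weight is a simple ratio, shown here as a Python sketch. The function name, the network labels N1 to N3 and the sample connections are hypothetical.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]


def heterogeneous_weights(
    edge_networks: Dict[Edge, Set[str]], total_networks: int
) -> Dict[Edge, float]:
    """Weight of a connection = networks it appears in / total number of networks."""
    if total_networks <= 0:
        raise ValueError("total_networks must be positive")
    return {edge: len(nets) / total_networks for edge, nets in edge_networks.items()}


# Hypothetical cluster drawn from three networks.
edge_networks = {("A", "B"): {"N1", "N2"}, ("A", "K"): {"N2"}}
het = heterogeneous_weights(edge_networks, total_networks=3)
print({edge: round(w, 3) for edge, w in het.items()})
# {('A', 'B'): 0.667, ('A', 'K'): 0.333}
```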

Step 8, analyze user influence 800, uses the merge results of merge heterogeneous network relationships 600 and the heterogeneous weight table produced by calculate heterogeneous weights 700. Using an evaluation metric from social network analysis, the heterogeneous weights of each user's relationship connections are summed and multiplied by the attribution probability to produce an influence score, which identifies the important influential users in the community. Evaluation metrics from social network analysis include degree, betweenness, eigenvector and shortest path.
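With the degree metric, the influence score amounts to a weighted degree scaled by the attribution probability. The Python sketch below is one possible reading: the function name and data are hypothetical, each user is assumed to carry a single attribution probability, and users absent from the probability map are assumed to have probability 1.

```python
from typing import Dict, Tuple

Edge = Tuple[str, str]


def influence_scores(
    het_weights: Dict[Edge, float], user_probability: Dict[str, float]
) -> Dict[str, float]:
    """Sum the heterogeneous weights of each user's connections, then scale
    the sum by that user's attribution probability (default 1.0: assumption)."""
    totals: Dict[str, float] = {}
    for (u, v), weight in het_weights.items():
        totals[u] = totals.get(u, 0.0) + weight
        totals[v] = totals.get(v, 0.0) + weight
    return {user: total * user_probability.get(user, 1.0) for user, total in totals.items()}


# Hypothetical heterogeneous weights and attribution probabilities.
het_weights = {("A", "B"): 2 / 3, ("A", "C"): 1 / 3, ("C", "J"): 1 / 3}
scores = influence_scores(het_weights, {"C": 0.625})
print(max(scores, key=scores.get))  # A
```

Other metrics named in the description, such as betweenness or eigenvector centrality, would replace the weighted-degree sum while keeping the final multiplication by the attribution probability.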

An embodiment is given below to describe the multi-source heterogeneous network data community analysis method of the present invention in detail. Suppose an e-commerce company wishes to promote a new product and has obtained heterogeneous network data from three networks. The company can apply the present invention to that data, using community analysis to find strongly connected mashup communities and the people with cross-network influence in the heterogeneous networks, and then launch a series of word-of-mouth marketing campaigns.

Step 1, calculate multi-source attribution probability 100. Tables 1, 2 and 3 are the heterogeneous network data obtained by the company. Every feature attribute in each network has an attribute weight: feature attributes that uniquely identify a user (such as e-mail address or mobile phone number) have a weight of 1, and all other feature attributes have a weight below 1. Two heterogeneous networks are chosen first for the attribution calculation. Direct identification comes first: users are merged on feature attributes that uniquely identify them, that is, attributes with a weight of 1. This shows that (N1.A) and (N2.M) are the same user, with an attribution probability of 1. Indirect probabilistic identification follows: the directly identified users are excluded and the remaining users are combined into candidate attribution pairs. The candidate pairs generated for user D of N1 are [(N1.D)(N2.J)], [(N1.D)(N2.K)], [(N1.D)(N2.I)] and [(N1.D)(N2.L)]. A probability value is calculated from the feature attributes of each candidate pair. N1 and N2 have four attributes in common, each with its own weight; a pair scores the weight of every attribute on which it matches, and the total is divided by the number of common attributes. The calculation gives:

Probability that [(N1.D)(N2.I)] are the same user: (0.9)/4=0.225

Probability that [(N1.D)(N2.J)] are the same user: (0.9)/4=0.225

Probability that [(N1.D)(N2.K)] are the same user: (0.9+0.9+0.8)/4=0.625

Probability that [(N1.D)(N2.L)] are the same user: (0.9)/4=0.225

The more feature attributes are available, the more accurate the resulting probability. In this embodiment, a probability below 0.5 is deemed to have no attribution value, so no merge is performed. The next user is then selected and evaluated until attribution has been calculated for every user in the network. Once the two heterogeneous networks are complete, the users of the third network are added and the calculation continues until everything has been calculated, giving the attribution probability table shown in Table 4.
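The threshold decision for user D can be expressed as a small Python sketch. The probabilities are taken as given from the worked example (0.9/4 = 0.225 for the single-match pairs and the reported 0.625 for the pair with N2.K); the function and variable names are hypothetical, and keeping a pair whose probability exactly equals the threshold is an assumption.

```python
ATTRIBUTION_THRESHOLD = 0.5  # the embodiment does not merge below 0.5

# Probabilities from the worked example for user D of N1 against users of N2.
candidates = {
    ("N1.D", "N2.I"): 0.225,
    ("N1.D", "N2.J"): 0.225,
    ("N1.D", "N2.K"): 0.625,
    ("N1.D", "N2.L"): 0.225,
}


def pairs_to_merge(probabilities, threshold=ATTRIBUTION_THRESHOLD):
    """Return candidate pairs whose probability reaches the threshold, best first."""
    kept = [(pair, p) for pair, p in probabilities.items() if p >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


print(pairs_to_merge(candidates))  # [(('N1.D', 'N2.K'), 0.625)]
```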

Step 2, build virtual concept network 200. Based on the attribution probabilities in Table 4, users with attribution probability 1 must have their user codes in the original networks replaced with a unified attribution code. [(N1.A), (N2.M)] are the same user with attribution code A, so every occurrence of user code M in N2 must be replaced with A, and the original connections (M->K) and (M->J) become (A->K) and (A->J). An attribution probability other than 1 denotes a candidate attribution pair; the original codes may be kept, but candidate relationship connections must be established, and the relationships each user in the pair has with other users in its original network must be transferred to the other member of the pair to establish heterogeneous network connections. The candidate pair [(N2.L), (N3.Y)] has an attribution probability of 0.8. The original N2 connection (L->I) must be transferred to (N3.Y), forming (I->Y), and the N3 connection (Y->X) must likewise be transferred to (N2.L), forming (L->X). The process continues in turn until heterogeneous network connections have been established for all users, producing a relationship connection table (Table 5) that records the connections between users and the number of networks in which each connection is duplicated.
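The two operations in this step, relabeling a confirmed user and transferring the connections of a candidate pair, can be sketched in Python as follows. The function names are hypothetical. The sketch assumes that connections are undirected and that a transfer copies a connection to the other candidate while keeping the original; neither point is stated explicitly in the text.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Edge = FrozenSet[str]


def relabel(edges: Iterable[Tuple[str, str]], mapping: Dict[str, str]) -> Set[Edge]:
    """Replace merged user codes with their unified attribution code."""
    out: Set[Edge] = set()
    for u, v in edges:
        u, v = mapping.get(u, u), mapping.get(v, v)
        if u != v:  # drop self-loops created by a merge
            out.add(frozenset((u, v)))
    return out


def transfer(edges: Set[Edge], a: str, b: str) -> Set[Edge]:
    """Copy each candidate's connections to the other member of the pair."""
    extra: Set[Edge] = set()
    for edge in edges:
        if a in edge and b not in edge:
            (other,) = edge - {a}
            extra.add(frozenset((b, other)))
        elif b in edge and a not in edge:
            (other,) = edge - {b}
            extra.add(frozenset((a, other)))
    return edges | extra


original = [("M", "K"), ("M", "J"), ("L", "I"), ("Y", "X")]
network = relabel(original, {"M": "A"})  # (N2.M) is confirmed to be A
network = transfer(network, "L", "Y")    # candidate pair [(N2.L), (N3.Y)]
print(sorted(tuple(sorted(edge)) for edge in network))
# [('A', 'J'), ('A', 'K'), ('I', 'L'), ('I', 'Y'), ('L', 'X'), ('X', 'Y')]
```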

Step 3, calculate relationship weights 300. Each user relationship connection within each individual network is first normalized so that every relationship weight lies between 0 and 1: in this embodiment, each original relationship strength is divided by the maximum relationship strength in its network, so that all strengths in the network lie between 0 and 1. The result is then compared with the relationship connection table obtained in step 200. For a transferred relationship connection, the attribution probability from Table 4 must be taken into account: the original strength of (I->Y) is 0.6, but because the relationship was transferred, the strength is multiplied by the attribution probability, giving 0.6*0.8=0.48. For a duplicated user connection, the identical merged connections are summed and averaged: (A->B) has strengths 0.6 and 1, so its final relationship weight is 0.8. All relationship weights are calculated in turn, yielding a relationship weight table (Table 6) that records the relationship weight between each pair of users and the networks involved.
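The two weight rules reduce to a product and a mean, as this short Python check of the figures in the example shows. The function names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def transferred_weight(strength: float, probability: float) -> float:
    """A transferred connection is discounted by the attribution probability."""
    return strength * probability


def duplicated_weight(strengths: Sequence[float]) -> float:
    """A connection seen in several networks takes the average of its strengths."""
    return mean(strengths)


print(round(transferred_weight(0.6, 0.8), 2))   # 0.48, the (I->Y) connection
print(round(duplicated_weight([0.6, 1.0]), 2))  # 0.8, the (A->B) connection
```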

Step 4, filter relationships 400: every relationship whose weight is below 0.5 is deleted.

Step 5, generate mashup community clusters 500. From the closer relationships selected by filter relationships 400, clustering is performed with the k-core method of social network clustering, with k equal to 2, which shows that user codes A, B, C, I, J and L belong to the same community.

Step 6, merge heterogeneous network relationships 600. The user attribution table produced in step 100 shows that C and I in the community may be the same user, with a probability of 0.625, so they must be merged. The method is the same as that used for users with attribution probability 1 in build virtual concept network 200: the attribution code is C, every occurrence of user code I is replaced with C, and the original connections (I->J) and (I->L) become (C->J) and (C->L). The new community cluster is A, B, C, J, L, and a new heterogeneous relationship table (Table 7) is produced.
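The merge of I into C inside the cluster can be sketched in Python. The function name is hypothetical, and the (A, B) connection is included only as a hypothetical example of a connection that the merge leaves untouched, since Table 7 is not reproduced here.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]


def merge_in_cluster(
    members: Set[str], edges: Set[Edge], keep: str, drop: str
) -> Tuple[Set[str], Set[Edge]]:
    """Replace user code `drop` with `keep` in the member set and in every connection."""
    new_members = {keep if m == drop else m for m in members}
    new_edges: Set[Edge] = set()
    for u, v in edges:
        u = keep if u == drop else u
        v = keep if v == drop else v
        if u != v:  # a connection between the two merged codes disappears
            new_edges.add((u, v))
    return new_members, new_edges


cluster = {"A", "B", "C", "I", "J", "L"}
edges = {("I", "J"), ("I", "L"), ("A", "B")}
members, merged = merge_in_cluster(cluster, edges, keep="C", drop="I")
print(sorted(members))  # ['A', 'B', 'C', 'J', 'L']
print(sorted(merged))   # [('A', 'B'), ('C', 'J'), ('C', 'L')]
```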

Step 7, calculate heterogeneous weights 700. The heterogeneous relationship table obtained in step 600 shows that the connections between users in each cluster may come from multiple networks. The networks that each edge can influence are counted; the more networks an edge covers, the greater its weight should be. In this example, the number of networks covered by each relationship connection is divided by the total number of networks, yielding a heterogeneous weight table (Table 8) that records the user relationships and heterogeneous weights within each community cluster.

Step 8, analyze user influence 800. This example uses the degree metric of social network analysis to calculate influence within the community and find the important influential people in the group. Because every user results from the attribution process, the attribution probability produced for each user in step 100 must be included in the calculation. The heterogeneous weights of the connections involving attribution code A are 2/3, 1/3 and 1/3, which are summed; the sum is then multiplied by the attribution probabilities associated with attribution code A in Table 4, which are 1 and 1, giving an influence score of 1.67. The calculation proceeds in turn, finally producing a user influence table for each sub-community cluster (Table 9) that records the influence score of every user in each community cluster. From it, the people with cross-network influence in the heterogeneous network relationships can be identified; in this embodiment they are attribution codes A and B.

The above detailed description is a specific description of one feasible embodiment of the present invention. The embodiment is not intended to limit the patent scope of the present invention; any equivalent implementation or modification that does not depart from the spirit of the invention shall fall within the patent scope of this case.

Claims

A multi-source heterogeneous network data analysis method, being an analysis method applied to a media program, comprising the steps of:
Step 1: obtaining each user, together with identifying network behaviors and feature attributes, from a plurality of heterogeneous network data sets, and assigning each feature attribute a weight;
Step 2: comparing the obtained feature attributes; if a unique user with identical feature attributes is found, the attribution probability is 1 and the method proceeds to step 4; otherwise the method proceeds to step 3;
Step 3: excluding the uniquely identified users, treating the users not yet identified as candidate attribution pairs, and deriving an attribution probability from the feature attributes those users share or from similar network behaviors;
Step 4: merging users whose attribution probability is greater than an attribution threshold, not merging users whose attribution probability is less than the attribution threshold, and generating an attribution probability table;
Step 5: setting the attribution codes of merged users with attribution probability 1 in the heterogeneous network data to the same code and establishing relationship connections, and retaining the attribution codes of users whose attribution probability is not 1 and establishing candidate relationship connections;
Step 6: dividing the relationship strength of each user relationship connection in each individual network by the maximum relationship strength in that network, so that the relationship strength of every relationship connection lies between 0 and 1, and comparing it with the relationship connections; if the comparison shows a transferred connection, multiplying the relationship strength of that connection by the attribution probability to obtain a relationship weight; if the comparison shows a duplicated connection, summing the relationship strength of that connection with the attribution probability to obtain the relationship weight;
Step 7: using a threshold filtering method, deleting relationship connections whose relationship weight is less than a relationship threshold and retaining those whose relationship weight is greater than the relationship threshold, and clustering with a social network clustering algorithm to form a plurality of mashup community clusters;
Step 8: according to the attribution probability table, merging users in a mashup community cluster who may be the same user;
Step 9: dividing the number of heterogeneous networks in which each relationship connection in the mashup community cluster appears by the total number of heterogeneous networks to produce a heterogeneous weight;
Step 10: using an evaluation metric of social network analysis, summing the heterogeneous weights of each user relationship connection and multiplying the sum by the attribution probability to produce an influence score, so as to identify the important influential users in the community.

The multi-source heterogeneous network data analysis method according to claim 1, wherein the attribution probability is between 0.0 and 1.0, the attribution threshold is between 0.5 and 1.0, and the relationship threshold is between 0.5 and 1.0.

The multi-source heterogeneous network data analysis method according to claim 1, wherein the feature attributes include name, ID number, e-mail address, mobile phone number, place of residence, workplace, school attended, interests, or information provided by the user on other websites.

The multi-source heterogeneous network data analysis method according to claim 1, wherein the network behavior is a pattern of the user's operations on the network, including overlapping mutual friends or identical posts made at the same time and place.

The multi-source heterogeneous network data analysis method according to claim 1, wherein the unique user with identical feature attributes is a user who matches on a uniquely identifying feature attribute, namely ID number, mobile phone number or e-mail address.

The multi-source heterogeneous network data analysis method according to claim 1, wherein the shared feature attributes or similar network behaviors comprise users who match on any two or more of name, place of residence, workplace and school attended and can thereby be identified as the same user, or similar network behaviors comprising mutual friends or identical posts made at the same time or place.

The multi-source heterogeneous network data analysis method according to claim 1, wherein the attribution probability is obtained by summing the attribute weights of the matching feature attributes and dividing the total by the number of feature attributes shared between the heterogeneous network data.