TW201905733A

TW201905733A - Multi-source data fusion method and device

Info

Publication number: TW201905733A
Application number: TW107108813A
Authority: TW
Inventors: 徐喆昊
Original assignee: 香港商阿里巴巴集團服務有限公司
Priority date: 2017-06-28
Filing date: 2018-03-15
Publication date: 2019-02-01
Also published as: CN107341220B; CN107341220A; WO2019001429A1

Abstract

Embodiments of this specification provide a multi-source data fusion method and apparatus. The method is used to obtain data belonging to the same entity from a data set. For any entity, at least one associated attribute of each entity attribute of the entity is obtained; an attribute similarity between the associated attributes of two entities is then obtained; and if the attribute similarity is greater than a similarity threshold, the two entities are determined to be the same entity, and the entity attributes of both are associated with that entity.

Description

Multi-source data fusion method and device

This disclosure relates to the field of network technology, and in particular to a multi-source data fusion method and device.

When data analysis is performed on an entity, a large amount of attribute data describing the entity needs to be obtained. This attribute data can come from multiple sources, for example from information entered by users themselves, from web crawlers, or from several different channel providers. Data from different sources may follow different data standards, so descriptions of the same entity may differ. For example, two data sources may in fact describe the same entity while giving it different names or different addresses.

When analyzing an entity, the large amount of attribute data describing that entity can all be associated with it, that is, multi-source data fusion is performed for the entity, and the entity is then analyzed on the basis of the fused data. A solution is needed that fuses the multi-source data of the same entity more accurately.

In view of this, the embodiments of this specification provide a multi-source data fusion method and device for performing multi-source data fusion accurately and quickly.

Specifically, the present disclosure is implemented through the following technical solutions.

In a first aspect, a multi-source data fusion method is provided. The method is used to obtain data belonging to the same entity from a data set; the data set includes data belonging to multiple entities, and the data of each entity includes at least one entity attribute. The method includes: for any entity, obtaining at least one associated attribute of each entity attribute; obtaining the attribute similarity of the associated attributes of two entities; and, if the attribute similarity is greater than a similarity threshold, determining that the two entities are the same entity and associating the entity attributes of both entities with that same entity.

In a second aspect, a multi-source data fusion device is provided. The device is used to obtain data belonging to the same entity from a data set; the data set includes data belonging to multiple entities, and the data of each entity includes at least one entity attribute. The device includes: an attribute acquisition module, configured to obtain, for any entity, at least one associated attribute of each entity attribute; a similarity calculation module, configured to obtain the attribute similarity of the associated attributes of two entities; and an association processing module, configured to determine that the two entities are the same entity if the attribute similarity is greater than a similarity threshold, and to associate the entity attributes of both entities with that same entity.

The multi-source data fusion method and device provided in the embodiments of this specification construct a similarity calculation based on the associated attributes of entity attributes in order to measure the similarity between two entities. Differences in how entity attributes are described therefore do not prevent the same entity from being identified, and the multi-source data of the same entity can be obtained quickly and accurately. This gives an effective measure across multi-source data with different data formats, so that data of the same entity can be identified and fused and the entity's data becomes more complete.

To enable those skilled in the art to better understand the technical solutions in one or more embodiments of this specification, those solutions are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of one or more embodiments of this specification, without inventive effort, shall fall within the scope of protection of this disclosure.

Data analysis often involves collecting data about the same entity from multiple channels and carrying out a reasonably accurate business analysis of the entity on the basis of that data. In practice, even when data from these sources all describe features of the same entity, the descriptions may differ. For example, the same physical store may be named m in source L1 and n in source L2; the names m and n refer to the same store and differ only literally. Likewise, sources L1 and L2 may describe the address of that store differently.

When performing multi-source data fusion, the data belonging to the same entity must be obtained and all associated with that entity, so that the entity can be analyzed on the basis of the data.
However, because the sources in the example above describe the entity inconsistently, their data may fail to be associated with the same entity. The multi-source data fusion method provided by one or more embodiments of this specification addresses this problem, so that data of the same entity can be associated even when the entity is described in different ways.

The method involves "entity attributes" and "associated attributes". An entity attribute is a direct attribute of an entity collected from a data source, and an associated attribute is another attribute related to an entity attribute. Examples follow.

An entity attribute may be the address of a physical store. The "latitude and longitude coordinates" corresponding to that address are an associated attribute of the "address", as is the "province, city and district to which the address belongs".

As another example, an entity attribute may be the contact phone number of a physical store. The "province to which the number belongs" is an associated attribute of the "contact phone number"; so are the "name of the consignee commonly used with that number" and the "contact email address corresponding to that number".

Associated attributes can be obtained in several ways. They may be entity attributes of other entities, or they may be derived from historically collected big data; for example, the commonly used delivery address or the commonly used consignee name for a contact phone number can be obtained from stored historical transaction data.
Each entity attribute has at least one associated attribute.

To make associated attributes quick and convenient to obtain in later processing, a graph database can be built in advance. FIG. 1 illustrates part of such a graph database. The graph database may include multiple attribute nodes, such as attribute node 11, attribute node 12, attribute node 13 and attribute node 14 in FIG. 1. Attribute nodes that are associated with each other are connected by an edge; for example, attribute node 11 and attribute node 12 are connected by an edge, indicating that the province to which a number belongs is related to the number. Attribute nodes with no association need not be connected by an edge.

The edges that connect attribute nodes in the graph database make it quick to find the attribute nodes related to a given attribute node, which is what the lookup of associated attributes relies on. For example, suppose attribute node 11 is an entity attribute, the contact phone number. Based on the node connections, the attributes of the attribute nodes connected to attribute node 11 by an edge can all be determined to be associated attributes of the contact phone number, for example the province to which the number belongs and the name of the consignee commonly used with the number. The graph database can be built from the entity attributes of other entities or from historically collected big data.

Building on the above explanation of "entity attributes" and "associated attributes", the multi-source data fusion method of one or more embodiments of this specification is described below with reference to FIG. 2. In this method, the similarity between entities is measured by calculating the similarity of the "associated attributes" of different entities.
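The associated-attribute lookup in the graph database described above can be illustrated with a minimal in-memory sketch. It is not the patent's implementation: the class name `AttributeGraph`, the representation of a node as an `(attribute_type, value)` tuple, and all sample values are assumptions made for illustration, and a real deployment would presumably use an actual graph database.

```python
from collections import defaultdict


class AttributeGraph:
    """Undirected graph whose nodes are (attribute_type, value) tuples (hypothetical layout)."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add_edge(self, node_a, node_b):
        # An edge records that two attribute nodes are associated.
        self._edges[node_a].add(node_b)
        self._edges[node_b].add(node_a)

    def associated_attributes(self, node):
        # Nodes directly connected to `node` by an edge are its associated attributes.
        return set(self._edges.get(node, set()))


# Illustrative data only: a contact phone number and three attributes linked to it.
graph = AttributeGraph()
phone = ("contact_phone", "0571-5550100")
graph.add_edge(phone, ("number_province", "Zhejiang"))
graph.add_edge(phone, ("consignee_name", "Zhang San"))
graph.add_edge(phone, ("contact_email", "shop@example.com"))

neighbors = graph.associated_attributes(phone)
```

Because edges are stored in both directions, the same structure also answers the reverse question, for example which phone numbers are linked to a given email address.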
As mentioned earlier, different entities may be described differently (here "different entities" merely denotes different data sources; they may in fact be the same entity). The difference usually lies in how the "entity attributes" are described. The method in this example judges entity similarity on the basis of associated attributes rather than entity attributes, so differing descriptions of entity attributes do not cause the same entity to be misjudged as different, and similar entities usually have a higher associated-attribute similarity.

In step 202, the data in the data set is converted to a unified data format.

A multi-source heterogeneous data set can be standardized and structured in preprocessing so that the descriptive attributes of entities are normalized. Because the data sources differ, information may be described differently and data formats may follow different standards, for example in letter case, separators, and simplified versus traditional Chinese characters. These need to be unified to improve data quality. A data model can also be built for the entity information; for a store, for example, a standard range of attributes such as phone number, business license and address can be defined, and as much valuable information as possible extracted.

In step 204, data of different entities that meets a predetermined condition is placed into the same data set.

Computing similarity between all pairs of records would amount to a Cartesian product and inflate the amount of computation. To avoid this, the data set can first be roughly partitioned so that data with a higher likelihood of belonging to similar entities is grouped together. This process can be called data bucketing.
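Steps 202 and 204 as described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the method of the specification: the function names, the record fields (`id`, `name`, `city`) and the choice of `city` as the bucketing key are all hypothetical, and conversion between simplified and traditional Chinese is left out because it would need an external converter.

```python
import re
import unicodedata
from collections import defaultdict
from itertools import combinations


def normalize_value(value: str) -> str:
    """Step 202 (sketch): fold character width and case, collapse separators."""
    text = unicodedata.normalize("NFKC", value).casefold()
    return re.sub(r"[\s\-_/,;|]+", " ", text).strip()


def normalize_record(record: dict) -> dict:
    # Normalize every string field; other fields pass through unchanged.
    return {k: normalize_value(v) if isinstance(v, str) else v for k, v in record.items()}


def bucket_records(records, key):
    """Step 204 (sketch): group records that share the same value of a bucketing key."""
    buckets = defaultdict(list)
    for record in records:
        if record.get(key):
            buckets[record[key]].append(record)
    return buckets


def candidate_pairs(records, key):
    # Compare only records within one bucket instead of the full Cartesian product.
    for bucket in bucket_records(records, key).values():
        yield from combinations(bucket, 2)


# Illustrative records from different sources (all values are made up).
raw = [
    {"id": 1, "name": "ＡＢＣ_Store", "city": "Hangzhou"},
    {"id": 2, "name": "abc store", "city": "HANGZHOU"},
    {"id": 3, "name": "XYZ Shop", "city": "Shanghai"},
    {"id": 4, "name": "Abc-Store", "city": "hangzhou "},
]
records = [normalize_record(r) for r in raw]
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(records, "city")]
```

With four records the full Cartesian comparison would need six pairs; bucketing by the normalized city leaves only the pairs inside the one multi-record bucket.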
For example, entities whose unique features are fully identical, such as store name or business license number, can be directly determined to be the same entity. The remaining data, which has not been directly determined to be identical, can be preliminarily classified with strong-rule buckets: data of different entities that meets a predetermined condition is placed into the same data set. For example, the store entities in one data set may share the same city, the same landline area code, or the same service type (food, services, shopping).

The predetermined conditions of the strong-rule buckets can be applied in batches. In a specific implementation, for example, a data set can first be formed from stores in the same city, and steps 206 to 210 performed on it to extract the data of the same entities. A sub-data set can then be formed from the remaining data of that data set according to matching landline area code, and steps 206 to 210 performed again on the sub-data set to extract the data of the same entities.

In step 206, for any entity, at least one associated attribute of each entity attribute is obtained. This step can use the graph database illustrated in FIG. 1 to find at least one associated attribute of an entity attribute according to the connections between attribute nodes. For example, the entity attribute is first located in the graph database, where it is one of the attribute nodes, and the attributes of the attribute nodes connected to it by an edge are then taken as its associated attributes.

In step 208, the attribute similarity of the associated attributes of two entities is obtained.
For example, suppose entity A has attributes a₀, a₁, …, a_n and entity B has attributes b₀, b₁, …, b_n. Usually a₀ and b₀ are the same attribute with different values; for example, both are mobile phone numbers, but the numbers differ. Similarly, a₁ and b₁ are the same attribute; for example, both are store addresses, but the specific address information differs. In this example, attribute pairs such as "a₀ and b₀" and "a₁ and b₁" are called "corresponding entity attributes" of the two entities, meaning that they refer to the same entity attribute.

Take one pair of corresponding entity attributes, "a₀ and b₀", as an example. Suppose the associated attributes of a₀ are α₀, α₁, …, α_n and the associated attributes of b₀ are β₀, β₁, …, β_n. Similarly, α₀ and β₀ are the same attribute with possibly different values; for example, both are the email address associated with the mobile phone number, but the addresses differ. Attribute pairs such as "α₀ and β₀" are called "corresponding associated attributes", meaning that they refer to the same associated attribute, and "α₀ and β₀" is one of the "corresponding associated attributes" of the "corresponding entity attributes" "a₀ and b₀".

Based on the concepts of "corresponding entity attributes" and "corresponding associated attributes", the calculation of the attribute similarity of two entities is explained below.

The attribute similarity between each pair of corresponding associated attributes can be calculated separately, as shown in formula (1). α_i and β_i are two corresponding associated attributes: when α_i is not equal to β_i, the similarity is 0; when α_i = β_i, the similarity takes the value given by formula (1).
Here e is the base of the natural logarithm and N is the number of other attribute values associated with the corresponding associated attribute. For example, if a₀ and b₀ are mobile phone numbers, α₀ and β₀ are the email addresses associated with them, and, when α₀ = β₀, the email address is found to be related to 4 mobile phone numbers, then N = 4. θ is a concentration adjustment parameter. For hot-spot data, such as the city corresponding to a mobile phone number, where one city may be associated with a very large number of phone numbers, θ can be set larger; conversely, for data that is unlikely to repeat, such as email addresses, θ can be set smaller. Every pair of corresponding associated attributes of every pair of corresponding entity attributes can be calculated according to formula (1). For example, for the corresponding entity attributes "a₀ and b₀", the attribute similarity of α₀ and β₀ can be calculated, the attribute similarity of α₁ and β₁ can be calculated, and so on.

Next, the attribute similarity of the two entities can be obtained from the attribute similarities between the corresponding associated attributes and the attribute weights of the corresponding entity attributes.

For example, formula (2) illustrates the calculation of the attribute similarity of entity A and entity B. In it, m is the number of valid attributes of A and B, that is, corresponding attributes for which both entities have a value. In the example above, entity A has attributes a₀, a₁, …, a_n and entity B has attributes b₀, b₁, …, b_n. If a value could not be obtained for at least one of a₁ and b₁, that attribute is invalid; if values can be obtained for both a₀ and b₀, that attribute is valid. There are at most n valid attributes.
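The per-pair similarity described for formula (1) can be sketched in code, with one important caveat: the formula itself is not reproduced in this text, so the functional form below is an assumption and not the patent's formula. The sketch uses a logistic decay, 1 / (1 + e^(N − θ)), chosen only because it matches the stated behaviour: the similarity is 0 for unequal values, it falls as the shared value is linked to more other values (larger N), and a larger θ tolerates a larger N. The function and parameter names are hypothetical.

```python
import math


def associated_attribute_similarity(alpha, beta, n_linked: int, theta: float) -> float:
    """Similarity of one pair of corresponding associated attributes (assumed form)."""
    # Different values contribute no similarity, as stated for formula (1).
    if alpha != beta:
        return 0.0
    # ASSUMPTION: logistic decay in N centred on theta; not taken from the source.
    x = n_linked - theta
    if x >= 0:
        # Numerically stable branch: exp(-x) cannot overflow for large N.
        z = math.exp(-x)
        return z / (1.0 + z)
    return 1.0 / (1.0 + math.exp(x))


# A shared email linked to 4 phone numbers versus a shared city linked to very many.
email_sim = associated_attribute_similarity("shop@example.com", "shop@example.com", 4, theta=6.0)
city_sim = associated_attribute_similarity("Hangzhou", "Hangzhou", 1_000_000, theta=1000.0)
```

Under this assumed form, a rarely shared value such as an email address yields a similarity close to 1, while a value shared by a very large number of records yields a similarity close to 0 unless θ is raised accordingly.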
For a pair of "corresponding entity attributes" (for example, a₀ and b₀), n is the number of valid corresponding associated attributes of that pair. In the same way, if the associated attributes of a₀ are α₀, α₁, …, α_n and those of b₀ are β₀, β₁, …, β_n, there are at most n valid corresponding associated attributes. The weight in formula (2) is the attribute weight of the "corresponding entity attributes" (for example, a₀ and b₀) to which the "corresponding associated attributes" belong; a higher weight can be set for important corresponding entity attributes and a lower weight for unimportant ones. The averaged term in formula (2) represents the average of the attribute similarities of the "corresponding associated attributes" associated with a given pair of "corresponding entity attributes".

In step 210, if the attribute similarity is greater than the similarity threshold, the two entities are determined to be the same entity, and the entity attributes of both are associated with that same entity.

For example, when sim(A, B) is greater than the threshold σ, the two can be considered the same entity. Once two entities have been identified as the same entity, the entity attributes of both can be associated with that entity.

The multi-source data fusion method of this example constructs a similarity calculation based on the associated attributes of entity attributes in order to measure the similarity between two entities. Differences in how entity attributes are described therefore do not prevent the same entity from being identified, and the multi-source data of the same entity can be obtained quickly and accurately. This gives an effective measure across multi-source data with different data formats, so that data of the same entity can be identified and fused and the entity's data becomes more complete.

The steps in the flow shown in FIG. 2 are not restricted to the order shown in the flowchart. In addition, each step can be implemented in software, hardware, or a combination of the two. For example, those skilled in the art can implement the steps as software code, namely computer-executable instructions that realize the logical functions corresponding to the steps. When implemented in software, the executable instructions can be stored in memory and executed by a processor in the device.

For example, corresponding to the above method, one or more embodiments of this specification also provide a data processing device, which may include a processor, a memory, and computer instructions stored in the memory and executable on the processor. By executing the instructions, the processor implements the following steps: for any entity, obtaining at least one associated attribute of each entity attribute; obtaining the attribute similarity of the associated attributes of two entities; and, if the attribute similarity is greater than the similarity threshold, determining that the two entities are the same entity and associating the entity attributes of both with that same entity.

One or more embodiments of this specification also provide a multi-source data fusion device, which can be used to implement the multi-source data fusion method of one or more embodiments of this specification. As shown in FIG. 3, the device may include an attribute acquisition module 31, a similarity calculation module 32 and an association processing module 33.
The attribute acquisition module 31 is configured to obtain, for any entity, at least one associated attribute of each entity attribute. The similarity calculation module 32 is configured to obtain the attribute similarity of the associated attributes of two entities. The association processing module 33 is configured to determine that the two entities are the same entity if the attribute similarity is greater than the similarity threshold, and to associate the entity attributes of both entities with that same entity.

In one example, the attribute acquisition module 31 is specifically configured to: locate the entity attribute in a pre-built graph database, where the entity attribute is one of the attribute nodes in the graph database, the graph database includes multiple attribute nodes, and associated attribute nodes are connected by edges; and determine the attributes corresponding to at least one attribute node connected to the entity attribute by an edge to be the associated attributes of the entity attribute.

In one example, the similarity calculation module 32 is specifically configured to: for the corresponding entity attributes of two entities, determine the corresponding associated attributes of those corresponding entity attributes; calculate the attribute similarity between each pair of corresponding associated attributes; and obtain the attribute similarity of the two entities from the attribute similarities between the corresponding associated attributes and the attribute weights of the corresponding entity attributes.

In one example, as shown in FIG. 4, the device may further include a data classification module 34, configured to place data of different entities that meets a predetermined condition into the same data set.

In one example, as shown in FIG. 4, the device may further include a data preprocessing module 35, configured to convert the data in the data set to a unified data format.
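The division of work between the similarity calculation module and the association processing module can be sketched as follows. This is a hedged illustration, not the device itself. Several things are assumptions: each entity is represented as a dictionary mapping an entity-attribute name to its already-fetched associated attributes; `pair_similarity` is a placeholder for formula (1); the entity-level score is a weighted mean of per-attribute averages, where normalising by the sum of weights is a guess because formula (2) is not reproduced in this text; and grouping matches transitively with a union-find structure is a design choice of this sketch.

```python
def pair_similarity(alpha, beta) -> float:
    # Placeholder for formula (1): exact match scores 1.0, anything else 0.0.
    return 1.0 if alpha == beta else 0.0


def entity_similarity(assoc_a: dict, assoc_b: dict, weights: dict) -> float:
    """Weighted combination of per-attribute average similarities (assumed form)."""
    total, weight_sum = 0.0, 0.0
    for attr, weight in weights.items():
        links_a, links_b = assoc_a.get(attr), assoc_b.get(attr)
        if not links_a or not links_b:
            continue  # invalid attribute: at least one entity has no value
        shared = [name for name in links_a if name in links_b]
        if not shared:
            continue
        mean = sum(pair_similarity(links_a[n], links_b[n]) for n in shared) / len(shared)
        total += weight * mean
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


def fuse(entities: dict, weights: dict, threshold: float) -> list:
    """Group entity ids whose similarity exceeds the threshold (union-find)."""
    parent = {eid: eid for eid in entities}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    ids = list(entities)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if entity_similarity(entities[a], entities[b], weights) > threshold:
                parent[find(b)] = find(a)
    groups = {}
    for eid in ids:
        groups.setdefault(find(eid), []).append(eid)
    return list(groups.values())


# Illustrative input: associated attributes already looked up for each source record.
entities = {
    "L1:m": {"phone": {"email": "shop@example.com", "province": "Zhejiang"},
             "address": {"district": "Xihu"}},
    "L2:n": {"phone": {"email": "shop@example.com", "province": "Zhejiang"},
             "address": {"district": "Xihu"}},
    "L2:k": {"phone": {"email": "other@example.com", "province": "Zhejiang"},
             "address": {"district": "Binjiang"}},
}
weights = {"phone": 2.0, "address": 1.0}
groups = fuse(entities, weights, threshold=0.8)
```

In this made-up data the records named m and n share every associated attribute and end up in one group, while the third record stays separate because only the province matches.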
The devices or modules described in the above embodiments may be implemented by a computer chip or entity, or by a product having a certain function. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.

For convenience of description, the above device is described in terms of modules divided by function. Of course, when implementing one or more embodiments of this specification, the functions of the modules may be implemented in one or more pieces of software and/or hardware.

Those skilled in the art should understand that one or more embodiments of this specification may be provided as a method, a system, or a computer program product. Accordingly, one or more embodiments of this specification may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, one or more embodiments of this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM and optical storage) containing computer-usable program code.

It should also be noted that the terms "include", "comprise" and any other variants are intended to cover non-exclusive inclusion, so that a process, method, product or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, product or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, product or device that includes the element.

One or more embodiments of this specification may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. One or more embodiments of this specification can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.

The embodiments of this specification are described in a progressive manner. For parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the data processing device embodiment is described relatively briefly because it is substantially similar to the method embodiment; for relevant details, see the description of the method embodiment.

The above is merely one or more embodiments of this specification and is not intended to limit this disclosure. Any modification, equivalent replacement or improvement made within the spirit and principles of this disclosure shall fall within its scope of protection.

11‧‧‧Attribute node

12‧‧‧Attribute node

13‧‧‧Attribute node

14‧‧‧Attribute node

202‧‧‧Step

204‧‧‧Step

206‧‧‧Step

208‧‧‧Step

210‧‧‧Step

31‧‧‧Attribute acquisition module

32‧‧‧Similarity calculation module

33‧‧‧Association processing module

34‧‧‧Data classification module

35‧‧‧Data preprocessing module

To explain more clearly the technical solutions in one or more embodiments of this specification or in the prior art, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some of the embodiments recorded in one or more embodiments of this specification, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a partial schematic diagram of a graph database provided by one or more embodiments of this specification;

FIG. 2 is a schematic flowchart of a multi-source data fusion method provided by one or more embodiments of this specification;

FIG. 3 is a schematic structural diagram of a multi-source data fusion device provided by one or more embodiments of this specification;

FIG. 4 is a schematic structural diagram of a multi-source data fusion device provided by one or more embodiments of this specification.

Claims

A multi-source data fusion method, used to obtain data belonging to the same entity from a data set, wherein the data set includes data belonging to multiple entities and the data of each entity includes at least one entity attribute, the method comprising: for any entity, obtaining at least one associated attribute of each entity attribute; obtaining the attribute similarity of the associated attributes of two entities; and, if the attribute similarity is greater than a similarity threshold, determining that the two entities are the same entity and associating the entity attributes of the two entities with the same entity.
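As an illustration only, the steps of the method above can be sketched as follows. The `ASSOCIATED` lookup table, the Jaccard set-overlap metric, the sample records, and the use of a union-find structure to group matching entities transitively are all assumptions made for this sketch; the claim does not specify a similarity metric or a grouping strategy.

```python
from itertools import combinations

# Hypothetical lookup: entity attribute value -> set of associated attributes.
ASSOCIATED = {
    "Acme Ltd": {"acme", "acme limited"},
    "ACME Limited": {"acme", "acme limited", "acme co"},
    "Zeta Inc": {"zeta"},
}


def jaccard(a, b):
    """Set-overlap similarity in [0, 1]; an assumed stand-in metric."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def fuse(entities, threshold):
    """Group entity ids whose associated attributes are similar enough."""
    parent = {eid: eid for eid in entities}

    def find(x):
        # Follow parent links to the group representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(entities, 2):
        assoc_a = ASSOCIATED.get(entities[a]["name"], set())
        assoc_b = ASSOCIATED.get(entities[b]["name"], set())
        # Similarity above the threshold: treat both records as one entity.
        if jaccard(assoc_a, assoc_b) > threshold:
            parent[find(a)] = find(b)

    groups = {}
    for eid in entities:
        groups.setdefault(find(eid), []).append(eid)
    return sorted(sorted(group) for group in groups.values())


records = {
    "r1": {"name": "Acme Ltd"},
    "r2": {"name": "ACME Limited"},
    "r3": {"name": "Zeta Inc"},
}
merged = fuse(records, threshold=0.5)
```

In this sketch `r1` and `r2` share two of three associated attributes, so they end up in one group, while `r3` remains on its own.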

The method according to claim 1, wherein obtaining the associated attributes of each entity attribute comprises: acquiring the entity attribute from a pre-established graph database, wherein the entity attribute is one of the attribute nodes in the graph database, the graph database includes a plurality of attribute nodes, and attribute nodes having an association relationship are connected by edges; and determining the attribute corresponding to at least one attribute node connected to the entity attribute by an edge as an associated attribute of the entity attribute.
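The graph lookup in this claim can be pictured with a small in-memory adjacency structure. This is a minimal sketch, not a real graph database: the `AttributeGraph` class, its method names, and the sample attribute values are hypothetical.

```python
from collections import defaultdict


class AttributeGraph:
    """In-memory stand-in for the pre-established graph database (assumed)."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add_edge(self, a, b):
        """Connect two attribute nodes that have an association relationship."""
        self._edges[a].add(b)
        self._edges[b].add(a)

    def associated_attributes(self, attribute):
        """Return the attributes of nodes joined to `attribute` by an edge."""
        return set(self._edges.get(attribute, set()))


graph = AttributeGraph()
graph.add_edge("West Lake Cafe", "West Lake Coffee House")
graph.add_edge("West Lake Cafe", "12 Lakeside Road")
neighbors = graph.associated_attributes("West Lake Cafe")
```

Edges are stored in both directions, so either endpoint of an association can be used as the starting entity attribute.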

The method according to claim 1, wherein obtaining the attribute similarity of the associated attributes of the two entities comprises: for corresponding entity attributes of the two entities, determining the associated attributes of each of the corresponding entity attributes; calculating the attribute similarity between each pair of corresponding associated attributes; and obtaining the attribute similarity of the two entities according to the attribute similarities between the corresponding associated attributes and the attribute weights of the corresponding entity attributes.
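One possible reading of this weighted combination is sketched below. Several choices are assumptions of the sketch rather than part of the claim: string similarity is taken from Python's `difflib.SequenceMatcher`, pairwise similarities for one attribute are averaged, attributes missing on either side are skipped, and the result is normalized by the weights actually used. The weights and sample values are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import product


def string_similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1] (assumed metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def entity_similarity(assoc_a, assoc_b, weights):
    """Weighted similarity of two entities.

    assoc_a / assoc_b map an entity attribute name (for example "name")
    to the list of associated attribute strings found for that attribute.
    """
    score = 0.0
    total_weight = 0.0
    for name, weight in weights.items():
        left, right = assoc_a.get(name, []), assoc_b.get(name, [])
        if not left or not right:
            continue  # attribute absent on one side: skipped in this sketch
        pairs = [string_similarity(x, y) for x, y in product(left, right)]
        score += weight * (sum(pairs) / len(pairs))
        total_weight += weight
    return score / total_weight if total_weight else 0.0


weights = {"name": 0.6, "address": 0.4}
first = {"name": ["acme"], "address": ["1 main st"]}
second = {"name": ["acme"], "address": ["zzz"]}
same = entity_similarity(first, first, weights)
partial = entity_similarity(first, second, weights)
```

Here the name matches exactly and the address strings share no characters, so `partial` reflects only the name weight, while `same` is the maximum score.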

The method according to claim 1, further comprising: dividing the data of different entities that meet a predetermined condition into the same data set.
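A minimal sketch of such a division step follows, assuming the predetermined condition is "same city". That condition, the `partition` helper, and the sample records are hypothetical; the claim leaves the condition open.

```python
from collections import defaultdict


def partition(records, key):
    """Split records into data sets; records with equal keys share a set."""
    datasets = defaultdict(list)
    for record in records:
        datasets[key(record)].append(record)
    return dict(datasets)


records = [
    {"id": "r1", "city": "Hangzhou"},
    {"id": "r2", "city": "Taipei"},
    {"id": "r3", "city": "Hangzhou"},
]
# Assumed condition: only entities in the same city are compared later.
datasets = partition(records, key=lambda r: r["city"])
```

Dividing the data first means pairwise similarity only needs to be computed within each data set rather than across all records.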

The method according to claim 1, further comprising: unifying the data format of the data in the data set.
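Format unification could look like the sketch below. The specific rules shown (Unicode NFKC normalization, whitespace collapsing, lower-cased field names) are assumptions chosen for illustration; the claim does not say which formats are unified or how.

```python
import unicodedata


def unify_format(record):
    """Return a copy of `record` with normalized field names and values."""
    unified = {}
    for field, value in record.items():
        # NFKC folds full-width letters and ideographic spaces to ASCII forms.
        text = unicodedata.normalize("NFKC", str(value))
        # Collapse runs of whitespace and trim both ends.
        unified[field.strip().lower()] = " ".join(text.split())
    return unified


raw = {" Name ": "ＡＣＭＥ　 Ltd ", "Phone": " 0571  1234 "}
clean = unify_format(raw)
```

Normalizing before comparison keeps purely typographic differences between sources from lowering the attribute similarity.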

A multi-source data fusion device, used to obtain data belonging to the same entity from a data set, wherein the data set includes data belonging to multiple entities and the data of each entity includes at least one entity attribute, the device comprising: an attribute acquisition module, used to obtain, for any entity, at least one associated attribute of each entity attribute; a similarity calculation module, used to obtain the attribute similarity of the associated attributes of two entities; and an association processing module, used to determine that the two entities are the same entity if the attribute similarity is greater than a similarity threshold, and to associate the entity attributes of the two entities with the same entity.

The device according to claim 6, wherein the attribute acquisition module is specifically configured to: acquire the entity attribute from a pre-established graph database, wherein the entity attribute is one of the attribute nodes in the graph database, the graph database includes a plurality of attribute nodes, and attribute nodes having an association relationship are connected by edges; and determine the attribute corresponding to at least one attribute node connected to the entity attribute by an edge as an associated attribute of the entity attribute.

The device according to claim 6, wherein the similarity calculation module is specifically configured to: for corresponding entity attributes of the two entities, determine the associated attributes of each of the corresponding entity attributes; calculate the attribute similarity between each pair of corresponding associated attributes; and obtain the attribute similarity of the two entities according to the attribute similarities between the corresponding associated attributes and the attribute weights of the corresponding entity attributes.

The device according to claim 6, further comprising: a data classification module, used to divide the data of different entities that meet a predetermined condition into the same data set.

The device according to claim 6, further comprising: a data preprocessing module, used to unify the data format of the data in the data set.

A data processing device, comprising a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the instructions: for any entity, obtaining at least one associated attribute of each entity attribute; obtaining the attribute similarity of the associated attributes of two entities; and, if the attribute similarity is greater than a similarity threshold, determining that the two entities are the same entity and associating the entity attributes of the two entities with the same entity.