CN104268270A - Map Reduce based method for mining triangles in massive social network data - Google Patents

Publication number
CN104268270A
Authority
CN
China
Prior art keywords
node
less
mapreduce
degrees
export
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410539880.XA
Other languages
Chinese (zh)
Inventor
周小平
赵云涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Civil Engineering and Architecture
Original Assignee
Beijing University of Civil Engineering and Architecture
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Civil Engineering and Architecture filed Critical Beijing University of Civil Engineering and Architecture
Priority to CN201410539880.XA priority Critical patent/CN104268270A/en
Publication of CN104268270A publication Critical patent/CN104268270A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Primary Health Care (AREA)
  • Marketing (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Health & Medical Sciences (AREA)
  • General Business, Economics & Management (AREA)
  • Economics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a MapReduce-based method for mining triangles in massive social network data. The method comprises: converting a social network into a graph structure in which each user is a node and each user relationship is an edge; sorting the nodes in ascending order of node degree and node name; dividing the nodes into two parts by degree, one part with degree smaller than k and the other with degree not smaller than k; and mining the triangles through two rounds of MapReduce. In the first round of MapReduce, for a node whose degree is not smaller than k, the neighbor nodes whose degree is smaller than k are not loaded into Reduce memory. This solves the problem that traditional MapReduce cannot mine triangles in massive data because the degrees of some nodes are too high.

Description

MapReduce-based method for mining triangles in massive social network data
Technical field
The present invention relates to intelligent information processing and data mining, and specifically to a method that uses MapReduce to mine the triangles in a massive graph and to count them.
Background art
With the development and spread of social networks, their influence on people's daily lives keeps growing, and so does the amount of application work and research devoted to them. In this research and application work, a social network is usually first converted into a graph to build a social network model, which makes further research and application easier. Typically, a user in the social network is a node in the graph, and a relationship between users is an edge in the graph.
In a graph, a triangle consists of three nodes in which every pair of nodes is joined by an edge. The study of triangles has both theoretical and practical significance for social network research. In theory, quantities such as the information-transfer efficiency, the clustering coefficient, and the connectivity of a social network are directly related to the number of triangles. In practice, triangles are used for community discovery, for identifying fake users in a social network (such as "zombie followers" in microblog networks), and for friend recommendation. Mining the triangles in a graph therefore plays an important role both in theory and in practical applications.
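The triangle definition above can be made concrete with a small in-memory sketch. It is only an illustration of the definition, not the method of the invention: the function name `triangles` and the four-node edge list are assumptions made for this example, and the brute-force scan is practical only for graphs that fit in memory.

```python
from itertools import combinations


def triangles(edges):
    """Return every triangle (a, b, c) with a < b < c in an undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    found = []
    # Three nodes form a triangle when every pair of them shares an edge.
    for a, b, c in combinations(sorted(adj), 3):
        if b in adj[a] and c in adj[a] and c in adj[b]:
            found.append((a, b, c))
    return found


# Hypothetical toy graph: a, b, c are mutually connected; d hangs off c.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
print(triangles(edges))
```

Because it tests every node triple, this check costs on the order of n^3 operations, which is why a distributed method is needed for massive graphs.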
A social network is usually enormous. For example, Twitter, the world's largest microblogging community, has no fewer than 500 million registered users; Sina Weibo, China's largest microblogging community, also has more than 500 million registered users. Once converted into a graph, a social network is therefore a huge graph structure and constitutes massive data. Because the data are so large, the whole graph cannot be loaded into memory to mine its triangles. In addition, a social network is usually a scale-free network in which some nodes have very high degree. How to mine triangles from such massive data has become an important technical problem.
At present, methods for mining triangles in massive graph data fall into three main classes. 1. Estimation algorithms. An estimation algorithm usually removes nodes and edges from the massive graph data with some probability to form a sufficiently small graph, mines the number of triangles in that small graph, and finally estimates the number of triangles in the large graph. Such a method can only estimate the number of triangles roughly; it identifies neither the specific triangles nor their exact number, so its practical value is limited. 2. External-memory partitioning. This method splits the graph in external storage into multiple subgraphs that each fit in memory, and then mines in memory the triangles within each subgraph and between subgraphs. Because it requires a large number of I/O operations, it is inefficient. 3. Parallel computing. This method distributes the graph mining task over different computers in some way to complete the triangle mining.
In recent years, as parallel computing has matured and MapReduce has risen, applications based on parallel computing have become increasingly widespread. MapReduce, proposed by Google in 2004, is the de facto standard framework for parallel computing and is now widely used.
The document "Graph twiddling in a MapReduce world" (Cohen, 2009, Computing in Science & Engineering) was the first to propose a MapReduce-based method for triangle mining. However, the memory space complexity of that method is O(d_max) ≈ O(n), where d_max and n are the maximum node degree in the graph and the total number of nodes, respectively, and the degree of a node is the number of its adjacent nodes. When the degree of some node is so large that the data related to that node cannot be loaded into memory, the algorithm cannot run; it is therefore unsuitable for huge scale-free networks such as social networks. The document "Counting triangles and the curse of the last reducer" (Suri and Vassilvitskii, 2011, Proceedings of the 20th international conference on World wide web) proposes the MapReduce-based GP algorithm. The GP algorithm partitions the graph into k subgraphs and mines the triangles in every combination of three of the k subgraphs, thereby mining all triangles of the whole large graph. The memory space complexity of the GP algorithm is O(m/k^2) and its external space complexity is O(km), where m is the total number of edges in the graph. The document "An efficient MapReduce algorithm for counting triangles in a very large graph" (Ha-Myung and Chin-Wan, 2013, Proceedings of the 22nd ACM international conference on Conference on information & knowledge management) improves the GP algorithm and proposes the TTP algorithm. Compared with the GP algorithm, the TTP algorithm achieves a linear improvement in both memory space complexity and external space complexity. Although the GP and TTP algorithms both use MapReduce to mine triangles in massive graph data, they run inefficiently and their triangle output contains a large amount of repetition.
The method of Cohen et al. solves the triangle mining problem for some massive graph data and is more efficient than the GP and TTP algorithms, but it is not suitable for mining triangles in graphs that contain high-degree nodes. After analyzing the shortcomings of the method of Cohen et al., the inventors divide the nodes of the graph into two kinds, those with degree not less than a threshold k and those with degree less than k, process the two kinds separately, and thereby arrive at the present invention.
Summary of the invention
The object of the present invention is to mine the triangles in a social network for applications such as community discovery, fake-user identification, and friend recommendation. It relates specifically to a MapReduce-based method for mining triangles in massive social network data. The method comprises the following steps:
1. Convert the social network into a graph and store it as a list of edges;
2. Compute the degree of each node in the graph;
3. Sort the nodes by degree and name and number them in that order, so that a node with a smaller degree has a lower sequence number;
4. Divide the nodes into two parts: L = {nodes with degree less than k} and H = {nodes with degree not less than k};
5. Use MapReduce to mine the triangles.
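Steps 2 to 4 can be sketched on a single machine as follows. This is a minimal illustration under stated assumptions: the function name `preprocess`, the user names, and the threshold value are hypothetical, the edge list is assumed to contain no duplicate edges or self-loops, and in a real deployment the degree count would itself be a distributed job.

```python
from collections import Counter


def preprocess(edges, k):
    """Degrees, sequence numbers by (degree, name), and the L/H split at k."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # Step 3: smaller degree first, ties broken by node name.
    ordered = sorted(degree, key=lambda node: (degree[node], node))
    rank = {node: i + 1 for i, node in enumerate(ordered)}
    # Step 4: L holds degree < k, H holds degree >= k.
    low = {node for node in degree if degree[node] < k}
    high = set(degree) - low
    return dict(degree), rank, low, high


# Hypothetical toy network stored as an edge list (step 1).
edges = [("ann", "bob"), ("ann", "cat"), ("ann", "dan"), ("bob", "cat")]
degree, rank, low, high = preprocess(edges, k=2)
print(degree, rank, low, high)
```

Because the sequence numbers are assigned in degree order, every node of L receives a lower number than every node of H, a property the two MapReduce rounds rely on.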
The present invention uses two rounds of MapReduce to mine the triangles.
First round of MapReduce: for any edge (u, v) input to Map, if the degrees of nodes u and v are both less than k or both not less than k, that is, u and v belong to the same part, Map outputs two records <u; v> and <v; u>; otherwise it outputs only one record, whose key is whichever of u and v has the smaller sequence number and whose value is the one with the larger sequence number. The input to the Reduce stage is <u; S>. If u belongs to L, S is the set of all adjacent nodes of u; if u belongs to H, S is the set of adjacent nodes of u whose degree is not less than k. Reduce first traverses every pair (v, w) in S for which u is not larger than both v and w, and outputs the triple formed by u, v, and w arranged by sequence number from small to large. If u belongs to H, then for every node v in S whose sequence number is greater than that of u, Reduce traverses all elements w of L; since the sequence numbers satisfy w < u < v, it outputs the triple of u, v, and w as <(w, u); v>. Every triple is output with the two nodes of smaller sequence number as the key and the node of largest sequence number as the value.
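The first round can be simulated in memory as below. This is a sketch under assumptions, not the patent's own listing: the function names, the in-memory shuffle in `run_round1`, and the four-node graph are hypothetical, and node identifiers are taken to equal their sequence numbers with L = {1, 2} and H = {3, 4}.

```python
from collections import defaultdict
from itertools import combinations


def map_round1(edge, rank, low):
    u, v = edge
    if (u in low) == (v in low):       # same part: emit both directions
        yield u, v
        yield v, u
    else:                              # different parts: key is the smaller number
        a, b = sorted((u, v), key=rank.get)
        yield a, b


def reduce_round1(u, neighbours, rank, low):
    s = sorted(neighbours, key=rank.get)
    for v, w in combinations(s, 2):    # rank[v] < rank[w] by construction
        if rank[u] < rank[w]:          # u is not the largest of the three
            a, b, c = sorted((u, v, w), key=rank.get)
            yield (a, b), c
    if u not in low:                   # u in H: pair each larger v with all of L
        for v in s:
            if rank[v] > rank[u]:
                for w in sorted(low, key=rank.get):
                    yield (w, u), v


def run_round1(edges, rank, low):
    groups = defaultdict(list)         # stands in for the shuffle phase
    for edge in edges:
        for key, value in map_round1(edge, rank, low):
            groups[key].append(value)
    out = []
    for u in sorted(groups, key=rank.get):
        out.extend(reduce_round1(u, groups[u], rank, low))
    return out


# Hypothetical graph: degrees are 2, 2, 3, 3, so with k = 3 nodes 1, 2 fall in L.
edges = [(1, 3), (1, 4), (2, 4), (3, 4), (2, 3)]
rank = {1: 1, 2: 2, 3: 3, 4: 4}
low = {1, 2}
print(run_round1(edges, rank, low))
```

In this toy run the triples <(1, 3); 4> and <(2, 3); 4> are each emitted twice, once from a node of L and once from the scan over L performed for node 3.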
Second round of MapReduce: for any triple <(u, v); w> ordered by sequence number from small to large, Map outputs it unchanged. The input to the Reduce stage is <(u, v); S>. Reduce traverses the elements w of S; if an element occurs exactly twice, then u, v, and w form a triangle, and (u, v, w) is output.
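A minimal sketch of the second round follows. The names `reduce_round2` and `run_round2` and the sample triples are assumptions made for illustration; the grouping dictionary stands in for the shuffle that a MapReduce framework would perform.

```python
from collections import Counter, defaultdict


def reduce_round2(key, values):
    """Emit (u, v, w) for every w that occurs exactly twice under key (u, v)."""
    u, v = key
    for w, count in Counter(values).items():
        if count == 2:
            yield (u, v, w)


def run_round2(triples):
    groups = defaultdict(list)
    for key, w in triples:             # the Map stage is the identity
        groups[key].append(w)
    found = []
    for key in sorted(groups):
        found.extend(reduce_round2(key, groups[key]))
    return sorted(found)


# Hypothetical first-round output: only <(1, 3); 4> appears twice.
triples = [((1, 3), 4), ((2, 3), 4), ((1, 3), 4), ((1, 2), 3)]
print(run_round2(triples))
```

Each reducer only needs the values that share one key (u, v), so its memory use is bounded by the number of triples emitted for that pair.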
After the social network has been converted into a graph and stored, the nodes of the graph are numbered in order of degree and divided by degree into two parts that are processed separately. This allows the present invention to handle triangle mining on very large social networks.
Social networks are scale-free networks whose degrees follow a power-law distribution, i.e. $p_k \sim k^{-\alpha}$. There are many models of the power-law distribution; the present invention uses the continuous power-law model to demonstrate the advantage of the disclosed method, and similar conclusions can be drawn with other power-law models. The proportion of nodes with degree k can therefore be expressed as $p_k = \int_k^{k+1} C x^{-\alpha}\,dx$. Since $\sum_{k=1}^{\infty} p_k = \int_1^{\infty} C x^{-\alpha}\,dx = C/(\alpha-1) = 1$, it follows that $C = \alpha - 1$. The proportion of nodes with degree not less than k is then $|H| = \int_k^{\infty} (\alpha-1) x^{-\alpha}\,dx = k^{-\alpha+1}$, and the proportion with degree less than k is $|L| = 1 - |H| = 1 - k^{-\alpha+1}$. In the Reduce stage of the first round of MapReduce, a node u belonging to L occupies memory no larger than the degree of u, i.e. less than k; a node u belonging to H occupies memory no larger than the number of nodes in H. The method is therefore optimal when $k = n \times |H| = n k^{-\alpha+1}$, where n is the total number of nodes in the graph, which gives $k = n^{1/\alpha}$ and a memory space complexity of $O(n^{1/\alpha})$. Similarly, it can be shown that the memory space complexity of the second round of MapReduce is also $O(n^{1/\alpha})$. The memory space complexity of the disclosed method is therefore $O(n^{1/\alpha})$, independent of the number of edges. Extensive empirical data and theoretical derivation show that $\alpha \approx 3$ in social networks. For a PB-scale ($10^{15}$) social network, the memory occupied in the optimal case is therefore on the order of $10^5$ (MB scale), so even very inexpensive computers are sufficient.
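The balance point can be checked numerically. The sketch below assumes the continuous power-law model used in the text, with the values n = 10^15 and α = 3 taken from the paragraph above; the function names are hypothetical.

```python
def optimal_threshold(n, alpha):
    """Threshold k that balances the two memory bounds: k = n ** (1 / alpha)."""
    return n ** (1.0 / alpha)


def high_fraction(k, alpha):
    """Fraction of nodes with degree >= k under the continuous model."""
    return k ** (1.0 - alpha)


n, alpha = 10**15, 3.0
k = optimal_threshold(n, alpha)
nodes_in_high = n * high_fraction(k, alpha)
print(f"k = {k:.3g}, n * |H| = {nodes_in_high:.3g}")
```

At this threshold the bound for a node of L (less than k) and the bound for a node of H (at most n × |H|) coincide up to floating-point rounding, which is the condition the text calls optimal.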
In summary, the present invention has the following advantages:
1. The memory required is independent of the number of edges of the social network and depends only on the number of nodes;
2. The memory complexity is low, so social networks with massive data can be handled.
Brief description of the drawings
Fig. 1 is an example social network diagram for an embodiment of the present invention.
Embodiment
Fig. 1 shows an example social network for an embodiment of the present invention. The example assumes a social network with 6 users and 9 user relationships, which forms a graph structure with 6 nodes and 9 edges after conversion. The nodes are numbered by degree from small to large, and MapReduce is then used to mine the triangles. In this example the adjacent nodes of each node are: N(1) = {6}, N(2) = {5, 6}, N(3) = {4, 5, 6}, N(4) = {2, 3, 5, 6}, N(5) = {2, 3, 4, 6}, and N(6) = {1, 2, 3, 4, 5}. With a degree threshold of 4, the nodes of the graph are divided into two parts: L = {1, 2, 3} and H = {4, 5, 6}.
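The partition in this example can be reproduced from the neighbor lists exactly as they are printed. The sketch below is an illustration only: the variable names are hypothetical, and the degree of a node is taken as the length of its printed list. The last line is a consistency check, which shows that node 2 is listed in N(4) while node 4 is not listed in N(2).

```python
# Neighbor lists copied as printed in the example.
neighbours = {
    1: {6},
    2: {5, 6},
    3: {4, 5, 6},
    4: {2, 3, 5, 6},
    5: {2, 3, 4, 6},
    6: {1, 2, 3, 4, 5},
}
k = 4  # degree threshold used in the example

degree = {node: len(adj) for node, adj in neighbours.items()}
low = sorted(node for node in degree if degree[node] < k)
high = sorted(node for node in degree if degree[node] >= k)

# Pairs (u, v) where v is listed under N(u) but u is missing from N(v).
asymmetric = sorted(
    (u, v) for u, adj in neighbours.items() for v in adj if u not in neighbours[v]
)
print(low, high, asymmetric)
```

The remaining steps of the example use <2; {5, 6}> as the Reduce input for node 2, so they treat nodes 2 and 4 as not adjacent while keeping node 4 in H.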
In the first round of MapReduce, the Map stage outputs all adjacent nodes of nodes 1, 2, and 3, and for nodes 4, 5, and 6 outputs only their adjacent nodes in H. For the nodes in L, the inputs to the Reduce stage are therefore <1; {6}>, <2; {5, 6}>, and <3; {4, 5, 6}>. Reduce traverses every pair of adjacent nodes for which the current node's sequence number is not the largest of the three, arranges the three nodes in order of sequence number, and outputs the triple. The outputs are, respectively: empty; <(2, 5); 6>; and <(3, 4); 5>, <(3, 4); 6>, <(3, 5); 6>. For the nodes in H, the inputs to the Reduce stage are <4; {5, 6}>, <5; {4, 6}>, and <6; {4, 5}>. Reduce again traverses every pair of adjacent nodes for which the current node's sequence number is not the largest and outputs the ordered triple; the outputs are <(4, 5); 6>, <(4, 5); 6>, and empty. Then, for every adjacent node whose sequence number is greater than that of the current node, Reduce traverses all nodes in L and outputs the resulting triples. The outputs are, respectively: <(1, 4); 5>, <(2, 4); 5>, <(3, 4); 5>, <(1, 4); 6>, <(2, 4); 6>, <(3, 4); 6>; then <(1, 5); 6>, <(2, 5); 6>, <(3, 5); 6>; and empty.
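The Reduce outputs listed above can be reproduced from the Reduce inputs given in the text. In this sketch node identifiers double as sequence numbers, the partition L = {1, 2, 3}, H = {4, 5, 6} is taken as given, and the helper names `wedge_triples` and `low_scan_triples` are hypothetical.

```python
from itertools import combinations

LOW, HIGH = [1, 2, 3], [4, 5, 6]
# Reduce inputs <u; S> exactly as listed in the example.
reduce_inputs = {1: [6], 2: [5, 6], 3: [4, 5, 6], 4: [5, 6], 5: [4, 6], 6: [4, 5]}


def wedge_triples(u, s):
    """Pairs (v, w) from S for which u is not the largest of the three nodes."""
    out = []
    for v, w in combinations(sorted(s), 2):   # v < w
        if u < w:
            a, b, c = sorted((u, v, w))
            out.append(((a, b), c))
    return out


def low_scan_triples(u, s):
    """For u in H: each larger neighbour v in S combined with every w in L."""
    return [((w, u), v) for v in sorted(s) if v > u for w in LOW]


for u in LOW + HIGH:
    emitted = wedge_triples(u, reduce_inputs[u])
    if u in HIGH:
        emitted += low_scan_triples(u, reduce_inputs[u])
    print(u, emitted)
```

Node 6 emits nothing in either step because it has the largest sequence number in every pair and has no larger neighbor.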
In the second round of MapReduce, the Map stage outputs its input unchanged. In the Reduce stage, the elements in the value of each input are traversed; if an element occurs exactly twice, the key and that element form a triangle, and the result is output. The inputs received by the Reduce stage are: <(2, 5); {6, 6}>, <(3, 4); {5, 5, 6, 6}>, <(3, 5); {6, 6}>, <(1, 4); {5, 6}>, <(2, 4); {5, 6}>, <(4, 5); {6, 6}>, and <(1, 5); {6}>. The triangles finally found are therefore (2, 5, 6), (3, 4, 5), (3, 4, 6), (3, 5, 6), and (4, 5, 6).
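The second-round count can be checked against the fifteen triples emitted in the first round of the example. This sketch copies those triples by hand (the variable names are hypothetical) and applies the rule that a triple seen exactly twice is a triangle. On this data the rule yields five triangles, including (3, 5, 6), because <(3, 5); 6> is emitted once for node 3 and once in the scan over L for node 5.

```python
from collections import Counter

# First-round output of the example, copied by hand.
round1_output = [
    ((2, 5), 6),                                        # from node 2
    ((3, 4), 5), ((3, 4), 6), ((3, 5), 6),              # from node 3
    ((4, 5), 6),                                        # from node 4 (pairs)
    ((4, 5), 6),                                        # from node 5 (pairs)
    ((1, 4), 5), ((2, 4), 5), ((3, 4), 5),              # node 4, scan over L, v = 5
    ((1, 4), 6), ((2, 4), 6), ((3, 4), 6),              # node 4, scan over L, v = 6
    ((1, 5), 6), ((2, 5), 6), ((3, 5), 6),              # node 5, scan over L, v = 6
]

counts = Counter(round1_output)
# A triple that occurs exactly twice is reported as a triangle.
found = sorted(key + (w,) for (key, w), c in counts.items() if c == 2)
print(found)
```

Counting whole triples is equivalent to grouping by key and counting each value, since the key and value together identify the triple.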
The pseudocode of the two rounds of MapReduce of the method disclosed in the present invention is as follows:
The above is only a specific embodiment of the present invention, but the scope of protection of the invention is not limited to it. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. The scope of protection of the invention shall therefore be determined by the scope of protection of the claims.

Claims (2)

1. A MapReduce-based method for mining triangles in massive social network data, characterized in that the method comprises the following steps:
I. converting the social network into a graph and storing it as a list of edges;
II. computing the degree of each node in the graph;
III. sorting the nodes by degree and name and numbering them in that order, a node with a smaller degree having a lower sequence number;
IV. dividing the nodes into two parts: L = {nodes with degree less than k} and H = {nodes with degree not less than k};
V. using MapReduce to mine the triangles.
2. The use of MapReduce to mine triangles as claimed in claim 1, characterized in that two rounds of MapReduce are used to complete the triangle mining:
First round of MapReduce: for any edge (u, v) input to Map, if the degrees of nodes u and v are both less than k or both not less than k, that is, u and v belong to the same part, two records <u; v> and <v; u> are output; otherwise only one record is output, whose key is whichever of u and v has the smaller sequence number and whose value is the one with the larger sequence number; the input to the Reduce stage is <u; S>; if u belongs to L, S is the set of all adjacent nodes of u; if u belongs to H, S is the set of adjacent nodes of u whose degree is not less than k; Reduce first traverses every pair (v, w) in S for which u is not larger than both v and w, and outputs the triple formed by u, v, and w arranged by sequence number from small to large; if u belongs to H, then for every node v in S whose sequence number is greater than that of u, Reduce traverses all elements w of L, and since the sequence numbers satisfy w < u < v, outputs the triple of u, v, and w as <(w, u); v>; every triple is output with the two nodes of smaller sequence number as the key and the node of largest sequence number as the value;
Second round of MapReduce: for any triple <(u, v); w> ordered by sequence number from small to large, Map outputs it unchanged; the input to the Reduce stage is <(u, v); S>; Reduce traverses the elements w of S, and if an element occurs exactly twice, then u, v, and w form a triangle, and (u, v, w) is output.
CN201410539880.XA 2014-10-13 2014-10-13 Map Reduce based method for mining triangles in massive social network data Pending CN104268270A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410539880.XA CN104268270A (en) 2014-10-13 2014-10-13 Map Reduce based method for mining triangles in massive social network data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410539880.XA CN104268270A (en) 2014-10-13 2014-10-13 Map Reduce based method for mining triangles in massive social network data

Publications (1)

Publication Number Publication Date
CN104268270A true CN104268270A (en) 2015-01-07

Family

ID=52159791

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410539880.XA Pending CN104268270A (en) 2014-10-13 2014-10-13 Map Reduce based method for mining triangles in massive social network data

Country Status (1)

Country Link
CN (1) CN104268270A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108737185A (en) * 2018-05-23 2018-11-02 哈尔滨工业大学 A kind of triangle count method and device in datagram stream based on random sampling
CN109753598A (en) * 2019-01-02 2019-05-14 桑葛楠 A kind of online user's social networks independence recognizer based on equilateral triangle transform method
WO2021212812A1 (en) * 2020-04-22 2021-10-28 浙江工商大学 Method for mining cohesive subgraph in symbol network on the basis of cluster attribute and balance theory

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102724219A (en) * 2011-03-29 2012-10-10 国际商业机器公司 A network data computer processing method and a system thereof
US20130124504A1 (en) * 2011-11-14 2013-05-16 Google Inc. Sharing Digital Content to Discovered Content Streams in Social Networking Services
CN103457800A (en) * 2013-09-08 2013-12-18 西安电子科技大学 Network community detection method based on M elite coevolution strategy
CN103942308A (en) * 2014-04-18 2014-07-23 中国科学院信息工程研究所 Method and device for detecting large-scale social network communities

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
U.KANG ET AL.: "HEigen:Spectral Analysis for Billion-Scale Graphs", 《IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING》 *
何忠育: "分布式社会网络分析支撑系统研究与应用", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Similar Documents

Publication Publication Date Title
CN102722709B (en) Method and device for identifying garbage pictures
CN109710701A (en) A kind of automated construction method for public safety field big data knowledge mapping
CN103838863B (en) A kind of big data clustering algorithm based on cloud computing platform
CN104268271A (en) Interest and network structure double-cohesion social network community discovering method
CN102810113B (en) A kind of mixed type clustering method for complex network
CN105721279B (en) A kind of the relationship cycle method for digging and system of subscribers to telecommunication network
Xu et al. Mobile cellular big data: Linking cyberspace and the physical world with social ecology
CN105893382A (en) Priori knowledge based microblog user group division method
CN105069025A (en) Intelligent aggregation visualization and management and control system for big data
CN105279187A (en) Edge clustering coefficient-based social network group division method
CN111159184B (en) Metadata tracing method and device and server
CN105183770A (en) Chinese integrated entity linking method based on graph model
CN103942308A (en) Method and device for detecting large-scale social network communities
CN103530402A (en) Method for identifying microblog key users based on improved Page Rank
CN105893381A (en) Semi-supervised label propagation based microblog user group division method
CN101667201A (en) Integration method of Deep Web query interface based on tree merging
CN106294715A (en) A kind of association rule mining method based on attribute reduction and device
CN105335438A (en) Local shortest loop based social network group division method
CN105678590A (en) topN recommendation method for social network based on cloud model
CN104317904A (en) Generalization method for weighted social network
CN104700311B (en) A kind of neighborhood in community network follows community discovery method
CN116522272A (en) Multi-source space-time data transparent fusion method based on urban information unit
CN104268270A (en) Map Reduce based method for mining triangles in massive social network data
CN105069290A (en) Parallelization critical node discovery method for postal delivery data
CN103793747A (en) Sensitive information template construction method in network content safety management

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150107