CN108596444B - Method and device for sampling large-scale social network users based on diversified strategies - Google Patents

Method and device for sampling large-scale social network users based on diversified strategies

Info

Publication number
CN108596444B
Authority
CN
China
Prior art keywords
attribute
users
user
representative
social network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810284916.2A
Other languages
Chinese (zh)
Other versions
CN108596444A (en
Inventor
桑维
唐杰
刘德兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201810284916.2A priority Critical patent/CN108596444B/en
Publication of CN108596444A publication Critical patent/CN108596444A/en
Application granted granted Critical
Publication of CN108596444B publication Critical patent/CN108596444B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • Theoretical Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Physics & Mathematics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Educational Administration (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Game Theory and Decision Science (AREA)
  • Development Economics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and a device for sampling large-scale social network users based on a diversified strategy. The method comprises the following steps: extracting a plurality of user representatives through a utility function; dividing the user representatives into a plurality of attribute groups according to the attributes of each user representative, to obtain a model of the attribute group representative degree; obtaining the maximum value of the utility function so as to select representative users from the plurality of attribute groups; and, according to the representative users, selecting the group with the worst representative degree by diversified strategy sampling. The method effectively reduces the data scale of the network so that the data become tractable to process, helps remove unrepresentative users so that research can concentrate on the more valuable user groups in the network, improves sampling accuracy, and is efficient in time complexity.

Description

Method and device for sampling large-scale social network users based on diversified strategies
Technical Field
The invention relates to the technical field of computer network information, in particular to a method and a device for sampling large-scale social network users based on a diversification strategy.
Background
At present, how to find a subset of users in a large-scale network that statistically represents the whole network is an important problem in social network analysis. A solution can be used in many applications, such as recommending academic information on a WeChat public platform and recommending friends in a social network. Sampling users from a large-scale network is, in theory, an NP (non-deterministic polynomial) problem. Some strategies for selecting representative users have been proposed in the prior art, but no specific formulations have been given for the different sampling strategies.
In the related art, statistical stratified sampling takes the importance of users into account so that the distribution of each attribute in the sample is as consistent as possible with that of the population; Grounded Theory is a sampling strategy that emphasizes diversity; and other work studies a strategy, similar to a political election, in which representatives are chosen by the user population. However, the related art contains little research dedicated to the user sampling problem in social networks, considers only the similarity between nodes, and has difficulty explaining which users each of the selected representatives actually represents.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, an object of the present invention is to provide a method for sampling users in a large-scale social network based on a diversification strategy, which can effectively improve the accuracy of sampling and is very efficient in terms of time complexity.
Another objective of the present invention is to provide a device for large-scale social network user sampling based on a diversification strategy.
In order to achieve the above object, an embodiment of one aspect of the present invention provides a method for sampling large-scale social network users based on a diversification strategy, comprising the following steps: extracting a plurality of user representatives through a utility function; dividing the user representatives into a plurality of attribute groups according to the attributes of each user representative, to obtain a model of the attribute group representative degree; obtaining the maximum value of the utility function so as to select representative users from the plurality of attribute groups; and, according to the representative users, selecting the group with the worst representative degree by diversified strategy sampling.
According to the method of the embodiment of the present invention, the utility function takes the diversity of the attributes into account and shows which attribute groups the selected users contribute to and how large each contribution is. The method thereby effectively reduces the data scale of the network so that the data become tractable to process, helps remove unrepresentative users so that research can concentrate on the more valuable user groups in the network, effectively improves sampling accuracy, and is efficient in time complexity.
In addition, the method for large-scale social network user sampling based on the diversification strategy according to the above embodiment of the present invention may further have the following additional technical features:
Further, in one embodiment of the present invention, the utility function is:
$$Q(G, X, \mathcal{G}, T) = \min_{(V_l, a_{j_l}) \in \mathcal{G}} \lambda_l \, P(T, l)$$
where $G = (V, E)$ denotes a social network, $V$ is a set of points containing $|V| = N$ users, $E \subseteq V \times V$ is an edge set containing $|E| = M$ user relationships, $X \in \mathbb{R}^{N \times d}$ is the attribute matrix, $\mathcal{G}$ is the set of attribute groups, $T$ is a subset of users, $\lambda_l$ is a positive number related to the size of the attribute group $(V_l, a_{j_l})$ with default value $\lambda_l = |V_l|^{-1}$, $P(T, l)$ is the representative degree of the user subset $T$ for the attribute group $(V_l, a_{j_l})$, $V_l$ is the set of all users in attribute group $l$, and $a_{j_l}$ is an attribute.
Further, in one embodiment of the invention, the model of the attribute group representative degree is:
$$P(T, l) = \sum_{v_i \in V_l} R(T, v_i, a_{j_l})$$
where $R(T, v_i, a_{j_l})$ is the representative degree of the user subset $T$ for user $v_i$ on the specific attribute $a_{j_l}$, with values in $[0, 1]$. By default, $R(T, v_i, a_{j_l}) = 1$ when a node in $T$ has an edge connected to $v_i$, and $R(T, v_i, a_{j_l}) = 0$ otherwise.
Further, in one embodiment of the invention, if $P(T, l) > 0$ for every attribute group $l$, then all attribute groups are represented, and $P(T, l)$ is kept relatively balanced across the attribute groups so that no attribute group is represented too strongly or too weakly.
In order to achieve the above object, an embodiment of another aspect of the present invention provides a device for sampling large-scale social network users based on a diversification strategy, comprising: an extraction module for extracting a plurality of user representatives through a utility function; a grouping module for dividing the user representatives into a plurality of attribute groups according to the attributes of each user representative, to obtain a model of the attribute group representative degree; an obtaining module for obtaining the maximum value of the utility function so as to select representative users from the plurality of attribute groups; and a processing module for selecting, according to the representative users, the group with the worst representative degree by diversified strategy sampling.
According to the device of the embodiment of the present invention, the utility function takes the diversity of the attributes into account and shows which attribute groups the selected users contribute to and how large each contribution is. The device thereby effectively reduces the data scale of the network so that the data become tractable to process, helps remove unrepresentative users so that research can concentrate on the more valuable user groups in the network, effectively improves sampling accuracy, and is efficient in time complexity.
In addition, the device for large-scale social network user sampling based on the diversification strategy according to the above embodiment of the present invention may further have the following additional technical features:
Further, in one embodiment of the present invention, the utility function is:
$$Q(G, X, \mathcal{G}, T) = \min_{(V_l, a_{j_l}) \in \mathcal{G}} \lambda_l \, P(T, l)$$
where $G = (V, E)$ denotes a social network, $V$ is a set of points containing $|V| = N$ users, $E \subseteq V \times V$ is an edge set containing $|E| = M$ user relationships, $X \in \mathbb{R}^{N \times d}$ is the attribute matrix, $\mathcal{G}$ is the set of attribute groups, $T$ is a subset of users, $\lambda_l$ is a positive number related to the size of the attribute group $(V_l, a_{j_l})$ with default value $\lambda_l = |V_l|^{-1}$, $P(T, l)$ is the representative degree of the user subset $T$ for the attribute group $(V_l, a_{j_l})$, $V_l$ is the set of all users in attribute group $l$, and $a_{j_l}$ is an attribute.
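Further, in one embodiment of the invention, the model of the attribute group representative degree is:
$$P(T, l) = \sum_{v_i \in V_l} R(T, v_i, a_{j_l})$$
where the default value of $\lambda_l$ is $|V_l|^{-1}$, $P(T, l)$ is the representative degree of the user subset $T$ for the attribute group $(V_l, a_{j_l})$, $R(T, v_i, a_{j_l})$ is the representative degree of $T$ for user $v_i$ on attribute $a_{j_l}$, $V_l$ is the set of all users in attribute group $l$, and $a_{j_l}$ is an attribute.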
Further, in one embodiment of the invention, if $P(T, l) > 0$ for every attribute group $l$, then all attribute groups are represented, and $P(T, l)$ is kept relatively balanced across the attribute groups so that no attribute group is represented too strongly or too weakly.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow diagram of a method for large-scale social network user sampling based on a diversification strategy according to one embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a device for large-scale social network user sampling based on a diversification strategy according to an embodiment of the invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements, or elements having the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary, are intended to explain the invention, and are not to be construed as limiting the invention.
The method and apparatus for sampling users of a large scale social network based on a diversification strategy according to an embodiment of the present invention will be described with reference to the accompanying drawings.
FIG. 1 is a flow diagram of a method for large-scale social network user sampling based on a diversification strategy in accordance with one embodiment of the present invention.
As shown in FIG. 1, the method for sampling the users of the large-scale social network based on the diversification strategy comprises the following steps:
In step S101, a plurality of user representatives are extracted by the utility function.
It can be understood that existing sampling methods, such as simple random sampling, sampling based on graph traversal, and sampling based on random walks, are very meaningful for large-scale user sampling. The extraction of user representatives proposed by the embodiment of the present invention additionally takes social influence into consideration, and a utility function is proposed to evaluate the selected user representatives.
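For orientation, the two simplest sampling methods named above can be sketched as follows. This is a minimal illustration of the generic techniques, not code from the patent; the function names, the `seed` parameter, and the `max_steps` cap are assumptions of this sketch.

```python
import random


def simple_random_sample(users, k, seed=0):
    """Simple random sampling: draw k distinct users uniformly at random."""
    rng = random.Random(seed)
    return set(rng.sample(list(users), k))


def random_walk_sample(neighbors, start, k, seed=0, max_steps=10_000):
    """Random-walk sampling: walk along edges and keep the first k distinct users.

    neighbors maps each user to the set of users it is connected to.
    max_steps is a hypothetical safety cap so the walk always terminates.
    """
    rng = random.Random(seed)
    visited = {start}
    current = start
    for _ in range(max_steps):
        if len(visited) >= k:
            break
        options = sorted(neighbors[current])  # sorted for reproducibility
        if not options:
            break  # isolated node: the walk cannot continue
        current = rng.choice(options)
        visited.add(current)
    return visited
```

Neither baseline looks at user attributes or at which users a sample represents, which is the gap the utility function is meant to close.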
In one embodiment of the invention, the utility function is:
$$Q(G, X, \mathcal{G}, T) = \min_{(V_l, a_{j_l}) \in \mathcal{G}} \lambda_l \, P(T, l)$$
where $G = (V, E)$ denotes a social network, $V$ is a set of points containing $|V| = N$ users, $E \subseteq V \times V$ is an edge set containing $|E| = M$ user relationships, $X \in \mathbb{R}^{N \times d}$ is the attribute matrix, $\mathcal{G}$ is the set of attribute groups, $T$ is a subset of users, $\lambda_l$ is a positive number related to the size of the attribute group $(V_l, a_{j_l})$ with default value $\lambda_l = |V_l|^{-1}$, $P(T, l)$ is the representative degree of the user subset $T$ for the attribute group $(V_l, a_{j_l})$, $V_l$ is the set of all users in attribute group $l$, and $a_{j_l}$ is an attribute.
Specifically, embodiments of the present invention primarily aim to extract user representatives using a utility function. The social network may be represented by $G = (V, E)$, where $V$ is a set of points containing $|V| = N$ users and $E \subseteq V \times V$ is an edge set containing $M$ user relationships. In addition, an attribute set $A = \{a_j\}_{j=1 \ldots d}$ is defined, where $d$ is the number of attributes. The embodiment of the invention thus obtains an attribute matrix $X \in \mathbb{R}^{N \times d}$ in which each row $X_i = [x_{ik}]_{k=1 \ldots d}$ corresponds to the attribute set of user $v_i \in V$, and $x_{ik}$ is the value of user $v_i$ on attribute $a_k$.
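As an illustration of this notation, the following minimal sketch stores a toy network as an edge list and the attribute matrix as a list of rows. The four users, the edges, and the attribute codes are hypothetical and are not taken from the patent.

```python
# Toy social network G = (V, E); every value below is hypothetical.
N = 4                          # |V|: users v_0 .. v_3
E = [(0, 1), (1, 2), (2, 3)]   # undirected user relationships, M = |E| = 3

# Attribute matrix X with N rows and d = 2 columns.
# Row i lists the attribute values of user v_i; column k holds attribute a_k.
X = [
    [0, 1],
    [0, 0],
    [1, 1],
    [1, 0],
]


def build_neighbors(num_users, edges):
    """Return a dict mapping each user index to the set of its neighbors."""
    neighbors = {i: set() for i in range(num_users)}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    return neighbors


def attribute_value(matrix, i, k):
    """Return x_ik, the value of user v_i on attribute a_k (0-based indices)."""
    return matrix[i][k]


neighbors = build_neighbors(N, E)
d = len(X[0])
```

Plain lists keep the sketch dependency-free; for a real network a sparse adjacency structure would be the more likely choice.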
Embodiments of the present invention may define a function to express the degree to which a user subset $T$ represents other users. Specifically, given any user subset $T \subseteq V$ and a user $v_i$, a function $R(T, v_i, a_j)$ is defined to denote the degree to which $T$ represents $v_i$ on a particular attribute $a_j$, with values in $[0, 1]$. When $R(T, v_i, a_j) = 1$, user $v_i$ is perfectly represented by the user subset $T$ on attribute $a_j$. In particular, when $T = \emptyset$, $R(T, v_i, a_j) = 0$ for any $v_i$ and $a_j$. The definition of the representative degree function $R$ is very flexible, and other information can conveniently be added to it. For example, to take information in the network into account, a straightforward approach is to treat each user as an attribute and to define a set of neighbors for each node $v_i$.
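A minimal sketch of one possible $R$ follows. It implements the default binary rule stated elsewhere in this description ($R = 1$ when a node in $T$ has an edge connected to $v_i$, otherwise 0). The function name and the `count_self` flag, which lets a selected user represent itself, are assumptions of this sketch and are not specified in the text.

```python
def representative_degree(T, i, neighbors, count_self=True):
    """Binary R(T, v_i, a_j): 1.0 if the subset T represents user i, else 0.0.

    The attribute a_j is not used because the default rule depends only on
    edges. count_self=True is an assumption of this sketch.
    """
    if not T:                      # R is 0 for the empty subset
        return 0.0
    if count_self and i in T:      # assumption: a selected user represents itself
        return 1.0
    # Default rule: some node in T has an edge connected to v_i.
    return 1.0 if any(u in neighbors.get(i, set()) for u in T) else 0.0


# Hypothetical path graph 0 - 1 - 2 - 3.
neighbors = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
```

A graded variant, for example one based on attribute similarity, would return values strictly between 0 and 1; the text allows this but does not prescribe one.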
Based on the above definitions, embodiments of the present invention propose to use the utility function $Q(G, X, \mathcal{G}, T)$ to select user representatives, where the utility function is:
$$Q(G, X, \mathcal{G}, T) = \min_{(V_l, a_{j_l}) \in \mathcal{G}} \lambda_l \, P(T, l)$$
where $\lambda_l$ is a positive number related to the size of the attribute group $(V_l, a_{j_l})$.
In step S102, the plurality of user representatives are divided into a plurality of attribute groups according to the attribute of each user representative of the plurality of user representatives, so as to obtain a model of the degree of attribute group representation.
In one embodiment of the invention, the model of the attribute group representative degree is:
$$P(T, l) = \sum_{v_i \in V_l} R(T, v_i, a_{j_l})$$
where $R(T, v_i, a_{j_l})$ is the representative degree of the user subset $T$ for user $v_i$ on the specific attribute $a_{j_l}$, with values in $[0, 1]$. By default, $R(T, v_i, a_{j_l}) = 1$ when a node in $T$ has an edge connected to $v_i$, and $R(T, v_i, a_{j_l}) = 0$ otherwise.
It can be understood that, when users are selected, they are divided into different groups according to their attributes, and the following model is given for each group:
$$P(T, l) = \sum_{v_i \in V_l} R(T, v_i, a_{j_l})$$
If, for attribute $a_{j_l}$, all users in $V_l$ are perfectly represented by the subset $T$, the attribute group $(V_l, a_{j_l})$ is said to be perfectly represented by $T$.
Specifically, for an attribute group $(V_l, a_{j_l})$ in the attribute group set $\mathcal{G}$, the representative degree of the user subset $T$ for the attribute group $(V_l, a_{j_l})$ is defined as:
$$P(T, l) = \sum_{v_i \in V_l} R(T, v_i, a_{j_l})$$
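The group-level degree can be sketched as follows. Because the formula image is not reproduced in this text, the sketch assumes that $P(T, l)$ is the sum of $R$ over the users of $V_l$, which is consistent with the default weight $|V_l|^{-1}$ acting as a normalizer; the function names and the toy graph are hypothetical.

```python
def representative_degree(T, i, neighbors):
    """Default binary R: 1.0 if i is in T (assumption) or adjacent to a node in T."""
    if not T:
        return 0.0
    if i in T or any(u in neighbors[i] for u in T):
        return 1.0
    return 0.0


def group_representative_degree(T, group_users, neighbors):
    """P(T, l), assumed here to be the sum of R over the users of V_l."""
    return sum(representative_degree(T, i, neighbors) for i in group_users)


def perfectly_represented(T, group_users, neighbors):
    """True when every user of V_l has a representative degree of 1."""
    return group_representative_degree(T, group_users, neighbors) == len(group_users)


# Hypothetical path graph 0 - 1 - 2 - 3 and one attribute group V_l = {0, 1, 2}.
neighbors = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
V_l = [0, 1, 2]
```

With $T = \{1\}$ every member of the group is either selected or adjacent to the selected user, so the group is perfectly represented; with $T = \{3\}$ only user 2 is covered.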
In step S103, a maximum value of the utility function is acquired to select a representative user from the plurality of attribute groups.
It can be understood that maximizing the utility function $Q(G, X, \mathcal{G}, T)$ yields the optimal selection of representative users, so the problem can be converted into:
$$T^{*} = \arg\max_{T \subseteq V,\ |T| = k} Q(G, X, \mathcal{G}, T)$$
Specifically, the following are given: (1) a social network $G = (V, E)$, where $V$ is the set of all users in the network and $E$ is the edge set representing user relationships in the network; (2) a user attribute value matrix $X \in \mathbb{R}^{N \times d}$; (3) an attribute group set $\mathcal{G}$; (4) a number of representative users $k$ and a representative user set $T$; and (5) a utility function $Q$, which quantifies the degree to which the selected representative user set represents the users on all attributes. The original problem can then be transformed into the optimization problem:
$$T^{*} = \arg\max_{T \subseteq V,\ |T| = k} Q(G, X, \mathcal{G}, T)$$
To address the problems in the prior art, the embodiment of the invention finds representative users in a large-scale social network by constructing a representative user selection model that mainly applies diversified strategy sampling, the aim of which is to diversify the selected representatives.
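The text does not say which solver is used for this optimization problem. The sketch below uses a simple worst-group-first greedy heuristic, which is an assumption of this example and not necessarily the patented algorithm: at each step it finds the attribute group with the smallest weighted degree and adds the user that raises that group's degree the most. It also assumes the sum form of $P$, the min form of $Q$, and $\lambda_l = |V_l|^{-1}$; all names and the toy graph are hypothetical.

```python
def R(T, i, neighbors):
    """Default binary representative degree (self-representation assumed)."""
    if not T:
        return 0.0
    return 1.0 if (i in T or any(u in neighbors[i] for u in T)) else 0.0


def P(T, group, neighbors):
    """Group representative degree, assumed to be the sum of R over the group."""
    return sum(R(T, i, neighbors) for i in group)


def Q(T, groups, neighbors):
    """Utility: the smallest weighted group degree, with lambda_l = 1 / |V_l|."""
    return min(P(T, g, neighbors) / len(g) for g in groups)


def greedy_select(users, groups, neighbors, k):
    """Pick k users, each time serving the currently worst represented group."""
    T = set()
    for _ in range(k):
        candidates = [u for u in users if u not in T]
        if not candidates:
            break
        # Group with the smallest lambda_l * P(T, l); ties go to the first group.
        worst = min(groups, key=lambda g: P(T, g, neighbors) / len(g))
        # Candidate that raises the worst group's degree the most.
        best = max(candidates, key=lambda u: P(T | {u}, worst, neighbors))
        T.add(best)
    return T


# Hypothetical network: two star-like clusters, one attribute group per cluster.
neighbors = {0: {1, 2}, 1: {0}, 2: {0}, 3: {4}, 4: {3, 5}, 5: {4}}
groups = [[0, 1, 2], [3, 4, 5]]
selected = greedy_select(list(range(6)), groups, neighbors, k=2)
```

On this toy input the heuristic picks the two cluster centers, so both groups end up fully represented. An exhaustive search over all subsets of size $k$ would be exact but is only feasible for very small networks.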
In step S104, according to the representative users, the group with the worst representative degree is selected by diversified strategy sampling.
Further, in one embodiment of the invention, if $P(T, l) > 0$ for every attribute group $l$, then all attribute groups are represented, and $P(T, l)$ is kept relatively balanced across the attribute groups so that no attribute group is represented too strongly or too weakly.
It will be appreciated that, assuming $P(T, l) > 0$ for every attribute group $l$, that is, all attribute groups are represented to some extent, "diversification" here means that $P(T, l)$ is relatively balanced across the attribute groups, so that no attribute group is represented too strongly or too weakly.
Specifically, embodiments of the present invention use diversified strategy sampling to select the group with the worst representative degree. The user representative selection model of the embodiment is mainly a diversified strategy sampling, whose purpose is to diversify the selected representatives. When the value of $k$ is small, diversification means that the selected representatives come from as many attribute groups as possible. This case is not discussed in the embodiment of the present invention, because the number of representatives selected is generally greater than the number of attribute groups, so selecting one representative from each attribute group is enough to ensure that all attribute groups are covered.
The size of the attribute groups can be critical. For attribute groups with many members, the representative degree is generally relatively large. An aim of embodiments of the present invention is to avoid this as far as possible, so that the values of $P(T, l)$ are relatively balanced across all attribute groups $l$. At the same time, an extreme case needs to be considered: if all attribute groups are the same size and the only goal is to make all $P(T, l)$ as equal as possible, then no selected representative user set is better than the empty set, because for the empty set $P(T, l) = 0$ for every $l$, which is perfectly balanced. This situation must be avoided when representative users are actually selected, so the embodiment of the present invention still requires each attribute group to have a certain representative degree. Thus, for diversified strategy sampling, the utility function is given as follows:
$$Q(G, X, \mathcal{G}, T) = \min_{(V_l, a_{j_l}) \in \mathcal{G}} \lambda_l \, P(T, l)$$
where $\lambda_l$ is a positive number related to the size of the attribute group $(V_l, a_{j_l})$, which can generally be taken as $\lambda_l = |V_l|^{-1}$. The attribute group with the smallest $\lambda_l P(T, l)$ value is referred to as the "group with the worst representative degree". The performance of the utility function depends on how well this group is represented.
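To make the role of the weights concrete, here is a small worked example with hypothetical numbers (two attribute groups; none of the values come from the patent). It assumes that the utility is the smallest weighted representative degree $\lambda_l P(T, l)$ with $\lambda_l = |V_l|^{-1}$.

```latex
\begin{align*}
|V_1| &= 10, & P(T,1) &= 4, & \lambda_1 P(T,1) &= \tfrac{4}{10} = 0.4,\\
|V_2| &= 2,  & P(T,2) &= 1, & \lambda_2 P(T,2) &= \tfrac{1}{2} = 0.5,\\
      &      &        &     & Q &= \min(0.4,\ 0.5) = 0.4 .
\end{align*}
```

Without the weights, group 2 would look worst because $P(T,2) = 1 < P(T,1) = 4$. With them, group 1 is the worst represented group, so adding a user who represents more of $V_1$ is what raises $Q$. For the empty set every $P(T, l)$ is 0 and $Q = 0$, so any set that gives every group a nonzero degree scores higher.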
In addition, the difficulty addressed by the embodiment of the invention is to find an efficient algorithm for the huge and complex data of the Internet, so that large-scale data can be processed efficiently. The significance of the embodiment is that the designed sampling algorithm can keep up with the dynamic nature of the data while ensuring a certain precision.
Sampling is an effective way to reduce the size of data; the basic idea is to replace the original data with a small sample. During actual analysis, only the selected sample needs to be considered; to a certain extent, the selected nodes contribute more to the network than the unselected nodes and are therefore of greater research significance. Based on the idea of sampling, the embodiment of the invention studies how to effectively select user representatives from a large-scale network. In particular, given a large social network, a sampling algorithm is desired that finds a given number of users who together represent as many of the users in the network as possible. In addition, the sampling algorithm should ensure a certain accuracy and a certain efficiency in order to cope with the dynamic nature of the data.
Further, the embodiment of the invention provides a general model for the problem of sampling large-scale social network users. The basic idea is to formulate user sampling as a diversified strategy sampling model, in which the extracted samples come from as many attribute groups as possible, and to design an effective algorithm for it. Experiments show that the method clearly outperforms other algorithms, achieves higher accuracy, and is efficient in time complexity.
In summary, the basic objective of the embodiments of the present invention is to design a diversified strategy sampling model and a sampling algorithm for large-scale network users that effectively reduces the data scale, facilitates subsequent research and analysis, and can be widely applied in data preprocessing. The method first discusses extracting user representatives from a large-scale social network, then defines and explains the attribute group representative degree and the discovery of representative users, then designs a diversified strategy sampling algorithm, and finally applies the newly proposed algorithm to actual data in experiments, whose results show that it clearly outperforms the original baseline algorithm.
According to the method for sampling large-scale social network users based on diversified strategies provided by the embodiment of the invention, the utility function takes the diversity of the attributes into account and shows which attribute groups the selected users contribute to and how large each contribution is. The method thereby effectively reduces the data scale of the network so that the data become tractable to process, helps remove unrepresentative users so that research can concentrate on the more valuable user groups in the network, effectively improves sampling accuracy, and is efficient in time complexity.
Next, a device for sampling users of a large-scale social network based on a diversification strategy according to an embodiment of the present invention will be described with reference to the accompanying drawings.
FIG. 2 is a schematic diagram of a device for large-scale social network user sampling based on a diversification strategy according to an embodiment of the present invention.
As shown in fig. 2, the apparatus 10 for large-scale social network user sampling based on diversified policies comprises: an extraction module 100, a grouping module 200, an acquisition module 300 and a processing module 400.
The extraction module 100 is configured to extract a plurality of user representatives through a utility function. The grouping module 200 is configured to divide the user representatives into a plurality of attribute groups according to the attributes of each user representative, so as to obtain a model of the attribute group representative degree. The obtaining module 300 is configured to obtain the maximum value of the utility function so as to select representative users from the plurality of attribute groups. The processing module 400 is configured to select, according to the representative users, the group with the worst representative degree by diversified strategy sampling. The device 10 of the embodiment of the invention effectively reduces the data scale of the network so that the data become tractable to process, helps remove unrepresentative users so that research can concentrate on the more valuable user groups in the network, effectively improves sampling accuracy, and is efficient in time complexity.
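One way to picture the four modules in software is a single class with one method per module. This layout, the class and method names, and the mapping of methods to module numbers are assumptions of this sketch; the patent describes the modules only functionally. The sketch also assumes the binary default $R$, the sum form of $P$, and $\lambda_l = |V_l|^{-1}$.

```python
class SamplingDevice:
    """Hypothetical software layout mirroring modules 100-400 of the device."""

    def __init__(self, neighbors, groups):
        self.neighbors = neighbors   # adjacency sets of the social network
        self.groups = groups         # attribute groups, each a list of users

    def _r(self, T, i):
        """Default binary representative degree (self-representation assumed)."""
        if not T:
            return 0.0
        return 1.0 if (i in T or any(u in self.neighbors[i] for u in T)) else 0.0

    def group_scores(self, T):
        """Grouping module 200: weighted degree lambda_l * P(T, l) per group."""
        return [sum(self._r(T, i) for i in g) / len(g) for g in self.groups]

    def utility(self, T):
        """Extraction module 100: score a candidate set with the utility function."""
        return min(self.group_scores(T))

    def best_candidate(self, T, candidates):
        """Obtaining module 300: candidate whose addition maximizes the utility."""
        return max(candidates, key=lambda u: self.utility(T | {u}))

    def worst_group(self, T):
        """Processing module 400: index of the worst represented group."""
        scores = self.group_scores(T)
        return scores.index(min(scores))


# Hypothetical network with two clusters and one attribute group per cluster.
neighbors = {0: {1, 2}, 1: {0}, 2: {0}, 3: {4}, 4: {3, 5}, 5: {4}}
device = SamplingDevice(neighbors, [[0, 1, 2], [3, 4, 5]])
```

For example, with user 0 already selected, `device.worst_group({0})` points at the second group, and `device.best_candidate({0}, [1, 2, 3, 4, 5])` returns the user that covers it best.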
Further, in one embodiment of the present invention, the utility function is:
$$Q(G, X, \mathcal{G}, T) = \min_{(V_l, a_{j_l}) \in \mathcal{G}} \lambda_l \, P(T, l)$$
where $G = (V, E)$ denotes a social network, $V$ is a set of points containing $|V| = N$ users, $E \subseteq V \times V$ is an edge set containing $|E| = M$ user relationships, $X \in \mathbb{R}^{N \times d}$ is the attribute matrix, $\mathcal{G}$ is the set of attribute groups, $T$ is a subset of users, $\lambda_l$ is a positive number related to the size of the attribute group $(V_l, a_{j_l})$ with default value $\lambda_l = |V_l|^{-1}$, $P(T, l)$ is the representative degree of the user subset $T$ for the attribute group $(V_l, a_{j_l})$, $V_l$ is the set of all users in attribute group $l$, and $a_{j_l}$ is an attribute.
Further, in one embodiment of the invention, the model of the attribute group representative degree is:
$$P(T, l) = \sum_{v_i \in V_l} R(T, v_i, a_{j_l})$$
where $R(T, v_i, a_{j_l})$ is the representative degree of the user subset $T$ for user $v_i$ on the specific attribute $a_{j_l}$, with values in $[0, 1]$. By default, $R(T, v_i, a_{j_l}) = 1$ when a node in $T$ has an edge connected to $v_i$, and $R(T, v_i, a_{j_l}) = 0$ otherwise.
Further, in one embodiment of the invention, if $P(T, l) > 0$ for every attribute group $l$, then all attribute groups are represented, and $P(T, l)$ is kept relatively balanced across the attribute groups so that no attribute group is represented too strongly or too weakly.
It should be noted that the foregoing explanation of the method embodiment for sampling large-scale social network users based on diversified strategies also applies to the device of this embodiment and is not repeated here.
According to the device for sampling large-scale social network users based on diversified strategies provided by the embodiment of the invention, the utility function takes the diversity of the attributes into account and shows which attribute groups the selected users contribute to and how large each contribution is. The device thereby effectively reduces the data scale of the network so that the data become tractable to process, helps remove unrepresentative users so that research can concentrate on the more valuable user groups in the network, effectively improves sampling accuracy, and is efficient in time complexity.
In the description of the present invention, it is to be understood that the terms "central," "longitudinal," "lateral," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," "clockwise," "counterclockwise," "axial," "radial," "circumferential," and the like are used in the orientations and positional relationships indicated in the drawings for convenience in describing the invention and to simplify the description, and are not intended to indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and are therefore not to be considered limiting of the invention.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "connected," "secured," and the like are to be construed broadly and can, for example, be fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; they may be directly connected or indirectly connected through intervening media, or they may be connected internally or in any other suitable relationship, unless expressly stated otherwise. The specific meanings of the above terms in the present invention can be understood by those skilled in the art according to specific situations.
In the present invention, unless otherwise expressly stated or limited, a first feature being "on" or "under" a second feature may mean that the first and second features are in direct contact, or that they are in indirect contact through an intermediate medium. Also, a first feature being "on," "over," or "above" a second feature may mean that the first feature is directly or diagonally above the second feature, or may simply indicate that the first feature is at a higher level than the second feature. A first feature being "under," "below," or "beneath" a second feature may mean that the first feature is directly under or obliquely under the second feature, or may simply mean that the first feature is at a lesser elevation than the second feature.
In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, the various embodiments or examples described in this specification, and the features of different embodiments or examples, can be combined by one skilled in the art provided they do not contradict one another.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (4)

1. A large-scale social network data processing method based on a diversified strategy, characterized in that a user subset in a large-scale social network is applied to recommending information on a social platform or recommending friends in a social network, the method comprising the following steps:
extracting a plurality of user representatives from the large-scale social network data through a utility function;
dividing the multiple user representatives into multiple attribute groups according to the attribute of each user representative of the multiple user representatives to obtain a model of attribute group representative degree;
obtaining the maximum value of the utility function so as to select a representative user from the plurality of attribute groups; and
selecting the group with the worst representative degree by sampling according to the representative users using a diversified strategy, so as to remove the group with the worst representative degree from the large-scale social network data;
wherein the utility function is:
Figure FDA0003034922280000011
where G = (V, E) denotes a social network, V represents a point set containing |V| = N users,
Figure FDA0003034922280000012
represents an edge set containing |E| = M user relationships, X ∈ R^{n×d} is an attribute matrix, T is a subset of users, λ_l is a positive integer dependent on the size of the attribute group (V_l, a_jl) with default value |V_l|^{-1}, P(T, l) is the representative degree of the user subset T for the attribute group (V_l, a_jl), V_l is the set of all users for attribute l, and a_jl is an attribute; the model of the attribute group representative degree is:
Figure FDA0003034922280000013
wherein R(T, v_i, a_jl) is the degree to which the user subset T represents the user v_i on the specific attribute a_jl, with value range [0, 1]; by default, R(T, v_i, a_jl) takes the value 1 when a node in T has an edge connected to v_i, and the value 0 otherwise.
2. The method for processing large-scale social network data based on diversified strategies according to claim 1, wherein if P(T, l) > 0 for every l with 1 ≤ l ≤ T, all attribute groups are represented, and P(T, l) is relatively balanced across the attribute groups, so as to avoid any attribute group being represented too strongly or too weakly.
3. A large-scale social network data processing device based on diversified strategies, wherein a user subset in a large-scale social network is applied to recommending information on a social platform or recommending friends in a social network, the device being characterized by comprising:
the extraction module is used for extracting a plurality of user representatives from the large-scale social network data through the utility function;
the grouping module is used for dividing the multiple user representatives into multiple attribute groups according to the attribute of each user representative of the multiple user representatives so as to obtain a model of attribute group representative degree;
the obtaining module is used for obtaining the maximum value of the utility function so as to select a representative user from the plurality of attribute groups; and
a processing module for selecting the group with the worst representative degree by sampling according to the representative users using a diversified strategy, so as to remove the group with the worst representative degree from the large-scale social network data;
wherein the utility function is:
Figure FDA0003034922280000021
where G = (V, E) denotes a social network, V represents a point set containing |V| = N users,
Figure FDA0003034922280000022
represents an edge set containing |E| = M user relationships, X ∈ R^{n×d} is an attribute matrix, T is a subset of users, λ_l is a positive integer dependent on the size of the attribute group (V_l, a_jl) with default value |V_l|^{-1}, P(T, l) is the representative degree of the user subset T for the attribute group (V_l, a_jl), V_l is the set of all users for attribute l, and a_jl is an attribute; the model of the attribute group representative degree is:
Figure FDA0003034922280000023
wherein R(T, v_i, a_jl) is the degree to which the user subset T represents the user v_i on the specific attribute a_jl, with value range [0, 1]; by default, R(T, v_i, a_jl) takes the value 1 when a node in T has an edge connected to v_i, and the value 0 otherwise.
4. The device of claim 3, wherein if P(T, l) > 0 for every l with 1 ≤ l ≤ T, all attribute groups are represented, and P(T, l) is relatively balanced across the attribute groups, so as to avoid any attribute group being represented too strongly or too weakly.
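For readers who want a concrete picture of the claimed steps, the following Python sketch shows one possible reading of the diversified strategy: at each step, find the attribute group with the lowest representative degree and add the user who raises that degree the most. This greedy loop is an interpretation, not the claimed procedure itself. The averaged form of P(T, l) is an assumption, because the model is given only as a figure, and the names `group_degree`, `diversified_sample`, `toy_graph` and `toy_groups` are hypothetical.

```python
from typing import Dict, Hashable, List, Set

Graph = Dict[Hashable, Set[Hashable]]  # undirected graph stored as adjacency sets


def group_degree(graph: Graph, subset: Set[Hashable], group: Set[Hashable]) -> float:
    """ASSUMED P(T, l): share of users in V_l that have an edge to some node in T."""
    if not group:
        return 0.0
    covered = sum(1 for v in group if graph.get(v, set()) & subset)
    return covered / len(group)


def diversified_sample(
    graph: Graph, groups: Dict[str, Set[Hashable]], k: int
) -> List[Hashable]:
    """Greedy sketch: repeatedly improve the worst-represented attribute group."""
    selected: Set[Hashable] = set()
    order: List[Hashable] = []
    candidates = sorted(graph, key=str)    # fixed order makes ties deterministic
    group_names = sorted(groups, key=str)
    for _ in range(k):
        remaining = [u for u in candidates if u not in selected]
        if not remaining:
            break
        # Attribute group with the lowest representative degree under the current T.
        worst = min(
            group_names, key=lambda name: group_degree(graph, selected, groups[name])
        )
        # User whose addition raises that group's degree the most.
        best = max(
            remaining,
            key=lambda u: group_degree(graph, selected | {u}, groups[worst]),
        )
        selected.add(best)
        order.append(best)
    return order


# Hypothetical toy network with edges a-b, b-c and d-e.
toy_graph: Graph = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b"},
    "d": {"e"},
    "e": {"d"},
}
toy_groups = {"x": {"a", "b", "c"}, "y": {"d", "e"}}

print(diversified_sample(toy_graph, toy_groups, 2))
```

Targeting the weakest group at every step is what keeps P(T, l) roughly balanced across groups in this sketch; a selection that maximized only total coverage could leave small groups unrepresented.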
CN201810284916.2A 2018-04-02 2018-04-02 Method and device for sampling large-scale social network users based on diversified strategies Active CN108596444B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810284916.2A CN108596444B (en) 2018-04-02 2018-04-02 Method and device for sampling large-scale social network users based on diversified strategies

Publications (2)

Publication Number Publication Date
CN108596444A CN108596444A (en) 2018-09-28
CN108596444B true CN108596444B (en) 2021-06-29

Family

ID=63625174

Country Status (1)

Country Link
CN (1) CN108596444B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant