CN112712115A - Network user group division method and system

Info

Publication number: CN112712115A
Application number: CN202011601614.7A
Authority: CN
Inventors: 杜航原
Original assignee: Shanxi University
Current assignee: Shanxi University
Priority date: 2020-12-29
Filing date: 2020-12-29
Publication date: 2021-04-27

Abstract

The invention discloses a network user group division method and system. The method comprises: acquiring user data samples corresponding to the user group to be divided, and clustering the user data samples with preset clustering algorithms to obtain base clustering division results of the user data samples; calculating a similarity matrix of the user data samples based on the base clustering division results; obtaining a graph data representation of the user data samples based on the similarity matrix; and performing clustering integration (cluster ensemble) on the graph data representation with a graph neural network to obtain the user group division result. By mining the relations between network users from the base clustering divisions and carrying out the clustering integration task with a graph neural network, the invention improves the accuracy of the network user group division result.

Description

Network user group division method and system

Technical Field

The invention relates to the field of data mining, in particular to a network user group division method and a network user group division system.

Background

As a broadcast-style network platform, the microblog gives users a broad space for sharing and communication, and its real-time, concise and open character has attracted a huge user base. Data show that in 2018 the number of active microblog users reached 462 million, with growth exceeding 70 million for three consecutive years; the number of vertical fields on the platform expanded to 60, and 32 of those fields had monthly reads exceeding one billion. Faced with a growing user base, how a microblog operator can provide more accurate services to its users is a problem that needs to be solved. The massive data that microblog users generate on the platform contains rich information about user behavior, and finding user groups with similar interests and preferences by analyzing this data can support the optimization of personalized services on the microblog platform.

At present, microblog users are mainly partitioned with a single clustering algorithm, whose partitions lack reliability and stability. Moreover, such clustering algorithms do not fully mine the relations among microblog users, so the resulting user partitions are unsatisfactory.

Disclosure of Invention

The invention provides a network user group division method and system, aiming to solve two technical problems: existing network user group division methods do not fully mine the similarity between user data samples, and a single clustering algorithm partitions a network user group with insufficient reliability and stability.

In order to solve the technical problems, the invention provides the following technical scheme:

In one aspect, the present invention provides a network user group division method, including:

acquiring user data samples corresponding to user groups to be divided, and clustering the user data samples by adopting a preset clustering algorithm to obtain a basic clustering division result of the user data samples;

calculating a similarity matrix of the user data samples based on the base cluster partitioning result;

obtaining graph data representation corresponding to the user data sample based on the similarity matrix;

and carrying out clustering integration on the graph data representation based on a graph neural network to obtain a user group division result.

The method for clustering the user data samples by adopting the preset clustering algorithm to obtain the base clustering division result of the user data samples comprises the following steps:

selecting the number of categories to which the user data samples are to be clustered;

and clustering the user data samples by adopting a plurality of preset different clustering algorithms according to the category number to obtain a base clustering division result of the user data samples.

Wherein, based on the base cluster division result, calculating a similarity matrix of the user data samples comprises:

calculating a similarity matrix of the user data samples with the weighted connected triple (WCT) algorithm based on the base clustering division result; wherein the WCT algorithm comprises the following steps:

calculating the similarity between the intersected clusters in the base cluster partitioning result;

calculating the similarity between the disjoint clusters in the base cluster partitioning result;

and calculating a similarity matrix between the user data samples based on the similarity between the intersecting clusters and the similarity between the disjoint clusters in the base clustering division result.
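The three WCT steps above can be sketched in plain Python. The patent text gives no formulas here, so the sketch follows the commonly described link-based formulation as an assumption: Jaccard overlap as the weight between intersecting clusters, the sum of minimum weights over shared neighbours for disjoint clusters, and a decay factor applied to samples that fall in different clusters. The function name `wct_similarity` and the parameter `decay` are illustrative, not taken from the patent.

```python
from itertools import combinations

def wct_similarity(base_labelings, decay=0.8):
    """Sample-by-sample similarity matrix from several base clusterings (WCT-style sketch)."""
    n = len(base_labelings[0])
    # Collect every cluster of every base clustering as a set of sample indices.
    clusters = {}  # (clustering index, label) -> set of member indices
    for m, labels in enumerate(base_labelings):
        for i, lab in enumerate(labels):
            clusters.setdefault((m, lab), set()).add(i)
    keys = list(clusters)

    # Step 1: weight between intersecting clusters (Jaccard overlap, assumed).
    weight = {}
    for a, b in combinations(keys, 2):
        inter = len(clusters[a] & clusters[b])
        if inter:
            w = inter / len(clusters[a] | clusters[b])
            weight[(a, b)] = weight[(b, a)] = w

    # Step 2: similarity between disjoint clusters of the same base clustering,
    # accumulated over the neighbours they share (connected triples).
    triple = {}
    for a, b in combinations(keys, 2):
        if a[0] != b[0]:
            continue
        total = sum(min(weight[(a, c)], weight[(b, c)])
                    for c in keys if (a, c) in weight and (b, c) in weight)
        triple[(a, b)] = triple[(b, a)] = total
    max_triple = max(triple.values(), default=0.0) or 1.0

    # Step 3: sample similarity, averaged over all base clusterings.
    sim = [[0.0] * n for _ in range(n)]
    for m, labels in enumerate(base_labelings):
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    sim[i][j] += 1.0
                else:
                    a, b = (m, labels[i]), (m, labels[j])
                    sim[i][j] += decay * triple[(a, b)] / max_triple
    return [[v / len(base_labelings) for v in row] for row in sim]

# Two toy base clusterings of four users.
base = [[0, 0, 1, 1], [0, 0, 0, 1]]
S = wct_similarity(base, decay=0.8)
```

Users that every base clustering places together get similarity 1, while users that are always separated still receive a non-zero similarity when their clusters share neighbours, which is what lets the matrix capture relations that a single clustering misses.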

Obtaining graph data representation corresponding to the user data sample based on the similarity matrix, wherein the graph data representation comprises:

and taking the similarity matrix as an adjacency matrix to express the adjacency relation among the user data samples, and transforming the data representation of the user data samples in the feature space into corresponding graph data representation.
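As a minimal sketch of this transformation, the snippet below pairs each user's feature vector with the similarity matrix used as a weighted adjacency matrix. The `GraphData` container, the helper names and the edge threshold are hypothetical; the patent only states that the similarity matrix serves as the adjacency matrix.

```python
from dataclasses import dataclass

@dataclass
class GraphData:
    features: list   # one feature vector per user (node attributes)
    adjacency: list  # n x n similarity matrix used as weighted adjacency

def to_graph(features, similarity, tol=1e-9):
    """Wrap feature vectors and a symmetric similarity matrix as graph data."""
    n = len(features)
    if len(similarity) != n or any(len(row) != n for row in similarity):
        raise ValueError("similarity must be an n x n matrix")
    for i in range(n):
        for j in range(i + 1, n):
            if abs(similarity[i][j] - similarity[j][i]) > tol:
                raise ValueError("similarity matrix must be symmetric")
    return GraphData(features=[list(f) for f in features],
                     adjacency=[list(r) for r in similarity])

def weighted_edges(graph, threshold=0.0):
    """List undirected edges (i, j, weight) whose weight exceeds the threshold."""
    n = len(graph.adjacency)
    return [(i, j, graph.adjacency[i][j])
            for i in range(n) for j in range(i + 1, n)
            if graph.adjacency[i][j] > threshold]

# Three users with toy feature vectors and a toy similarity matrix.
X = [[0.7, 0.3], [0.6, 0.4], [0.1, 0.9]]
S = [[1.0, 0.9, 0.2], [0.9, 1.0, 0.0], [0.2, 0.0, 1.0]]
graph = to_graph(X, S)
edges = weighted_edges(graph, threshold=0.5)
```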

The clustering integration of the graph data representation based on the graph neural network to obtain the user population division result comprises the following steps:

learning low-dimensional embedding of the graph data representation using a preset graph autoencoder;

clustering the low-dimensional embedding with the K-means clustering algorithm to obtain initial cluster centers;

calculating the likelihood distribution of the low-dimensional embedding according to the low-dimensional embedding and the clustering center;

calculating the target distribution of the low-dimensional embedding according to the likelihood distribution;

and supervising the clustering integration process with the likelihood distribution while guiding the learning of the low-dimensional embedding with the clustering integration objective, thereby forming a self-supervised optimization model for clustering integration that yields the user group division result.
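The patent text does not state closed-form expressions for the two distributions or for the loss. One common choice, borrowed from deep embedded clustering and assumed here purely for illustration, uses a Student's t kernel for the likelihood distribution Q, a sharpened version of Q as the target distribution P, and the KL divergence between them as the clustering loss. The symbols are introduced here: z_i is the low-dimensional embedding of user i and \mu_j is the j-th cluster center.

```latex
q_{ij} = \frac{\left(1 + \lVert z_i - \mu_j \rVert^2\right)^{-1}}
              {\sum_{j'} \left(1 + \lVert z_i - \mu_{j'} \rVert^2\right)^{-1}},
\qquad
p_{ij} = \frac{q_{ij}^2 / \sum_{i'} q_{i'j}}
              {\sum_{j'} \left( q_{ij'}^2 / \sum_{i'} q_{i'j'} \right)},
\qquad
L = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

Under this assumption, squaring q_{ij} emphasizes confident assignments and dividing by the soft cluster size \sum_{i'} q_{i'j} keeps large clusters from dominating, so minimizing L pulls the embeddings and the cluster centers toward a sharper partition.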

In another aspect, the present invention further provides a network user group partitioning system, including:

the base clustering module is used for acquiring user data samples corresponding to user groups to be partitioned, and clustering the user data samples by adopting a preset clustering algorithm to obtain base clustering partitioning results of the user data samples;

the similarity calculation module is used for calculating a similarity matrix of the user data sample based on the base clustering division result obtained by the base clustering module;

the graph data representation module is used for obtaining graph data representation corresponding to the user data sample based on the similarity matrix of the user data sample calculated by the similarity calculation module;

and the graph neural network clustering integration module is used for clustering and integrating the graph data representation obtained by the graph data representation module based on the graph neural network to obtain a user group division result.

Wherein the base clustering module is specifically configured to:

Wherein the similarity calculation module is specifically configured to:

Wherein the graph data representation module is specifically configured to:

The graph neural network clustering integration module is specifically used for:

In yet another aspect, the present invention also provides an electronic device comprising a processor and a memory; wherein the memory has stored therein at least one instruction that is loaded and executed by the processor to implement the above-described method.

In yet another aspect, the present invention also provides a computer-readable storage medium having at least one instruction stored therein, the instruction being loaded and executed by a processor to implement the above method.

The technical scheme provided by the invention has the beneficial effects that at least:

The invention performs clustering integration analysis on user data with a clustering integration framework based on a graph neural network. The graph data representation built from the base clusterings of the user data fully reflects the global similarity relations among the user data samples; graph neural networks are well suited to processing such graph data; and the self-supervised model makes information propagation and data mapping in the graph autoencoder obey the final clustering integration objective. As a result, a better network user partition can be obtained and the accuracy of the network user group division result is improved, which helps network operators optimize personalized services and improve marketing returns.

Drawings

To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a network user group division method according to a first embodiment of the present invention;

Fig. 2 is a flowchart of a network user group division method according to a second embodiment of the present invention;

Fig. 3 is a diagram of the graph-neural-network-based clustering integration process according to a second embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

First embodiment

This embodiment provides a network user group division method whose core idea is to mine the relations among network users from the base clustering divisions and to carry out the clustering integration task with a graph neural network, thereby improving the accuracy of the network user group division result. The method may be implemented by an electronic device, which may be a terminal or a server. The execution flow of the method is shown in Fig. 1 and comprises the following steps:

S101, obtaining user data samples corresponding to the user group to be divided, and clustering the user data samples with preset clustering algorithms to obtain base clustering division results of the user data samples;

Specifically, in this embodiment, S101 may include the following process:

selecting the number of categories into which the user data samples are to be clustered; and clustering the user data samples with a plurality of preset different clustering algorithms to obtain the base clustering division results of the user data samples.

S102, calculating a similarity matrix of the user data samples based on the base cluster division result;

Specifically, in this embodiment, S102 may use the weighted connected triple (WCT) algorithm to calculate the similarity matrix of the user data samples, where the WCT algorithm includes the following steps:

calculating the similarity between the intersected clusters in the base clustering division result;

calculating the similarity between the disjoint clusters in the base clustering partitioning result;

S103, obtaining graph data representation corresponding to the user data sample based on the similarity matrix;

specifically, in this embodiment, the step S103 may include the following steps:

And S104, carrying out clustering integration on the graph data based on the graph neural network to obtain a user group division result.

Specifically, in this embodiment, the step S104 may include the following steps:

learning the low-dimensional embedding of the graph data representation obtained in the previous step with a preset graph autoencoder;

supervising the clustering integration process with the likelihood distribution, and guiding the learning of the low-dimensional embedding with the clustering integration objective, so as to form a self-supervised optimization model for clustering integration and optimize the clustering integration result.
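To make the graph autoencoder step concrete, here is a minimal forward pass in plain Python. The architecture is not specified in this section, so the sketch assumes a single graph-convolution encoder layer, Z = A_norm X W, and an inner-product decoder, sigmoid(Z Z^T); the weights are random and untrained, and all function names are hypothetical. It also assumes the similarity matrix already carries ones on its diagonal, so no extra self-loops are added.

```python
import math
import random

def matmul(a, b):
    """Dense matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalize_adjacency(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2} of a weighted adjacency matrix."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def encode(adj_norm, features, weights):
    """One graph-convolution layer without nonlinearity: Z = A_norm X W."""
    return matmul(matmul(adj_norm, features), weights)

def decode(z):
    """Inner-product decoder: reconstruct the adjacency as sigmoid(Z Z^T)."""
    zt = [list(col) for col in zip(*z)]
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in matmul(z, zt)]

def reconstruction_loss(adj, adj_hat):
    """Mean squared error between the input and the reconstructed adjacency."""
    n = len(adj)
    return sum((adj[i][j] - adj_hat[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)

rng = random.Random(0)
A = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.1], [0.1, 0.1, 1.0]]   # toy similarity matrix
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # identity features (assumed)
W = [[rng.gauss(0, 0.5) for _ in range(2)] for _ in range(3)]  # untrained 3 -> 2 projection
Z = encode(normalize_adjacency(A), X, W)   # low-dimensional embedding, one row per user
A_hat = decode(Z)
loss = reconstruction_loss(A, A_hat)
```

In a trained model the weights would be updated to reduce the reconstruction loss together with the clustering loss; the forward pass alone only shows how the adjacency matrix shapes the embedding.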

In the embodiment, the user data samples are clustered by adopting a preset clustering algorithm to obtain a base clustering division result of the user data samples; calculating a similarity matrix of the user data samples based on the base clustering division result; obtaining graph data representation corresponding to the user data sample based on the similarity matrix; and carrying out clustering integration on the graph data based on the graph neural network to obtain a user group division result. The accuracy of the network group user partition result is improved, and support is provided for network operators to better optimize personalized services and promote marketing benefits.

Second embodiment

In this embodiment, the network user group division method is used for the microblog user data sample set, the cluster integration analysis is performed on the microblog user data sample set, and the grouping division is performed on the microblog users according to the cluster integration result, so that the personalized service of the microblog operators on the users is facilitated. The method may be implemented by an electronic device, which may be a terminal or a server. The execution flow of the network user group division method is shown in fig. 2, and comprises the following steps:

S101, collecting microblog user information, extracting data features, and entering S102;

Specifically, in this embodiment, this step includes: capturing microblog user information with a web crawler tool, where the captured information comprises basic user information and microblog account information; the microblog account information comprises, for the accounts in the user's following list, the account name, verification status, profile, number of followers and number of accounts followed.

S102, checking and preprocessing the user information data, and then entering S103;

Specifically, in this embodiment, this step includes the following sub-steps:

S1021, data checking

Before the clustering integration analysis, it is first determined whether the selected data sample is representative of the whole population: three indicators (gender, age and region) are selected, and the data sample is compared with standard data on these indicators;

S1022, user filtering

The crawled microblog users include silent users, who follow few accounts and have posted few microblogs; such users do not truly reflect interests and preferences and need to be removed. In this embodiment, a microblog user whose number of followed accounts is less than one tenth of the mean number of followed accounts over all microblog users, and who has posted fewer than ten microblogs, is marked as a "silent user" and removed from the data table;
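The silent-user rule can be sketched as a simple filter. The record keys `following` and `posts` are hypothetical field names, and the rule is read here as requiring both conditions at once (few followed accounts and fewer than ten posts), which is an interpretation of the text rather than something it states unambiguously.

```python
def remove_silent_users(users):
    """Drop users who follow fewer than a tenth of the mean number of accounts
    and have also posted fewer than ten microblogs."""
    mean_following = sum(u["following"] for u in users) / len(users)

    def is_silent(u):
        return u["following"] < mean_following / 10 and u["posts"] < 10

    return [u for u in users if not is_silent(u)]

# Toy records; mean number of followed accounts is 200, so the cutoff is 20.
users = [
    {"name": "a", "following": 200, "posts": 500},
    {"name": "b", "following": 300, "posts": 5},
    {"name": "c", "following": 10, "posts": 3},
    {"name": "d", "following": 290, "posts": 50},
]
active = remove_silent_users(users)
```

In this toy data only user `c` meets both conditions, so user `b`, who posts rarely but follows many accounts, is kept.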

S1023, classifying the accounts followed by microblog users

The "profile" and "verification" fields of the accounts followed by a microblog user are used to identify accounts of different categories and to classify the accounts in the following list. Following a mainstream classification, this embodiment divides the followed accounts into friend microblogs, celebrity microblogs and functional microblogs. A friend microblog is the account of a person close to the microblog user; a celebrity microblog is the account of a representative well-known figure in some field; a functional microblog is an account with a certain social function, typically an officially verified account of an industry organization, an information account of a news medium, and the like;

S1024, representing the interests of microblog users

Representing the interests of a microblog user comprises determining the interest set, removing invalid accounts and mapping to the interest set. Determining the interest set means classifying the interests of microblog users into an interest set with reference to the classification system of mainstream microblog platforms and the field classification of verified "big V" microblog accounts. Removing invalid accounts means removing followed accounts, such as friend microblogs, that do not reflect the user's interests, and keeping the accounts that clearly reflect them. Mapping to the interest set means that for any account in the account set there is always an interest in the interest set to which it corresponds. Accounts such as functional microblogs reflect the interest preferences of the same category of users and need to be merged and classified. Specifically, following the mainstream classification on the current network, the interests of microblog users are divided into the following categories: fashion shopping, food, travel photography, sports, movie entertainment, music, game animation, literature reading, industrial work, and IT digital.
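One way to turn this mapping into a feature vector is sketched below. The ten category names come from the text; the account handles, the `account_to_interest` lookup table and the choice to normalize counts into proportions are assumptions made for illustration.

```python
INTERESTS = [
    "fashion shopping", "food", "travel photography", "sports",
    "movie entertainment", "music", "game animation",
    "literature reading", "industrial work", "IT digital",
]

def interest_vector(followed_accounts, account_to_interest):
    """Share of a user's valid followed accounts that falls in each interest category."""
    counts = [0] * len(INTERESTS)
    for account in followed_accounts:
        interest = account_to_interest.get(account)
        if interest is None:   # invalid account: carries no interest signal
            continue
        counts[INTERESTS.index(interest)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(INTERESTS)

# Hypothetical lookup from followed account to interest category.
account_to_interest = {
    "@food_daily": "food",
    "@chef_li": "food",
    "@nba_news": "sports",
}
followed = ["@food_daily", "@chef_li", "@nba_news", "@old_classmate"]
vec = interest_vector(followed, account_to_interest)
```

The friend account `@old_classmate` has no entry in the lookup table, so it is skipped in the same way the text removes invalid accounts before mapping.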

S103, clustering the preprocessed data to obtain a base cluster, and then entering S104;

Specifically, in this embodiment, this step includes: selecting the number K of categories to be clustered, and clustering the data samples with several common, different clustering algorithms to obtain the base clustering divisions of the data samples.
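The patent does not name the clustering algorithms used to build the base clusterings. As a hedged sketch, the snippet below produces three base clusterings for a given K with two illustrative choices, K-means with two different seeds and single-linkage agglomerative clustering; both implementations are deliberately minimal and all names are hypothetical.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed=0, iters=50):
    """Plain Lloyd iterations with randomly sampled initial centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def single_linkage(points, k):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(sq_dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters.pop(b))
    labels = [0] * len(points)
    for lab, members in enumerate(clusters):
        for i in members:
            labels[i] = lab
    return labels

def base_clusterings(points, k):
    """One label list per base clusterer, all using the same number of categories K."""
    return [kmeans(points, k, seed=0), kmeans(points, k, seed=1), single_linkage(points, k)]

# Six toy users in two well-separated groups.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
partitions = base_clusterings(points, k=2)
```

Cluster label values are arbitrary across the base clusterings (label 0 in one need not match label 0 in another), which is why the later similarity step compares cluster memberships rather than labels.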

S104, calculating the similarity between users to obtain a similarity matrix of the users, and then entering S105;

Specifically, in this embodiment, this step includes: calculating the similarity between users with the WCT algorithm to obtain the user similarity matrix; the WCT algorithm mainly comprises the following steps:

calculating the similarity between intersecting clusters in the base clusterings of the data samples; calculating the similarity between disjoint clusters in the base clusterings; and calculating the similarity between user data samples from these cluster similarities.

S105, obtaining graph data representation of the user according to the similarity matrix, and then entering S106;

Specifically, in this embodiment, this step includes: taking the similarity matrix as an adjacency matrix that expresses the adjacency relations among the user data samples, and transforming the representation of the user data samples in feature space into the corresponding graph data representation, which fully reflects the relations among the microblog user data.

S106, learning a low-dimensional embedding of the graph data representation with a graph autoencoder, and then entering S107;

S107, clustering the low-dimensional embedding with the K-means clustering algorithm to obtain initial cluster centers, and then entering S108;

S108, calculating the likelihood distribution from the low-dimensional embedding and the cluster centers, and then entering S109;

S109, calculating the target distribution from the likelihood distribution, and then entering S110;

S110, minimizing the loss function, and then entering S111;

S111, judging whether the set threshold is reached; if so, entering S112, and if not, returning to S108;

S112, outputting the group division result of the microblog users.

Steps S106 to S112 can be summarized as follows: as shown in Fig. 3, an improved graph neural network clustering integration framework performs clustering integration on the user data, and the clustering integration objective guides the learning of the low-dimensional embedding, forming a self-supervised optimization model for clustering integration that optimizes the clustering integration result. The clustering integration result obtained when the iteration completes is taken as the optimal user partition result of the microblog network.
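The control flow of S108 to S111 can be sketched as a loop. Several simplifications are assumed and should be read as such: the embeddings are held fixed and only the cluster centers are updated (the described method also updates the embedding), the likelihood and target distributions use the Student's t form from deep embedded clustering, the initial centers are supplied by hand instead of by K-means, and the stopping threshold is taken to be the fraction of users whose label changed between iterations. None of these specifics is stated in the text.

```python
def soft_assign(z, centers):
    """Likelihood distribution: Student's t similarity of each embedding to each center."""
    q = []
    for zi in z:
        row = [1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(zi, c))) for c in centers]
        s = sum(row)
        q.append([v / s for v in row])
    return q

def target(q):
    """Target distribution: q squared, divided by soft cluster size, row-normalized."""
    k = len(q[0])
    f = [sum(r[j] for r in q) for j in range(k)]
    p = []
    for r in q:
        w = [r[j] ** 2 / f[j] for j in range(k)]
        s = sum(w)
        p.append([v / s for v in w])
    return p

def refine(z, centers, lr=0.1, tol=0.01, max_iter=100):
    """Iterate S108-S111 with fixed embeddings, updating the centers only."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        q = soft_assign(z, centers)                        # S108
        p = target(q)                                      # S109
        new_labels = [max(range(len(r)), key=r.__getitem__) for r in q]
        if labels is not None:
            changed = sum(a != b for a, b in zip(labels, new_labels)) / len(z)
            if changed < tol:                              # S111: threshold reached
                break
        labels = new_labels
        for j, c in enumerate(centers):                    # S110: gradient step on KL(P||Q)
            grad = [0.0] * len(c)
            for zi, pi, qi in zip(z, p, q):
                w = 1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(zi, c)))
                for d in range(len(c)):
                    grad[d] -= 2.0 * w * (pi[j] - qi[j]) * (zi[d] - c[d])
            centers[j] = [cd - lr * gd for cd, gd in zip(c, grad)]
    return new_labels, centers

# Toy embeddings for six users and hand-picked initial centers.
Z = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [3.0, 3.0], [3.2, 2.9], [2.9, 3.1]]
labels, centers = refine(Z, [[0.5, 0.5], [2.5, 2.5]])
```

With well-separated toy embeddings the labels stop changing after the first update, so the loop exits through the threshold check and returns the partition (S112).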

In summary, the method of this embodiment generates a graph data representation of the microblog users from the base clusterings, which fully reflects the global similarity relations of the samples; it uses a graph neural network, which is better suited to processing graph data with missing attributes; and its self-supervised model makes information propagation and data mapping in the graph autoencoder obey the final clustering integration objective, so that the learned low-dimensional embedding helps to obtain the optimal microblog user group division result. Better microblog user partitions can thus be obtained and the accuracy of the microblog user group division result improved, which helps microblog operators optimize personalized services and improve marketing returns.

Third embodiment

The embodiment provides a network user group division system, which comprises the following modules:

The network user group division system of the present embodiment corresponds to the network user group division method of the first embodiment described above; the functions realized by each functional module in the network user group division system correspond to each flow step in the network user group division method one by one; therefore, it is not described herein.

Fourth embodiment

The present embodiment provides an electronic device, which includes a processor and a memory; wherein the memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement the method of the above embodiment.

The electronic device may vary considerably in configuration or performance, and may include one or more processors (central processing units, CPUs) and one or more memories, where at least one instruction is stored in the memory, and the instruction is loaded by the processor and performs the following steps:

The electronic equipment of the embodiment performs cluster analysis on the user data samples to obtain a base cluster division result of the user data samples; calculating a similarity matrix of the user data samples based on the base clustering division result; obtaining graph data representation corresponding to the user data sample based on the similarity matrix; and carrying out clustering integration on the graph data based on the graph neural network to obtain a user group division result. The accuracy of the network group user partition result is improved, and therefore support is provided for network operators to better optimize personalized services and promote marketing benefits.

Fifth embodiment

The present embodiment provides a computer-readable storage medium having at least one instruction stored therein, the instruction being loaded and executed by a processor to implement the above method. The computer-readable storage medium may be, for example, a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like. The instructions stored therein may be loaded by a processor in the terminal and perform the following steps:

In the program method stored in the computer-readable storage medium of this embodiment, a base clustering partitioning result of a user data sample is obtained by obtaining the user data sample corresponding to a user group to be partitioned and clustering the user data sample; calculating a similarity matrix of the user data samples based on the base clustering division result; obtaining graph data representation corresponding to the user data sample based on the similarity matrix; and carrying out clustering integration on the graph data based on the graph neural network to obtain a user group division result. The accuracy of the network group user partition result is improved, and therefore support is provided for network operators to better optimize personalized services and promote marketing benefits.

Furthermore, it should be noted that the present invention may be provided as a method, apparatus or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media having computer-usable program code embodied in the medium.

Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It should also be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.

Finally, it should be noted that while the above describes a preferred embodiment of the invention, it will be appreciated by those skilled in the art that, once the basic inventive concepts have been learned, numerous changes and modifications may be made without departing from the principles of the invention, which shall be deemed to be within the scope of the invention. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.

Claims

1. A method for dividing a network user group is characterized by comprising the following steps:

2. The method of claim 1, wherein clustering the user data samples using a predetermined clustering algorithm to obtain a base cluster partitioning result of the user data samples comprises:

3. The method of claim 1, wherein said calculating a similarity matrix of said user data samples based on said base cluster partition results comprises:

4. The method for dividing a network user group according to claim 1, wherein the obtaining of the graph data representation corresponding to the user data sample based on the similarity matrix comprises:

5. The method according to claim 1, wherein the clustering integration of the graph data representation based on the graph neural network to obtain the user population partitioning result comprises:

6. A network user population partitioning system, said system comprising:

7. The system for network user population partitioning of claim 6, wherein said base clustering module is specifically configured to:

8. The system for partitioning a population of network users of claim 6, wherein the similarity calculation module is specifically configured to:

9. The system for partitioning a population of network users of claim 6, wherein said graph data representation module is specifically configured to:

10. The network user population partitioning system of claim 6, wherein said graph neural network clustering integration module is specifically configured to: