CN112711699A - User division method, system, computer device and readable storage medium

Info

Publication number: CN112711699A
Application number: CN201911016451.3A
Authority: CN
Inventors: 陈家伟
Original assignee: Shanghai Bilibili Technology Co Ltd
Current assignee: Shanghai Bilibili Technology Co Ltd
Priority date: 2019-10-24
Filing date: 2019-10-24
Publication date: 2021-04-27
Anticipated expiration: 2039-10-24
Also published as: CN112711699B

Abstract

The invention discloses a user division method, a user division system, a computer device and a readable storage medium. A plurality of manuscript data uploaded by a first user are acquired, and a theme corresponding to each manuscript data is generated according to the plurality of manuscript data. Operation information of a second user on the first user within a preset time is then acquired, a corresponding heterogeneous network is generated according to the theme and the operation information, and the second users are clustered according to the heterogeneous network, so that second users of the same class form a community according to the clustering result and the second users are thereby divided. The invention can mine and process multiple types of data to divide users accurately, so that more refined services can be provided to users according to the division result.

Description

User division method, system, computer device and readable storage medium

Technical Field

The invention relates to the field of big data, in particular to a user partitioning method, a user partitioning system, computer equipment and a readable storage medium.

Background

With the rapid development of the internet, the amount of information on large websites is increasing rapidly, and classifying the users of such websites from massive data has become extremely difficult. At present, the general methods for user division mainly comprise: (1) user division based on demographic attributes; (2) user division based on social relationships.

However, on the one hand, owing to the diversity of data types and the limited accumulation of basic data, existing user division does not effectively mine diverse data, so the division results are inaccurate and more refined services cannot be provided to users. On the other hand, although diverse data makes it possible to find the underlying causes, it places higher demands on the mining method. Therefore, the invention aims to solve the problem of how to mine and process multi-type data to divide users accurately, so as to provide more refined services to users.

Disclosure of Invention

The invention aims to provide a user division method, a user division system, computer equipment and a readable storage medium, which can mine and process multi-type data to realize accurate division of users and further provide more refined services for the users according to the division results of the users.

According to an aspect of the present invention, there is provided a user segmentation method, including the steps of:

acquiring a plurality of manuscript data uploaded by a first user, and generating a theme corresponding to each manuscript data according to the plurality of manuscript data, wherein each manuscript data comprises manuscript contents and at least one label;

acquiring operation information of a second user on the first user within preset time;

generating a corresponding heterogeneous network according to the theme and the operation information;

and clustering the second users according to the heterogeneous network so that the second users of the same class form a community according to clustering results to divide the second users.

Optionally, the obtaining multiple pieces of manuscript data uploaded by the first user, and generating a theme corresponding to each piece of manuscript data according to the multiple pieces of manuscript data includes:

screening the labels of the manuscript data according to a preset rule to screen out a plurality of target labels;

acquiring the common occurrence times of the target labels in the manuscript data;

analyzing the distance of the target labels in the heterogeneous network according to the times, and training the target labels into corresponding target label word vectors;

and clustering the plurality of target labels according to the target label word vectors, the distance and the preset algorithm to form a theme corresponding to each manuscript data.

Optionally, the screening the tags of the multiple manuscript data according to a preset rule to screen out multiple target tags, including:

acquiring the labels of the plurality of manuscript data of the first user from a database, and counting the number of the manuscript contents under each label;

acquiring the heat information of each label from a database;

and screening the labels of the plurality of manuscript data according to the manuscript content quantity under each label and the heat information of each label so as to screen out the plurality of target labels.

Optionally, the generating a corresponding heterogeneous network according to the theme and the operation information includes:

calculating the distance between each label and each theme;

counting the average distance between the labels of the manuscript data and each topic according to the distance, and acquiring the score of the first user under each topic according to the average distance and a preset score mapping table;

and acquiring a first association relation between the first user and each topic according to the score.

Optionally, the counting average distances between the labels of the multiple pieces of manuscript data and the respective topics according to the distances to obtain scores of the first user under the respective topics according to the average distances and a preset score mapping table includes:

adding the distances from each label of the first user to each theme to obtain the total distance from the first user to each theme;

dividing the total distance by the number of the labels of the plurality of manuscript data to obtain an average distance between the first user and each theme;

and mapping the average distance to the score mapping table to obtain the score corresponding to the average distance.

Optionally, the generating a corresponding heterogeneous network according to the plurality of topics and the operation information further includes:

counting the operation times of the second user on the first user according to the operation information;

acquiring total operation information executed by the second user within the preset time, and counting the total operation times of the second user within the preset time according to the total operation information;

calculating the weight relationship between the second user and the first user according to the operation times and the total operation times so as to obtain a second association relationship between the second user and the first user according to the weight relationship;

and generating the heterogeneous network according to the first association relation and the second association relation.

Optionally, the clustering the second users according to the heterogeneous network so that the second users of the same class form a community according to a clustering result includes:

acquiring a sampling range input by a third user;

taking the second user as a central node, sampling other second user nodes in the sampling range to obtain other second user node sequences in the sampling range;

training the other second user nodes according to the other second user node sequences to obtain corresponding other second user word vectors;

and clustering the second user and the other second users according to the word vectors of the other second users and the preset algorithm, and forming the community according to a clustering result.

In order to achieve the above object, the present invention further provides a user partitioning system, which specifically includes the following components:

the acquisition module is used for acquiring a plurality of manuscript data uploaded by a first user and operation information of a second user on the first user in preset time, wherein each manuscript data comprises manuscript contents and at least one label;

the generating module is used for generating a theme corresponding to each manuscript data according to the plurality of manuscript data and generating a corresponding heterogeneous network according to the theme and the operation information;

and the dividing module is used for clustering the second users according to the heterogeneous network so as to enable the second users of the same class to form a community according to a clustering result, and thus the second users are divided.

In order to achieve the above object, the present invention further provides a computer device, which specifically includes: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the user segmentation method introduced above when executing the computer program.

In order to achieve the above object, the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when being executed by a processor, realizes the above-introduced steps of the user segmentation method.

The user dividing method, the user dividing system, the computer device and the readable storage medium provided by the invention are used for acquiring a plurality of manuscript data uploaded by a first user, generating a theme corresponding to each manuscript data according to the plurality of manuscript data, then acquiring operation information of a second user on the first user within preset time, generating a corresponding heterogeneous network according to the theme and the operation information, and then clustering the second users according to the heterogeneous network, so that the second users of the same type form a community according to a clustering result, and the second users are divided. The invention can mine and process the data of multiple types to realize the accurate division of the user, thereby providing more refined service for the user according to the division result of the user.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

fig. 1 is an optional application environment diagram of the user partition method provided by the embodiment of the present disclosure;

fig. 2 is an alternative flow chart of a user partition method according to an embodiment of the present disclosure;

fig. 3 is a schematic diagram illustrating an alternative specific flowchart of step S100 in fig. 2;

fig. 4 is a schematic diagram of another alternative specific flowchart of step S104 in fig. 2;

fig. 5 is a schematic diagram illustrating an alternative specific flowchart of step S302 in fig. 4;

fig. 6 is a schematic diagram of another alternative specific flowchart of step S104 in fig. 2;

fig. 7 is a schematic diagram illustrating an alternative specific flowchart of step S106 in fig. 2;

FIG. 8 is a schematic diagram of an alternative program module of a user segmentation system provided by an embodiment of the present disclosure;

fig. 9 is a schematic diagram of an alternative hardware architecture of a computer device according to an embodiment of the present disclosure.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

FIG. 1 is a diagram of an application environment of the user division method of the present invention. A plurality of first users upload a plurality of labeled manuscript data to a video playing platform, and the user division system acquires the manuscript data and clusters them into a plurality of themes. For example, the user division system receives manuscript data a1, a2, …, an; manuscript data b1, b2, …, bm; and manuscript data k1, k2, …, kj uploaded by a plurality of first users, and then clusters these manuscript data to form theme 1 and theme 2. Exemplary clustering results are shown in fig. 1. The user division system then establishes the association relationship between each first user and each theme according to the distance between the manuscript data of that first user and the theme. A second user interacts with a first user by following the first user, or by commenting on, liking or favoriting (collecting) the manuscript data of the first user, which establishes the association relationship between the first user and the second user.

The following describes a user segmentation method provided by the present invention with reference to the accompanying drawings.

Fig. 2 is an optional flowchart of the user segmentation method of the present invention, and it is to be understood that the flowchart in this embodiment of the method is not used to limit the order of executing steps. The following description is made by taking a computer device as an execution subject. As shown in fig. 2, the method specifically includes steps S100 to S106.

Step S100: the method comprises the steps of obtaining a plurality of manuscript data uploaded by a first user, and generating a theme corresponding to each manuscript data according to the plurality of manuscript data, wherein each manuscript data comprises manuscript contents and at least one label.

Exemplarily, referring to fig. 1, when first user 1 wants to upload a video about food, the video is first given a label, for example a food label, and the labeled video is then uploaded. The user division system acquires the video with the food label and generates a food theme according to this food label and the food labels of videos uploaded by other users. When first user 2 wants to upload a video about losing weight through exercise and diet, the labels marking the video comprise a food label and a sports label, and the video is uploaded with both labels. The user division system acquires the video, generates the food theme according to its food label and other food labels, and generates a sports theme according to its sports label and the sports labels of videos uploaded by other users. The manuscript content may also be an audio file or text.

Step S102: acquiring operation information of a second user on the first user within a preset time, wherein the operation information comprises: following the first user, commenting on the manuscript data of the first user and/or favoriting (collecting) the manuscript data of the first user.

Specifically, after the first user uploads a video, the second user may follow the first user, or comment on, like or favorite the video uploaded by the first user. After the second user performs these operations, the user division system receives the operation information of the second user on the first user. In this embodiment, in order to better explain the association relationship between the first user and the second user, that is, the degree of interest of the second user in the first user, the second user's commenting on, liking and favoriting of the videos uploaded by the first user are referred to, for brevity, as the second user commenting on, liking and favoriting the first user.

Step S104: and generating a corresponding heterogeneous network according to the theme and the operation information, wherein the heterogeneous network comprises a plurality of nodes, and the plurality of nodes comprise a first user node, a second user node and a theme node.

Specifically, after the themes are generated according to the labels of the manuscript data uploaded by the first users, an association relation table of the first user, the second user and the theme is generated according to the themes and the operation information of the second user on the first user, and the heterogeneous network is generated according to the association relation table. The more frequently the second user operates on the first user, the higher the degree of association between them, the closer they are, and the shorter the edge between the second user and the first user in the heterogeneous network. Conversely, the lower the degree of association between the second user and the first user, the farther apart they are, and the longer the edge between them in the heterogeneous network.

Illustratively, since the videos uploaded by the first users carry a plurality of labels, a plurality of themes are generated according to the labels, which establishes the relationship between the first users and the themes. The relationship between the first user and the second user is generated according to the operations of the second user on the first user. The association relationship among the first user, the second user and the theme is thereby established, and a heterogeneous network with the first users, the second users and the themes as nodes is constructed. It should be noted that the heterogeneous network indicates the degree of association among the first user, the second user and the theme. The shorter the edge between the first user and the second user, the higher the degree of interest of the second user in the first user, and the higher the degree of association between them. The shorter the edge between the first user and the theme, the closer the manuscript data uploaded by the first user are to the theme, and the higher the degree of association between the first user and the theme.

With reference to fig. 1, second user A performs related operations on first user 1, first user 2 and first user n, which establishes the association relationships between second user A and first users 1, 2 and n. Second user B performs related operations on first user 1 and first user 2, which establishes the association relationships between second user B and first users 1 and 2. Second user C performs related operations on first user 2 and first user n, which establishes the association relationships between second user C and first users 2 and n. The user division system generates topic 1 according to manuscript data a1 and a2 of first user 1, manuscript data b1, b2 and b3 of first user 2, and manuscript data k1 of first user n, and generates topic 2 according to manuscript data a3 of first user 1 and manuscript data k2 and k3 of first user n, which establishes the association relationships between first users 1, 2 and n and topics 1 and 2. The user division system then generates the heterogeneous network according to the association relationships between second user A and first users 1, 2 and n, between second user B and first users 1 and 2, between second user C and first users 2 and n, and between first users 1, 2 and n and topics 1 and 2.

Step S106: and clustering the second users according to the heterogeneous network so that the second users of the same class form a community according to clustering results to divide the second users.

Illustratively, second users having similar interests according to the heterogeneous network are divided into the same community to complete the division of the second users. With reference to fig. 1, suppose that, among all first users, second user A has the closest association relationship with first user 1 and second user B also has the closest association relationship with first user 1; that is, second user A and second user B are both most interested in first user 1. If first user 1 is closest to topic 1, second user A and second user B are clustered into one class and form a community, so that second user A and second user B are divided into the same community.
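The fig. 1 example above can be illustrated with a deliberately simplified sketch. It only groups second users by the first user they are most closely associated with; it does not reproduce the sampling and word-vector clustering of step S106. The function name and the weight values are hypothetical.

```python
from collections import defaultdict

def group_by_closest_first_user(weights):
    """Group second users by the first user they are most closely associated with.

    weights: {second_user: {first_user: association weight}} (assumed input shape).
    """
    communities = defaultdict(list)
    for second_user, per_first in weights.items():
        # The first user with the largest weight is taken as the closest one.
        closest = max(per_first, key=per_first.get)
        communities[closest].append(second_user)
    return {first: sorted(members) for first, members in communities.items()}

# Made-up weights mirroring the second users A, B and C of fig. 1.
weights = {
    "A": {"first user 1": 0.6, "first user 2": 0.3, "first user n": 0.1},
    "B": {"first user 1": 0.7, "first user 2": 0.3},
    "C": {"first user 2": 0.8, "first user n": 0.2},
}
communities = group_by_closest_first_user(weights)
print(communities)
```

In this toy data, second users A and B fall into the same group because both are most closely associated with first user 1.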

In an exemplary embodiment, as shown in fig. 3, the step S100 may include steps S200 to S206.

Step S200: and screening the labels of the manuscript data according to a preset rule so as to screen out a plurality of target labels.

In a preferred embodiment, the step of filtering the tags of the manuscript data according to a preset rule to filter out a plurality of target tags includes: acquiring labels of a plurality of manuscript data of the first user and the heat information of each label from a database, counting the number of the manuscript contents under each label, and screening the labels of the plurality of manuscript data according to the number of the manuscript contents under each label and the heat information of each label to screen a plurality of target labels from the plurality of labels.

For example, referring to fig. 1, first user 1, first user 2, …, first user n upload a large number of videos, and each video has at least one label. All labels of all videos uploaded by first users 1 to n and the heat information (i.e., popularity information) of each label are acquired, and the number of videos under each label is counted. For example, if the only label of manuscript data a1 is food and the labels of manuscript data a2 are food and travel, the number of videos under the food label is 2. The quality of each label is then evaluated by combining the number of videos under the label with the popularity information of the label, so that target labels (i.e., high-quality labels) with a large number of videos and high popularity are screened out from all the labels. Of course, the number of target labels to be screened out can be set arbitrarily as required.
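As a minimal sketch of the screening in step S200, the following assumes that the preset rule is a pair of thresholds, one on the number of manuscripts under a label and one on its popularity. The source does not specify the rule, so the thresholds, the function name and the sample data are all hypothetical.

```python
from collections import Counter

def screen_target_tags(manuscripts, popularity, min_count=2, min_popularity=0.5):
    """Keep labels that have enough manuscripts and are popular enough.

    manuscripts: list of label lists, one per manuscript.
    popularity: dict mapping label -> popularity ("heat") value.
    Both thresholds are assumptions standing in for the preset rule.
    """
    # Count how many manuscripts carry each label (once per manuscript).
    counts = Counter(tag for tags in manuscripts for tag in set(tags))
    return sorted(
        tag for tag, n in counts.items()
        if n >= min_count and popularity.get(tag, 0.0) >= min_popularity
    )

manuscripts = [["food"], ["food", "travel"], ["sports", "food"], ["travel"]]
popularity = {"food": 0.9, "travel": 0.3, "sports": 0.8}
target_tags = screen_target_tags(manuscripts, popularity)
print(target_tags)
```

Here only the food label survives: the travel label has enough manuscripts but low popularity, and the sports label is popular but appears in a single manuscript.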

Step S202: and acquiring the common occurrence times of the target labels in the manuscript data.

Step S204: and analyzing the distance of the target labels in the heterogeneous network according to the times, and training the target labels into corresponding target label word vectors. It should be noted that, the more times two target tags appear together in multiple manuscript data, the closer the two target tags are in the heterogeneous network.

For example, losing weight usually requires exercising while controlling diet. Therefore, two kinds of labels, a food label and a sports label, often appear together in videos about weight loss. The number of times the food label and the sports label appear together in the manuscript data is acquired, and the distance between the food label and the sports label in the heterogeneous network is analyzed according to that number. The food label and the sports label are then trained into a corresponding food label word vector and sports label word vector.
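The co-occurrence counting of step S202 can be sketched as follows. The mapping from a count to a distance is an assumption chosen only so that more co-occurrences give a smaller distance, as the description requires; the source does not give a formula, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def count_cooccurrences(manuscripts):
    """Count how many manuscripts each unordered pair of labels shares."""
    pair_counts = Counter()
    for tags in manuscripts:
        # Sorting makes ("food", "sports") and ("sports", "food") the same key.
        for a, b in combinations(sorted(set(tags)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

def cooccurrence_distance(count):
    """Hypothetical mapping: more co-occurrences give a smaller distance."""
    return 1.0 / (1.0 + count)

manuscripts = [
    ["food", "sports"],
    ["food", "sports", "weight loss"],
    ["education"],
    ["travel", "food"],
]
pairs = count_cooccurrences(manuscripts)
print(pairs[("food", "sports")])
print(cooccurrence_distance(pairs[("food", "sports")]))
print(cooccurrence_distance(pairs[("education", "travel")]))
```

The food and sports labels share two manuscripts, so they end up closer than the education and travel labels, which never appear together.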

Step S206: and clustering the plurality of target labels according to the target label word vectors, the distance and the preset algorithm to form a theme corresponding to each manuscript data. The preset algorithm may include any one of a Kmeans clustering algorithm, a mean shift clustering algorithm, and a density-based clustering algorithm.

Illustratively, in combination with the distance between the food label and the sports label in the heterogeneous network, Kmeans clustering is applied to the trained food label word vector and sports label word vector so that they form the same topic, for example a weight-loss topic. For another example, the education label and the travel label rarely appear together in the same manuscript data and are far apart in the heterogeneous network, so the word vectors generated from the education label and the travel label form different topics, for example an education topic and a travel topic respectively. Through this exemplary embodiment, the corresponding themes can be generated according to the co-occurrence counts of the labels in the manuscripts, thereby establishing the association relationship between the labels and the themes.
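A minimal sketch of the Kmeans option of step S206 is given below. The two-dimensional label vectors are invented for illustration (real vectors would come from the training in step S204), and seeding the centers with the first k points is a simplification rather than anything stated in the source.

```python
import math

def kmeans(points, k, iterations=10):
    """Plain k-means on tuples; the first k points seed the centers."""
    centers = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return assignment, centers

# Hypothetical label word vectors.
tag_vectors = {
    "food": (0.9, 0.1),
    "education": (0.1, 0.9),
    "sports": (0.8, 0.2),
}
names = list(tag_vectors)
assignment, centers = kmeans([tag_vectors[n] for n in names], k=2)
topics = dict(zip(names, assignment))
print(topics)
```

With these vectors, the food and sports labels are assigned to the same cluster and the education label to the other, matching the weight-loss example in the text.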

In an exemplary embodiment, as shown in fig. 4, the step S104 may include steps S300 to S304.

Step S300: and calculating the distance between each label and each theme.

Illustratively, the distance function of the Kmeans algorithm is used to calculate the distance between each label and each topic. For example, fig. 1 shows only two topics, topic 1 and topic 2. The distances from the food label in manuscript data a1 to the center of topic 1 and to the center of topic 2 are calculated, then the distances from the food label and the travel label in manuscript data a2 to the center of topic 1 and to the center of topic 2, and so on until the distances from the labels in all the manuscript data to the centers of topic 1 and topic 2 have been calculated.
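The label-to-topic distance of step S300 might look like the sketch below. It assumes Euclidean distance, which is the usual distance function in Kmeans but is not named in the source, and it uses invented label vectors and topic centers.

```python
import math

# Hypothetical label vectors and topic centers (for example, Kmeans centroids).
tag_vectors = {"food": (0.9, 0.1), "travel": (0.2, 0.8)}
topic_centers = {"topic 1": (1.0, 0.0), "topic 2": (0.0, 1.0)}

def tag_topic_distances(tag_vectors, topic_centers):
    """Return {(label, topic): Euclidean distance} for every label and topic."""
    return {
        (tag, topic): math.dist(vec, center)
        for tag, vec in tag_vectors.items()
        for topic, center in topic_centers.items()
    }

distances = tag_topic_distances(tag_vectors, topic_centers)
for key, value in sorted(distances.items()):
    print(key, round(value, 3))
```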

Step S302: and counting the average distance between the labels of the manuscript data and each theme according to the distance so as to obtain the scores of the first user under each theme according to the average distance and a preset score mapping table.

Exemplarily, referring to fig. 1, first user 1 has manuscript data a1, a2, …, an. The score of first user 1 under theme 1 is obtained by calculating the average distance between all labels in manuscript data a1, a2, …, an and the center of theme 1, and looking up the score corresponding to that average distance in the preset score mapping table. It should be noted that, in this example, the higher the average distance, the higher the score of first user 1. Of course, the score mapping table may follow other rules, for example, the higher the average distance, the lower the score of the first user. The embodiment of the present invention is explained taking the rule that a higher average distance gives a higher score as an example.

Step S304: and acquiring a first association relation between the first user and each topic according to the score.

Illustratively, the higher the score of first user 1 under topic 1, the closer the association relationship between first user 1 and topic 1. Through this exemplary embodiment, the distances from all the labels of the manuscript data of the first user to all the topics can be combined, and the association relationship between the first user and each topic can be accurately obtained.

In an exemplary embodiment, as shown in fig. 5, the step S302 may include steps S400 to S404.

Step S400: and adding the distances from each label of the first user to each theme to obtain the total distance from the first user to each theme.

Exemplarily, in conjunction with fig. 1, the distances from all the labels in all the manuscript data of first user 1 (a1, a2, a3, …) to the center of topic 1 are added to obtain the total distance from first user 1 to the center of topic 1. Likewise, the distances from all the labels in all the manuscript data of first user 1 to the center of topic 2 are added to obtain the total distance from first user 1 to the center of topic 2. The total distances from first user 2 through first user n to topic 1 and topic 2 are calculated in the same way, which is not described herein again.

Step S402: and dividing the total distance by the number of the labels of the plurality of manuscript data to obtain the average distance between the first user and each theme.

Illustratively, the number of all the labels in all the manuscript data of first user 1 is counted, and the calculated total distance from first user 1 to the center of theme 1 is divided by that number to obtain the average distance from first user 1 to the center of theme 1. Likewise, the calculated total distance from first user 1 to the center of theme 2 is divided by the number of all the labels to obtain the average distance from first user 1 to the center of theme 2. The average distances from first user 2 through first user n to theme 1 and theme 2 are calculated in the same way, which is not described herein again.

Step S404: and mapping the average distance to the score mapping table to obtain the score corresponding to the average distance.

Illustratively, the calculated average distance from first user 1 to theme 1 is looked up in the preset score mapping table, and the score corresponding to the average distance is obtained from the table according to the lookup result.
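Steps S400 to S404 can be sketched as a sum, a division and a table lookup. The interval-based score table below is hypothetical, since the source only says that a preset score mapping table exists. It follows the embodiment's example rule that a higher average distance maps to a higher score, and the distance values are made up.

```python
def average_distance(tag_distances):
    """Sum the label-to-theme distances (S400) and divide by the label count (S402)."""
    return sum(tag_distances) / len(tag_distances)

# Hypothetical score mapping table: (upper bound of average distance, score).
SCORE_TABLE = [(0.5, 1), (1.0, 2), (float("inf"), 3)]

def map_to_score(avg, table=SCORE_TABLE):
    """Return the score of the first interval whose upper bound is not below avg (S404)."""
    for upper_bound, score in table:
        if avg <= upper_bound:
            return score
    raise ValueError("score table does not cover this distance")

# Made-up distances from three labels of first user 1 to the center of theme 1.
distances_to_theme_1 = [0.25, 0.5, 1.5]
avg = average_distance(distances_to_theme_1)
score = map_to_score(avg)
print(avg, score)
```

If the opposite rule is wanted (a higher average distance gives a lower score), only the scores in the table need to be reordered.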

In an exemplary embodiment, as shown in fig. 6, the step S104 may include steps S500 to S506.

Step S500: and counting the operation times of the second user on the first user according to the operation information.

Exemplarily, referring to fig. 1, the operation information of second user A following, liking, commenting on and favoriting first user 1 is acquired, and the number of operations of second user A on first user 1 is counted according to the operation information.

Step S502: and acquiring total operation information executed by the second user within the preset time, and counting the total operation times of the second user within the preset time according to the total operation information.

Illustratively, with reference to fig. 1, the total operation information executed by second user A within the preset time is acquired, which also includes the operation information of second user A on first user 2 through first user n, and the total number of operations of second user A within the preset time is counted according to the total operation information.

Step S504: and calculating the weight relationship between the second user and the first user according to the operation times and the total operation times so as to obtain a second association relationship between the second user and the first user according to the weight relationship.

For example, the number of operations of the second user a on the first user 1 is divided by the total number of operations of the second user a in the preset time to obtain the weight that the first user 1 occupies for the second user a, so as to obtain an association relationship between the second user a and the first user 1 according to the weight, that is, the interest degree of the second user a in the first user 1. The association relationships between the second user a and the first users 2 to n, and between the second users B and C and the first users, are calculated accordingly, which is not described herein again.

In a preferred embodiment, the user partitioning system further presets a weight distance table, and when obtaining the weight of the first user 1 occupying the second user a, matches the weight with a preset distance in the weight distance table to obtain a distance corresponding to the weight, so as to obtain the distance between the second user a and the first user 1 in the heterogeneous network.
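A minimal sketch of the weight calculation of step S504 and of the weight distance table lookup is given below. The rows of the table and the sample operation log are hypothetical; the only property taken from the description is that a higher weight (higher interest) corresponds to a shorter distance in the heterogeneous network.

```python
from collections import Counter

def interest_weights(operations):
    """operations: one first-user id per operation performed by one second user."""
    counts = Counter(operations)
    total = sum(counts.values())
    return {first_user: n / total for first_user, n in counts.items()}

# Hypothetical weight distance table: (minimum weight, distance) rows,
# highest weight first. The values are assumptions for illustration.
WEIGHT_DISTANCE_TABLE = [(0.5, 1.0), (0.25, 2.0), (0.0, 3.0)]

def distance_for(weight):
    """Return the distance of the first row whose minimum weight is reached."""
    for min_weight, distance in WEIGHT_DISTANCE_TABLE:
        if weight >= min_weight:
            return distance
    raise ValueError("weight must be non-negative")

# Made-up log: second user a operated 4 times on first user 1,
# 3 times on first user 2 and once on first user n.
ops_of_user_a = ["first_user_1"] * 4 + ["first_user_2"] * 3 + ["first_user_n"]
weights = interest_weights(ops_of_user_a)
edge_lengths = {user: distance_for(w) for user, w in weights.items()}
```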

Step S506: and generating the heterogeneous network according to the first association relationship and the second association relationship. Illustratively, the heterogeneous network is generated by combining the first association relationship with the second association relationship.

Through the exemplary embodiment, the association relationship between the second user and the first user can be established according to the operation information of the second user on the first user. Combined with the association relationship between the first user and each theme, this yields the heterogeneous network, which provides the network architecture basis for the division of the second user.
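One possible way to combine the two association relationships into a single network is an undirected adjacency map keyed by node, as sketched below. The node naming (`first:`, `second:`, `theme:` prefixes), the data layout and the edge lengths are assumptions made for this illustration and are not prescribed by the specification.

```python
from collections import defaultdict

def build_heterogeneous_network(first_associations, second_associations):
    """Merge both edge sets into one undirected adjacency map.

    first_associations:  {(first_user, theme): edge_length}
    second_associations: {(second_user, first_user): edge_length}
    """
    graph = defaultdict(dict)
    for edges in (first_associations, second_associations):
        for (u, v), length in edges.items():
            graph[u][v] = length
            graph[v][u] = length
    return dict(graph)

# Made-up edge lengths; a shorter edge means a closer association.
first_assoc = {
    ("first:1", "theme:1"): 1.0,
    ("first:1", "theme:2"): 2.5,
    ("first:2", "theme:1"): 1.5,
}
second_assoc = {
    ("second:A", "first:1"): 1.0,
    ("second:A", "first:2"): 2.0,
    ("second:B", "first:1"): 1.0,
}
network = build_heterogeneous_network(first_assoc, second_assoc)
```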

In an exemplary embodiment, as shown in fig. 7, the step S106 may include steps S600 to S606.

Step S600: a sampling range for a third user input is obtained.

Step S602: and taking the second user as a central node, and sampling other second user nodes in the sampling range to obtain other second user node sequences in the sampling range.

Referring to fig. 1, the sampling range input by the third user is obtained, and, with the second user a as the central node, the other second user nodes in the heterogeneous network that fall within the sampling range (for example, a second user B and a second user C) are sampled. The other second user nodes include a second user B having common operations with the second user a (in fig. 1, the second user B and the second user a have both operated on the first user 1 and the first user 2) and a second user C having the same topic of interest as the second user a.
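The sampling of step S602 can be illustrated with a bounded shortest-path search. The sketch assumes that the sampling range is a maximum path length in the heterogeneous network and that second user nodes carry a `second:` prefix; both are assumptions of this example, and the specification does not fix the sampling procedure. Only the sampling is shown, not the subsequent training of word vectors.

```python
import heapq

def sample_second_users(graph, center, sampling_range):
    """Return the other second-user nodes whose shortest-path length from
    `center` does not exceed `sampling_range`, nearest first."""
    best = {center: 0.0}
    heap = [(0.0, center)]
    while heap:
        dist, node = heapq.heappop(heap)
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, length in graph[node].items():
            new_dist = dist + length
            if new_dist <= sampling_range and new_dist < best.get(neighbor, float("inf")):
                best[neighbor] = new_dist
                heapq.heappush(heap, (new_dist, neighbor))
    others = [(d, n) for n, d in best.items()
              if n.startswith("second:") and n != center]
    return [n for _, n in sorted(others)]

# Made-up network: second users are linked through the first users they operate on.
graph = {
    "second:A": {"first:1": 1.0, "first:2": 2.0},
    "second:B": {"first:1": 1.0},
    "second:C": {"first:2": 3.0},
    "first:1": {"second:A": 1.0, "second:B": 1.0},
    "first:2": {"second:A": 2.0, "second:C": 3.0},
}
nearby = sample_second_users(graph, "second:A", 4.0)
wider = sample_second_users(graph, "second:A", 5.0)
```

With these made-up edge lengths, widening the range from 4.0 to 5.0 brings the second user C into the sampled sequence.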

Step S604: and training the other second user nodes according to the other second user node sequences to obtain corresponding other second user word vectors.

Step S606: and clustering the second user and the other second users according to the word vectors of the other second users and the preset algorithm, and forming the community according to a clustering result.

Illustratively, the second user a, the second user B and the second user C are clustered to form a community by using a Kmeans algorithm. According to the exemplary embodiment, the second users in the preset sampling range are sampled to cluster the second users in the sampling range, so that the second users can be divided more finely.
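As a rough illustration of the clustering in step S606, the following is a plain implementation of the Kmeans (Lloyd's) iteration on already trained second user vectors. The two-dimensional vectors, the extra second user D and the choice of seeding the centers with the first k vectors are assumptions made so that the example is small and deterministic.

```python
import math

def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's algorithm; the first k vectors seed the centers."""
    names = list(vectors)
    centers = [list(vectors[n]) for n in names[:k]]
    assignment = {}
    for _ in range(iterations):
        # Assign each second user to the nearest center.
        new_assignment = {
            n: min(range(k), key=lambda c: math.dist(vectors[n], centers[c]))
            for n in names
        }
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Move each center to the mean of its members.
        for c in range(k):
            members = [vectors[n] for n in names if assignment[n] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

# Made-up word vectors of second users (D is a hypothetical outlier).
user_vectors = {
    "second:A": (0.9, 0.1),
    "second:D": (0.1, 0.9),
    "second:B": (0.8, 0.2),
    "second:C": (1.0, 0.0),
}
communities = kmeans(user_vectors, k=2)
```

Second users that receive the same cluster index form one community.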

Based on the user partition method provided in the foregoing embodiment, the present embodiment provides a user partition system, and the user partition system may be applied to a computer device. In particular, FIG. 8 illustrates an alternative block diagram of the user-partitioned system, which is partitioned into one or more program modules, which are stored in a storage medium and executed by one or more processors to implement the present invention. The program module referred to in the present invention refers to a series of computer program instruction segments capable of performing specific functions, and is more suitable for describing the execution process of the user partitioning system in the storage medium than the program itself.

As shown in fig. 8, the user segmentation system specifically includes the following components:

an obtaining module 201, configured to obtain multiple pieces of manuscript data uploaded by a first user and operation information of a second user on the first user within a preset time, where each piece of manuscript data includes manuscript content and at least one tag, and the operation information includes: paying attention to the first user, commenting on the manuscript data of the first user and collecting the manuscript data of the first user.

Illustratively, in conjunction with fig. 1, when the first user 1 needs to upload a video about food, the first user 1 tags the video, for example with a food tag, and then uploads the video with the food tag, and the obtaining module 201 obtains the video with the food tag. When the first user 2 needs to upload a video about diet slimming, the tags of the video include a food tag and a sports tag, the first user 2 uploads the video with the food tag and the sports tag, and the obtaining module 201 obtains the video with the food tag and the sports tag. The manuscript content can also be an audio file or text.

A generating module 202, configured to generate a theme corresponding to each piece of manuscript data according to the plurality of pieces of manuscript data, and generate a corresponding heterogeneous network according to the theme and the operation information, where the heterogeneous network includes a plurality of nodes, and the plurality of nodes includes a first user node, a second user node, and a theme node.

Illustratively, since there are multiple tags in the videos uploaded by multiple first users, the generating module 202 generates multiple topics according to the multiple tags, so as to establish the relationship between the first users and the respective topics. For example: generating a food theme according to the food label of the first user 1 and food labels of videos uploaded by other first users (such as the first user 2 and the first user 3); and generating a sports theme according to the sports label of the first user and the sports labels uploaded by other first users. And then, according to the operation of a second user on the first user, generating the relationship between the first user and the second user, further establishing an association relationship among the first user, the second user and the theme, and constructing a heterogeneous network taking the first user, the second user and the theme as nodes. It should be noted that the heterogeneous network is used to indicate the degree of association between the first user, the second user and the theme. The shorter the side length between the first user and the second user is, the higher the interest degree of the second user in the first user is, and the higher the association degree between the second user and the first user is. The shorter the side length between the first user and the theme is, the closer the manuscript data uploaded by the first user is to the theme is, and the higher the association degree between the first user and the theme is.

With reference to fig. 1, a second user a performs related operations on a first user 1, a first user 2 and a first user n to establish an association relationship between the second user a and the first user 1, the first user 2 and the first user n; a second user B performs related operations on the first user 1 and the first user 2 to establish an association relationship between the second user B and the first user 1 and the first user 2; and a second user C performs related operations on the first user 2 and the first user n to establish an association relationship between the second user C and the first user 2 and the first user n. The generating module 202 generates a theme 1 according to the manuscript data a1 and a2 of the first user 1, the manuscript data b1, b2 and b3 of the first user 2, and the manuscript data k1 of the first user n, and generates a theme 2 according to the manuscript data a3 of the first user 1 and the manuscript data k2 and k3 of the first user n, so as to establish an association relationship between the first user 1, the first user 2 and the first user n and the themes 1 and 2. The generating module 202 then generates the heterogeneous network according to the association relationships between the second user a and the first user 1, the first user 2 and the first user n, between the second user B and the first user 1 and the first user 2, and between the second user C and the first user 2 and the first user n, together with the association relationships between the first user 1, the first user 2 and the first user n and the theme 1 and the theme 2.

The dividing module 203 is configured to cluster the second users according to the heterogeneous network, so that the second users of the same class form a community according to a clustering result, and divide the second users.

Illustratively, the dividing module 203 divides the second users having similar interests into the same community according to the heterogeneous network to complete the division of the second users. With reference to fig. 1, if, among all the first users, the first user 1 has the closest association relationship with the second user a and also the closest association relationship with the second user B, that is, the second user a is most interested in the first user 1 and the second user B is also most interested in the first user 1, and the first user 1 is closest to the theme 1, the dividing module 203 clusters the second user a and the second user B into one class, so that the second user a and the second user B form a community and are thereby divided into the same community.

In an exemplary embodiment, the generating module 202 further includes a screening unit, an analyzing unit, and a first clustering unit:

and the screening unit is used for screening the labels of the manuscript data according to a preset rule so as to screen out a plurality of target labels.

In a preferred embodiment, the screening unit is further configured to obtain, from a database, tags of the plurality of pieces of manuscript data of the first user and heat information of each tag, count the number of contents of the manuscript under each tag, and then screen the tags of the plurality of pieces of manuscript data according to the number of contents of the manuscript under each tag and the heat information of each tag, so as to screen a plurality of target tags from the plurality of tags.

For example, referring to fig. 1, since the number of videos uploaded by the first user 1, the first user 2 … is large and each video has at least one tag, all tags of all videos uploaded by the first users 1 to n and the heat information (i.e. popularity information) of each tag are obtained, and the number of videos under each tag is counted. For example, if the only label of the manuscript data a1 is food and the labels of the manuscript data a2 are food and travel, the number of videos under the food label is 2. By integrating the number of videos under each label and the heat information of each label, the quality of each label is evaluated, so that the screening unit screens out target labels (i.e. high-quality labels) with a large number of videos and high heat from all the labels. Of course, the number of the screened target labels can be set arbitrarily according to the situation.
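The screening described above can be sketched as follows. The specification does not state how the number of videos and the heat information are combined, so the product used here as a quality measure, as well as the heat values and the extra manuscripts b1 and b2, are assumptions of this example.

```python
from collections import Counter

def screen_tags(manuscripts, heat, top_n):
    """manuscripts: {manuscript_id: [labels]}; heat: {label: heat value}.

    Returns the top_n target labels and the manuscript count per label."""
    counts = Counter(tag for tags in manuscripts.values() for tag in set(tags))
    # Assumed quality measure: manuscript count multiplied by heat.
    quality = {tag: counts[tag] * heat.get(tag, 0.0) for tag in counts}
    ranked = sorted(quality, key=lambda t: (-quality[t], t))
    return ranked[:top_n], counts

manuscripts = {
    "a1": ["food"],
    "a2": ["food", "travel"],
    "b1": ["sports", "food"],
    "b2": ["sports"],
}
heat = {"food": 0.9, "travel": 0.2, "sports": 0.6}  # made-up heat values
target_tags, counts = screen_tags(manuscripts, heat, top_n=2)
```

Restricted to the manuscript data a1 and a2 alone, the count under the food label is 2, which matches the example in the text.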

The first obtaining unit is configured to obtain the number of times that the plurality of target tags commonly appear in the plurality of manuscript data.

And the analysis unit is used for analyzing the distance of the target labels in the heterogeneous network according to the times and training the target labels into corresponding target label word vectors. It should be noted that, the more times two target tags appear together in multiple manuscript data, the closer the two target tags are in the heterogeneous network.

For example, for weight loss, it is often necessary to exercise while controlling diet. Therefore, in a video about weight loss, two tags, namely a food tag and a sports tag, often appear together. In this case, the first obtaining unit obtains the number of times that the food tag and the sports tag commonly appear in the manuscript data, so that the analyzing unit analyzes the distance between the food tag and the sports tag in the heterogeneous network according to that number of times. The food tag and the sports tag are then trained into a corresponding food tag word vector and sports tag word vector.
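Counting how often two target tags appear together can be sketched as below. The conversion from a co-occurrence count to a distance, `1 / (1 + count)`, is a hypothetical choice that only reflects the stated rule that tags appearing together more often are closer; the specification does not give a formula, and the training of word vectors is not shown.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(manuscripts, target_tags):
    """Count, for each pair of target tags, the manuscripts containing both."""
    counts = Counter()
    for tags in manuscripts.values():
        present = sorted(set(tags) & set(target_tags))
        counts.update(combinations(present, 2))
    return counts

def tag_distance(counts, tag_a, tag_b):
    """Assumed mapping: more co-occurrences give a shorter distance."""
    pair = tuple(sorted((tag_a, tag_b)))
    return 1.0 / (1 + counts[pair])

manuscripts = {
    "b1": ["food", "sports"],
    "b2": ["food", "sports", "travel"],
    "a1": ["food"],
}
counts = cooccurrence_counts(manuscripts, {"food", "sports", "travel"})
```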

The first clustering unit is used for clustering the plurality of target labels according to the target label word vectors, the distance and the preset algorithm so as to form a theme corresponding to each manuscript data. The preset algorithm may include any one of a Kmeans clustering algorithm, a mean shift clustering algorithm, and a density-based clustering algorithm.

In an exemplary embodiment, the generating module 202 further comprises a computing unit.

The calculating unit is used for calculating the distance between each label and each theme.

Illustratively, the calculation unit calculates the distance between each label and each topic by using a distance function in a Kmeans algorithm. For example, in conjunction with fig. 1, fig. 1 illustrates only two topics 1 and 2 as examples. The calculating unit calculates the distances from the food labels in the manuscript data a1 to the centers of the theme 1 and the theme 2 respectively, calculates the distances from the food labels and the travel labels in the manuscript data a2 to the centers of the theme 1 and the theme 2 respectively, and calculates the distances from the labels in all the manuscript data to the centers of the theme 1 and the theme 2 respectively in sequence.
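A minimal sketch of the distance calculation is shown below, assuming the Euclidean distance commonly used with Kmeans; the specification only refers to a distance function in a Kmeans algorithm without naming it. The tag word vectors and theme centers are made-up two-dimensional values.

```python
import math

def tag_theme_distances(tag_vectors, theme_centers):
    """Euclidean distance from every tag word vector to every theme center."""
    return {
        tag: {theme: math.dist(vec, center) for theme, center in theme_centers.items()}
        for tag, vec in tag_vectors.items()
    }

theme_centers = {"theme 1": (1.0, 0.0), "theme 2": (0.0, 1.0)}  # hypothetical centers
tag_vectors = {"food": (1.0, 0.0), "travel": (0.0, 0.0)}        # hypothetical vectors
distances = tag_theme_distances(tag_vectors, theme_centers)
```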

The calculating unit is further configured to count average distances between the labels of the multiple pieces of manuscript data and the respective topics according to the distances, so as to obtain scores of the first user under the respective topics according to the average distances and a preset score mapping table.

Exemplarily, referring to fig. 1, the first user 1 includes manuscript data a1 and manuscript data a2 …, and the calculating unit obtains the score of the first user 1 under the theme 1 by calculating an average distance between all labels in the manuscript data a1 and the manuscript data a2 … and the center of the theme 1, and obtaining the score corresponding to the average distance according to a preset score mapping table. It should be noted that the higher the average distance, the higher the score of the first user 1. Of course, other rules of the score mapping table may be set, for example, the higher the average distance, the lower the score of the first user. In the embodiment of the present invention, the higher the average distance is, the higher the score of the first user is taken as an example for explanation.

The first obtaining unit is used for obtaining a first association relation between the first user and each topic according to the score.

In an exemplary embodiment, the computing unit is further configured to:

adding the distances from each label of the first user to each theme to obtain the total distance from the first user to each theme; dividing the total distance by the number of the labels of the plurality of manuscript data to obtain an average distance between the first user and each theme; and mapping the average distance to the score mapping table to obtain the score corresponding to the average distance.

Exemplarily, with reference to fig. 1, first, the calculating unit adds the distances from all the labels in the manuscript data a1, a2 … of the first user 1 to the center of the theme 1 to obtain the total distance from the first user 1 to the center of the theme 1. The calculating unit likewise adds the distances from all the labels in the manuscript data a1, a2 … and a3 of the first user 1 to the center of the theme 2 to obtain the total distance from the first user 1 to the center of the theme 2. The total distances from the first user 2 through the first user n to the theme 1 and the theme 2 are calculated in the same way, which is not described herein again.

Then, the number of all the labels in all the manuscript data of the first user 1 is counted, and the calculated total distance from the first user 1 to the center of the theme 1 is divided by the number of all the labels to obtain the average distance from the first user 1 to the center of the theme 1. Likewise, the calculated total distance from the first user 1 to the center of the theme 2 is divided by the number of all the labels to obtain the average distance from the first user 1 to the center of the theme 2. The average distances from the first user 2 through the first user n to the theme 1 and the theme 2 are calculated in the same way, which is not described herein again.

Finally, the calculated average distance from the first user 1 to the theme 1 is looked up in the preset score mapping table, and the score corresponding to the average distance is obtained from the score mapping table according to the lookup result.

In an exemplary embodiment, the generating module 202 further includes a statistic unit and a generating unit.

And the counting unit is used for counting the operation times of the second user on the first user according to the operation information.

For example, referring to fig. 1, after the obtaining module 201 obtains operation information of attention, praise, comment and collection of a second user a to the first user 1, the counting unit counts the operation times of the second user a to the first user 1 according to the operation information.

The first obtaining unit is further configured to obtain total operation information executed by the second user within the preset time.

The counting unit is further configured to count the total operation times of the second user in the preset time according to the total operation information.

Illustratively, with reference to fig. 1, the first obtaining unit obtains total operation information, which is executed by the second user a within a preset time and includes operation information of the second user a on the first user 2 to the first user n, and then the counting unit counts the total operation times of the second user a within the preset time according to the total operation information.

The calculating unit is configured to calculate a weight relationship between the second user and the first user according to the operation times and the total operation times, so as to obtain a second association relationship between the second user and the first user according to the weight relationship.

The generating unit is configured to generate the heterogeneous network according to the first association relationship and the second association relationship. Illustratively, the heterogeneous network is generated by combining the first association relationship with the second association relationship. Through the exemplary embodiment, the association relationship between the second user and the first user can be established according to the operation information of the second user on the first user. Combined with the association relationship between the first user and each theme, this yields the heterogeneous network, which provides the network architecture basis for the division of the second user.

In an exemplary embodiment, the partitioning module 203 includes a second obtaining unit, a sampling unit, a training unit, and a second clustering unit.

The second obtaining unit is used for obtaining a sampling range input by a third user.

And the sampling unit is used for sampling other second user nodes in the sampling range by taking the second user as a central node so as to obtain other second user node sequences in the sampling range.

Referring to fig. 1, the second obtaining unit obtains the sampling range input by the third user, and then the sampling unit, with the second user a as the central node, samples the other second user nodes in the heterogeneous network that fall within the sampling range (for example, a second user B and a second user C). The other second user nodes include a second user B having common operations with the second user a (in fig. 1, the second user B and the second user a have both operated on the first user 1 and the first user 2) and a second user C having the same topic of interest as the second user a.

And the training unit is used for training the other second user nodes according to the other second user node sequences so as to obtain corresponding other second user word vectors.

The second clustering unit is used for clustering the second user and the other second users according to the word vectors of the other second users and the preset algorithm, and forming the community according to a clustering result.

The embodiment also provides a computer device capable of executing programs, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server or a tower server (including an independent server or a server cluster composed of a plurality of servers), and the like. As shown in fig. 9, the computer device 30 of the present embodiment includes at least, but is not limited to: a memory 301 and a processor 302 that are communicatively coupled to each other via a system bus. It is noted that fig. 9 only shows the computer device 30 having the components 301 and 302, but it is understood that not all of the shown components are required and that more or fewer components may be implemented instead.

In this embodiment, the memory 301 (i.e., the readable storage medium) includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 301 may be an internal storage unit of the computer device 30, such as a hard disk or a memory of the computer device 30. In other embodiments, the memory 301 may also be an external storage device of the computer device 30, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like, provided on the computer device 30. Of course, the memory 301 may also include both internal and external storage devices of the computer device 30. In the present embodiment, the memory 301 is generally used for storing an operating system and various types of application software installed in the computer device 30, such as the program code of the user partitioning system of the above-described embodiment. In addition, the memory 301 may also be used to temporarily store various types of data that have been output or are to be output.

Processor 302 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 302 generally serves to control the overall operation of the computer device 30.

Specifically, in this embodiment, the processor 302 is configured to execute the program of the user partition method stored in the memory 301, and when executed, the program of the user partition method implements the following steps:

For the specific embodiment of the process of the above method steps, reference may be made to the above embodiments, and details of this embodiment are not repeated herein.

The present embodiment also provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an app store, etc., having stored thereon a computer program that, when executed by a processor, implements the method steps of:

The user partitioning method, the user partitioning system, the computer device, and the readable storage medium provided in this embodiment acquire a plurality of manuscript data uploaded by a first user, generate a theme corresponding to each manuscript data according to the plurality of manuscript data, then acquire operation information of a second user on the first user within a preset time, generate a corresponding heterogeneous network according to the theme and the operation information, and then cluster the second users according to the heterogeneous network, so that the second users of the same class form a community according to a clustering result, thereby partitioning the second users. The invention can mine and process the data of multiple types to realize the accurate division of the user, thereby providing more refined service for the user according to the division result of the user.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A method for user segmentation, the method comprising:

2. The user segmentation method according to claim 1, wherein the obtaining of a plurality of manuscript data uploaded by a first user and the generating of a theme corresponding to each of the manuscript data according to the plurality of manuscript data comprises:

3. The user segmentation method according to claim 2, wherein the filtering the tags of the plurality of manuscript data according to a preset rule to filter out a plurality of target tags comprises:

acquiring the heat information of each label from a database;

4. The user segmentation method according to claim 3, wherein the generating of the corresponding heterogeneous network according to the theme and the operation information includes:

calculating the distance between each label and each theme;

5. The user segmentation method according to claim 4, wherein the calculating an average distance between the labels of the plurality of manuscript data and each topic according to the distance to obtain the score of the first user under each topic according to the average distance and a preset score mapping table comprises:

6. The user segmentation method according to claim 4, wherein the generating of the corresponding heterogeneous network according to the theme and the operation information further comprises:

7. The user segmentation method according to claim 1, wherein the clustering the second users according to the heterogeneous network so that the second users of the same class form a community according to the clustering result comprises:

acquiring a sampling range input by a third user;

8. A user segmentation system, the system comprising:

9. A computer device, the computer device comprising: memory, processor and computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the user segmentation method of any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the user segmentation method according to any one of claims 1 to 7.