WO2018006631A1 - Method and system for automatically dividing user levels - Google Patents

Info

Publication number
WO2018006631A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
class
feature
module
distance
Prior art date
Application number
PCT/CN2017/080777
Other languages
English (en)
French (fr)
Inventor
龚灿
Original Assignee
武汉斗鱼网络科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 武汉斗鱼网络科技有限公司
Publication of WO2018006631A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Definitions

  • The invention relates to the field of data mining technology, and in particular to a method and system for automatically dividing user levels.
  • The purpose of the present invention is to overcome the deficiencies of the background art above and to provide a method and system for automatically dividing user levels that is accurate, efficient, and saves labor cost.
  • To achieve this purpose, the present invention provides a live room recommendation method for a live broadcast website, including the following steps:
  • Step S1, selecting sample data: select user behavior data within a specified time period as the original sample data, and proceed to step S2;
  • Step S2, selecting user features: select at least one user feature from the user behavior data as a dimension for calculating distance, and proceed to step S3;
  • Step S3, determining the number of classes K: determine K, a positive integer, according to how many class levels the users are to be divided into, and proceed to step S4;
  • Step S4, determining the initial class centers: randomly select K users from the original sample data as the initial class centers, and proceed to step S5;
  • Step S5, classification: using the dimensions selected in step S2, measure the distance D from each remaining user in the original sample data to each current class center; assign each remaining user to the nearest class, completing the division into K classes, and proceed to step S6;
  • Step S6, calculating new class centers: within the K classes just divided, recalculate the class center of each class, and proceed to step S7;
  • Step S7: repeat steps S5 and S6 until each new class center equals the previous class center or the amount of change is less than a specified threshold, then stop iterating; the K classes as currently divided are the required user level classification.
  • On the basis of the above technical solution, the user features in step S2 include the user's viewing duration, number of views, number of bullet comments sent, number of free items sent, number of free items claimed online, number of paid items sent, number of rooms followed, and number of partitions followed.
  • After step S2, a feature value normalization operation is also included: each selected user feature is normalized by Y = (X - MinValue(X)) / (MaxValue(X) - MinValue(X)), where Y is the normalized feature value, X is a user feature value of a given user feature, MinValue(X) is the smallest user feature value of that feature, and MaxValue(X) is the largest; the normalized user feature values all fall within (0, 1].
  • In step S5, the distance D is calculated as $D = (x_j - \mu_i)^2$, where
  • $x_j$ is the j-th user feature,
  • j is a positive integer,
  • $\mu_i$ is the class center of the i-th class, and
  • i is a positive integer from 1 to K.
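  • Step S6 specifically includes the following operations. Step S601: for each user in each of the current K classes, calculate the sum V of the distances from that user to the other users in the same class (the formula for V is given as Figure PCTCN2017080777-appb-000001), where
  • $x_j$ is the j-th user feature,
  • j is a positive integer,
  • $\mu_i$ is the class center of the i-th class,
  • i is a positive integer from 1 to K, and
  • $s_i$ denotes a set of user features; then proceed to step S602. Step S602: for each of the K classes, select the user with the smallest distance sum as the new class center of that class.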
  • The invention also provides a system for automatically dividing user levels, comprising a sample data selection module, a user feature selection module, an initial class center determination module, a classification module, a new class center calculation module, and an iterative operation module;
  • the sample data selection module is configured to select user behavior data within a specified time period as the original sample data; the user feature selection module is configured to select at least one user feature from the user behavior data as a dimension for calculating distance;
  • the initial class center determination module is configured to determine the number of classes K, a positive integer, according to how many class levels the users are to be divided into, and to randomly select K users from the original sample data as the initial class centers;
  • the classification module is configured to measure, using the dimensions selected by the user feature selection module, the distance D from each remaining user in the original sample data to each current class center, and to assign each remaining user to the nearest class, completing the division into K classes; the new class center calculation module is configured to recalculate the class center of each of the K classes just divided; the iterative operation module is configured to repeatedly invoke the classification module and the new class center calculation module until each new class center equals the previous class center or the amount of change is less than a specified threshold, then stop iterating, the K classes as currently divided being the required user level classification.
  • On the basis of the above technical solution, the user features include the user's viewing duration, number of views, number of bullet comments sent, number of free items sent, number of free items claimed online, number of paid items sent, number of rooms followed, and number of partitions followed.
  • The classification module measures the distance D as $D = (x_j - \mu_i)^2$, where
  • $x_j$ is the j-th user feature,
  • j is a positive integer,
  • $\mu_i$ is the class center of the i-th class, and
  • i is a positive integer from 1 to K.
  • The new class center calculation module recalculates the class center of each class as follows: for each user in each of the current K classes, it calculates the sum V of the distances from that user to the other users in the same class
  • (the formula for V is given as Figure PCTCN2017080777-appb-000002), where
  • $x_j$ is the j-th user feature,
  • j is a positive integer,
  • $\mu_i$ is the class center of the i-th class,
  • i is a positive integer from 1 to K, and
  • $s_i$ denotes a set of user features; for each of the K classes, the user with the smallest distance sum is selected as the new class center of that class.
  • When dividing user levels, the present invention first selects user behavior data within a specified time period as the original sample data; then selects at least one user feature as a dimension for calculating distance; once the number of classes K is determined, randomly selects K users from the original sample data as the initial class centers; next measures the distance from each remaining user to each current class center and assigns each remaining user to the nearest class, completing the division into K classes; then recalculates the class center of each class; and finally repeats the classification and new class center calculation until each new class center equals the previous class center or the amount of change is less than a specified threshold, at which point iteration stops and the K classes as currently divided are the required user level classification.
  • Compared with the prior art, the present invention divides user levels automatically, making the division process more intelligent and automated; the resulting classification is of high quality, efficient, and reliable, saves labor cost, and gives a good user experience.
  • In the present invention, after the user features are selected, each selected feature is normalized; this prevents differing units among the selected features from distorting the classification result and so improves the accuracy of the user level division.
  • Compared with the traditional K-means clustering algorithm, the K-medios clustering on which the present invention is based takes a central member of each class as the class center, so the algorithm is less affected by outliers and the classification is more accurate.
  • FIG. 1 is a flowchart of a method for automatically dividing a user level according to an embodiment of the present invention
  • FIG. 2 is a structural block diagram of a user level automatic division system according to an embodiment of the present invention.
  • The K-means algorithm is one of the most widely used partition-based hard clustering algorithms and a typical representative of prototype-based objective function clustering methods. It takes some distance from the data points to the prototypes as the objective function to be optimized, and derives the update rule of the iteration by finding the extremum of that function.
  • The K-means algorithm uses Euclidean distance as the similarity measure and seeks the optimal classification corresponding to an initial cluster center vector V, such that the evaluation index J is minimized.
  • The algorithm uses the sum of squared errors criterion as the clustering criterion function.
  • The present invention instead uses K-medios, an improved variant of K-means clustering.
  • The basic principle of K-medios is the same as that of K-means clustering, but K-means determines each class center by computing the centroid of the class (that is, the mean), whereas K-medios determines it by computing the center of the class (the point in the class that is closest to all other points in the class).
  • Compared with traditional K-means clustering, K-medios takes a central member as the class center, so the algorithm is less affected by outliers and the classification is more accurate.
  • Based on the above improvement and with reference to FIG. 1, an embodiment of the present invention provides a method for automatically dividing user levels, which is based on the K-medios clustering algorithm and specifically includes the following steps:
  • Step S1, selecting sample data: select user behavior data within a specified time period as the original sample data, and proceed to step S2.
  • In practice, the specified time period can be set by the designer as circumstances require.
  • It is typically set to one month, that is, user behavior data from within one month is selected as the original sample data.
  • Step S2, selecting user features: select at least one user feature from the user behavior data as a dimension for calculating distance. The user features include the user's viewing duration, number of views, number of bullet comments sent, number of free items sent (for example, fish balls), number of free items claimed online (for example, fish balls), number of paid items sent (for example, the amount of shark fins), number of rooms followed, and number of partitions followed. Proceed to step S3.
  • To prevent differing units among the selected user features from affecting the classification result, a feature value normalization operation is also included after step S2: each selected user feature is normalized using the formula:
  • Y = (X - MinValue(X)) / (MaxValue(X) - MinValue(X)), where Y is the normalized feature value, X is a user feature value of a given user feature, MinValue(X) is the smallest user feature value of that feature, and MaxValue(X) is the largest; the normalized user feature values all fall within (0, 1].
  • Step S3, determining the number of classes K: determine K, a positive integer, according to how many class levels the users are to be divided into, and proceed to step S4.
  • Step S4, determining the initial class centers: randomly select K users from the original sample data as the initial class centers, and proceed to step S5.
  • Step S5, classification: using the dimensions selected in step S2 (that is, the user features), measure the distance D from each remaining user in the original sample data (that is, each user other than the class centers) to each current class center.
  • The distance is calculated as $D = (x_j - \mu_i)^2$, where
  • $x_j$ is the j-th user feature (j is a positive integer) and
  • $\mu_i$ is the class center of the i-th class (i is a positive integer from 1 to K).
  • Each remaining user is assigned to the nearest class, completing the division into K classes, and the process proceeds to step S6. At this point every user belongs to one of the K classes.
  • Step S6, calculating new class centers: within the K classes just divided, recalculate the class center of each class, and proceed to step S7.
  • In practice, step S6 specifically includes the following operations:
  • Step S601: for each user in each of the current K classes, calculate the sum V of the distances from that user to the other users in the same class (the formula for V is given as Figure PCTCN2017080777-appb-000003),
  • where $s_i$ denotes a set of user features; then proceed to step S602.
  • Step S602: for each of the K classes, select the user with the smallest distance sum as the new class center of that class.
  • Step S7: repeat steps S5 and S6 until each new class center equals the previous class center (the class center from before this iteration), that is, remains unchanged, or the amount of change is less than a specified threshold; then stop iterating. The K classes as currently divided are the required user level classification.
  • In this embodiment the specified threshold is 1%, that is, the amount of change is less than 1%.
  • With reference to FIG. 2, an embodiment of the present invention further provides a system for automatically dividing user levels.
  • The system includes a sample data selection module, a user feature selection module, an initial class center determination module, a classification module, a new class center calculation module, and an iterative operation module.
  • The sample data selection module is configured to select user behavior data within a specified time period as the original sample data;
  • the user feature selection module is configured to select at least one user feature from the user behavior data as a dimension for calculating distance;
  • the initial class center determination module is configured to determine the number of classes K, a positive integer, according to how many class levels the users are to be divided into, and to randomly select K users from the original sample data as the initial class centers;
  • the classification module is configured to measure, using the dimensions selected by the user feature selection module, the distance D from each remaining user in the original sample data to each current class center, calculated as $D = (x_j - \mu_i)^2$, where
  • $x_j$ is the j-th user feature,
  • j is a positive integer,
  • $\mu_i$ is the class center of the i-th class, and
  • i is a positive integer from 1 to K;
  • each remaining user is assigned to the nearest class, completing the division into K classes;
  • the new class center calculation module is configured to recalculate the class center of each of the K classes just divided;
  • the iterative operation module is configured to repeatedly invoke the classification module and the new class center calculation module to perform the classification and new class center calculation until each new class center equals the previous class center or the amount of change is less than a specified threshold, then stop iterating; the K classes as currently divided are the required user level classification.
  • To prevent differing units among the selected user features from affecting the classification result, the system also includes a feature value normalization module.

Abstract

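A method and system for automatically dividing user levels, relating to the field of data mining technology. The method includes: selecting original sample data; selecting at least one user feature as a dimension for calculating distance; determining the number of classes K; randomly selecting K users from the original sample data as the initial class centers; measuring the distance from each remaining user in the original sample data to each current class center and assigning each remaining user to the nearest class, completing the division into K classes; recalculating the class center of each class; and repeating S5 and S6 until each new class center equals the previous class center or the amount of change is less than a specified threshold, at which point iteration stops and the K classes as currently divided are the required user level classification. The method divides user levels automatically, is accurate and efficient, and saves labor cost.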

Description

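Method and System for Automatically Dividing User Levels
Technical Field
The present invention relates to the field of data mining technology, and in particular to a method and system for automatically dividing user levels.
Background Art
With the rapid development of Internet technology, more and more users can use terminals such as computers and mobile phones to work and be entertained on all kinds of websites. For those websites, the user base keeps growing as the number of users increases. To serve an ever larger user base, improve service quality, and enhance the user experience, it is usually necessary to divide users into levels. For example, in the business scenarios of a live video streaming website, users are usually divided into a series of levels in order to stimulate their interest in viewing and to raise view counts and improve the user experience.
At present, major websites generally divide user levels manually, based on experience. In practice, however, purely manual division tends to be highly subjective, so the division criteria are inconsistent. In addition, with massive data, user data usually has many dimensions and a large volume; level criteria judged by hand are often inaccurate and have insufficient coverage, the repetitive work easily leads to mistakes, and manual operation is slow, inefficient, and costly in labor.
Summary of the Invention
The purpose of the present invention is to overcome the deficiencies of the background art above and to provide a method and system for automatically dividing user levels that is accurate, efficient, and saves labor cost.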
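To achieve this purpose, the present invention provides a live room recommendation method for a live broadcast website, including the following steps:
Step S1, selecting sample data: select user behavior data within a specified time period as the original sample data, and proceed to step S2;
Step S2, selecting user features: select at least one user feature from the user behavior data as a dimension for calculating distance, and proceed to step S3;
Step S3, determining the number of classes K: determine K, a positive integer, according to how many class levels the users are to be divided into, and proceed to step S4;
Step S4, determining the initial class centers: randomly select K users from the original sample data as the initial class centers, and proceed to step S5;
Step S5, classification: using the dimensions selected in step S2, measure the distance D from each remaining user in the original sample data to each current class center; assign each remaining user to the nearest class, completing the division into K classes, and proceed to step S6;
Step S6, calculating new class centers: within the K classes just divided, recalculate the class center of each class, and proceed to step S7;
Step S7: repeat steps S5 and S6 until each new class center equals the previous class center or the amount of change is less than a specified threshold, then stop iterating; the K classes as currently divided are the required user level classification.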
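On the basis of the above technical solution, the user features in step S2 include the user's viewing duration, number of views, number of bullet comments sent, number of free items sent, number of free items claimed online, number of paid items sent, number of rooms followed, and number of partitions followed.
On the basis of the above technical solution, a feature value normalization operation is also included after step S2: each selected user feature is normalized by Y = (X - MinValue(X)) / (MaxValue(X) - MinValue(X)), where Y is the normalized feature value, X is a user feature value of a given user feature, MinValue(X) is the smallest user feature value of that feature, and MaxValue(X) is the largest; the normalized user feature values all fall within (0, 1].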
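On the basis of the above technical solution, in step S5 the distance D is calculated as:
$D = (x_j - \mu_i)^2$
where $x_j$ is the j-th user feature, j is a positive integer, $\mu_i$ is the class center of the i-th class, and i is a positive integer from 1 to K.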
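On the basis of the above technical solution, step S6 specifically includes the following operations. Step S601: for each user in each of the current K classes, calculate the sum V of the distances from that user to the other users in the same class; the distance sum is calculated as:
Figure PCTCN2017080777-appb-000001
where $x_j$ is the j-th user feature, j is a positive integer, $\mu_i$ is the class center of the i-th class, i is a positive integer from 1 to K, and $s_i$ denotes a set of user features; then proceed to step S602. Step S602: for each of the K classes, select the user with the smallest distance sum as the new class center of that class.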
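The present invention also provides a system for automatically dividing user levels, comprising a sample data selection module, a user feature selection module, an initial class center determination module, a classification module, a new class center calculation module, and an iterative operation module. The sample data selection module is configured to select user behavior data within a specified time period as the original sample data. The user feature selection module is configured to select at least one user feature from the user behavior data as a dimension for calculating distance. The initial class center determination module is configured to determine the number of classes K, a positive integer, according to how many class levels the users are to be divided into, and to randomly select K users from the original sample data as the initial class centers. The classification module is configured to measure, using the dimensions selected by the user feature selection module, the distance D from each remaining user in the original sample data to each current class center, and to assign each remaining user to the nearest class, completing the division into K classes. The new class center calculation module is configured to recalculate the class center of each of the K classes just divided. The iterative operation module is configured to repeatedly invoke the classification module and the new class center calculation module to perform the classification and new class center calculation until each new class center equals the previous class center or the amount of change is less than a specified threshold, then stop iterating; the K classes as currently divided are the required user level classification.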
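On the basis of the above technical solution, the user features include the user's viewing duration, number of views, number of bullet comments sent, number of free items sent, number of free items claimed online, number of paid items sent, number of rooms followed, and number of partitions followed.
On the basis of the above technical solution, the system also includes a feature value normalization module, which normalizes each selected user feature by Y = (X - MinValue(X)) / (MaxValue(X) - MinValue(X)), where Y is the normalized feature value, X is a user feature value of a given user feature, MinValue(X) is the smallest user feature value of that feature, and MaxValue(X) is the largest; the normalized user feature values all fall within (0, 1].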
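On the basis of the above technical solution, the classification module measures the distance D as:
$D = (x_j - \mu_i)^2$
where $x_j$ is the j-th user feature, j is a positive integer, $\mu_i$ is the class center of the i-th class, and i is a positive integer from 1 to K.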
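On the basis of the above technical solution, the new class center calculation module recalculates the class center of each class as follows: for each user in each of the current K classes, it calculates the sum V of the distances from that user to the other users in the same class; the distance sum is calculated as:
Figure PCTCN2017080777-appb-000002
where $x_j$ is the j-th user feature, j is a positive integer, $\mu_i$ is the class center of the i-th class, i is a positive integer from 1 to K, and $s_i$ denotes a set of user features; for each of the K classes, the user with the smallest distance sum is selected as the new class center of that class.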
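The beneficial effects of the present invention are as follows:
(1) When dividing user levels, the present invention first selects user behavior data within a specified time period as the original sample data; then selects at least one user feature as a dimension for calculating distance; once the number of classes K is determined, randomly selects K users from the original sample data as the initial class centers; next measures the distance from each remaining user in the original sample data to each current class center and assigns each remaining user to the nearest class, completing the division into K classes; then recalculates the class center of each class; and finally repeats the classification and new class center calculation until each new class center equals the previous class center or the amount of change is less than a specified threshold, at which point iteration stops and the K classes as currently divided are the required user level classification.
Compared with the prior art, the present invention divides user levels automatically, making the division process more intelligent and automated; the resulting classification is of high quality, efficient, and reliable, saves labor cost, and gives a good user experience.
(2) In the present invention, after the user features are selected, each selected feature is normalized; this prevents differing units among the selected features from distorting the classification result and so improves the accuracy of the user level division.
(3) Compared with the traditional K-means clustering algorithm, the K-medios clustering on which the present invention is based takes a central member of each class as the class center, so the algorithm is less affected by outliers and the classification is more accurate.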
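Brief Description of the Drawings
FIG. 1 is a flowchart of the method for automatically dividing user levels in an embodiment of the present invention;
FIG. 2 is a structural block diagram of the system for automatically dividing user levels in an embodiment of the present invention.
Detailed Description
The present invention is described in further detail below with reference to the drawings and specific embodiments.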
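Cluster analysis is one of the key problems in data mining and machine learning. It is widely applied in data mining, pattern recognition, decision support, machine learning, image segmentation, and other fields, and is one of the most important data analysis methods. The K-means algorithm is one of the most widely used partition-based hard clustering algorithms and a typical representative of prototype-based objective function clustering methods. It takes some distance from the data points to the prototypes as the objective function to be optimized, and derives the update rule of the iteration by finding the extremum of that function. The K-means algorithm uses Euclidean distance as the similarity measure and seeks the optimal classification corresponding to an initial cluster center vector V, such that the evaluation index J is minimized. The algorithm uses the sum of squared errors criterion as the clustering criterion function.
The present invention instead uses K-medios, an improved variant of K-means clustering. The basic principle of K-medios is the same as that of K-means clustering, but K-means determines each class center by computing the centroid of the class (that is, the mean), whereas K-medios determines it by computing the center of the class (the point in the class that is closest to all other points in the class). Compared with traditional K-means clustering, K-medios takes a central member as the class center, so the algorithm is less affected by outliers and the classification is more accurate.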
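The contrast between the two kinds of class center can be illustrated with a small Python sketch. The function names `mean_center` and `medoid_center` and the toy data are assumptions made for this example; the "closest member" center shown here resembles what the clustering literature usually calls a medoid, and squared Euclidean distance is assumed as the distance measure.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors (assumed measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_center(points):
    """K-means style center: the per-dimension average of the class."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))


def medoid_center(points):
    """Center that is an actual member: the point with the smallest
    total distance to every other point in the class."""
    return min(points, key=lambda p: sum(sq_dist(p, q) for q in points))


# One class with a single outlier at 100.0 (hypothetical data).
cluster = [(1.0,), (2.0,), (3.0,), (100.0,)]
print(mean_center(cluster))    # pulled toward the outlier
print(medoid_center(cluster))  # stays among the typical members
```

In this toy class the mean lands at 26.5, far from any typical member, while the member-based center stays at 3.0.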
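Based on the above improvement and with reference to FIG. 1, an embodiment of the present invention provides a method for automatically dividing user levels. The method is based on the K-medios clustering algorithm and specifically includes the following steps:
Step S1, selecting sample data: select user behavior data within a specified time period as the original sample data, and proceed to step S2.
In practice, the specified time period can be set by the designer as circumstances require. It is typically set to one month, that is, user behavior data from within one month is selected as the original sample data.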
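As an illustration of step S1, the following Python sketch keeps only the behavior records inside a time window and sums them per user. The record layout `(user_id, timestamp, feature, amount)`, the function name `select_sample`, and reading "one month" as 30 days are all assumptions made for this example; the specification does not prescribe a data format.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def select_sample(events, end, days=30):
    """Aggregate behavior records falling in [end - days, end) per user.

    events: iterable of (user_id, timestamp, feature_name, amount) tuples
            (a hypothetical layout chosen for this sketch).
    Returns {user_id: {feature_name: total_amount}}.
    """
    start = end - timedelta(days=days)
    totals = defaultdict(lambda: defaultdict(float))
    for user_id, ts, feature, amount in events:
        if start <= ts < end:  # ignore records outside the window
            totals[user_id][feature] += amount
    return {user: dict(features) for user, features in totals.items()}
```

Users with no records inside the window simply do not appear in the result, so they are not part of the original sample data.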
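Step S2, selecting user features: select at least one user feature from the user behavior data as a dimension for calculating distance. The user features include the user's viewing duration, number of views, number of bullet comments sent, number of free items sent (for example, fish balls), number of free items claimed online (for example, fish balls), number of paid items sent (for example, the amount of shark fins), number of rooms followed, and number of partitions followed. Proceed to step S3.
To prevent differing units among the selected user features from affecting the classification result, a feature value normalization operation is also included after step S2: each selected user feature is normalized using the formula:
Y = (X - MinValue(X)) / (MaxValue(X) - MinValue(X)), where Y is the normalized feature value, X is a user feature value of a given user feature, MinValue(X) is the smallest user feature value of that feature, and MaxValue(X) is the largest; the normalized user feature values all fall within (0, 1].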
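The normalization formula can be sketched in Python as follows. The function name `min_max_normalize` and the handling of a constant feature (where MaxValue equals MinValue and the formula would divide by zero) are assumptions for this example. Note that, applied literally, the formula maps the smallest value of a feature to exactly 0 and the largest to exactly 1.

```python
def min_max_normalize(values):
    """Apply Y = (X - min) / (max - min) to one feature column."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature carries no information, map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical viewing durations in minutes for three users.
watch_minutes = [30, 120, 600]
print(min_max_normalize(watch_minutes))
```

Each feature is normalized separately, so that a feature measured in large units (such as minutes watched) does not dominate one measured in small counts (such as rooms followed).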
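Step S3, determining the number of classes K: determine K, a positive integer, according to how many class levels the users are to be divided into, and proceed to step S4.
Step S4, determining the initial class centers: randomly select K users from the original sample data as the initial class centers, and proceed to step S5.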
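Steps S3 and S4 amount to choosing K and drawing K distinct users at random. A minimal Python sketch, assuming users are identified by IDs and using the hypothetical name `pick_initial_centers` (the optional `seed` parameter is added only to make runs repeatable):

```python
import random


def pick_initial_centers(user_ids, k, seed=None):
    """Randomly choose k distinct users as the initial class centers."""
    user_ids = list(user_ids)
    if not 1 <= k <= len(user_ids):
        raise ValueError("k must be between 1 and the number of users")
    # random.Random(seed).sample draws without replacement.
    return random.Random(seed).sample(user_ids, k)
```

For example, `pick_initial_centers(["a", "b", "c", "d"], 2)` returns two different IDs from the list.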
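Step S5, classification: using the dimensions selected in step S2 (that is, the user features), measure the distance D from each remaining user in the original sample data (that is, each user other than the class centers) to each current class center. The distance is calculated as:
$D = (x_j - \mu_i)^2$
where $x_j$ is the j-th user feature (j is a positive integer) and $\mu_i$ is the class center of the i-th class (i is a positive integer from 1 to K). Each remaining user is assigned to the nearest class, completing the division into K classes, and the process proceeds to step S6. At this point every user belongs to one of the K classes.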
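The assignment in step S5 can be sketched in Python as below. This example assumes that, when several features are selected, the per-feature squared differences are summed into one squared Euclidean distance; that combination rule, the name `assign_to_nearest`, and the sample data are assumptions. For simplicity the sketch assigns every user, so a user who is a class center falls into its own class at distance 0.

```python
def sq_dist(a, b):
    """Sum of (x_j - mu_j)^2 over the selected features (assumed combination)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_to_nearest(users, centers):
    """Group user IDs by their nearest class center.

    users:   {user_id: feature_vector}
    centers: list of K feature vectors
    Returns a list of K lists of user IDs.
    """
    clusters = [[] for _ in centers]
    for user_id, vector in users.items():
        nearest = min(range(len(centers)), key=lambda i: sq_dist(vector, centers[i]))
        clusters[nearest].append(user_id)
    return clusters


users = {"a": (0.1, 0.2), "b": (0.9, 0.8), "c": (0.2, 0.1)}
print(assign_to_nearest(users, [(0.0, 0.0), (1.0, 1.0)]))
```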
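Step S6, calculating new class centers: within the K classes just divided, recalculate the class center of each class, and proceed to step S7.
In practice, step S6 specifically includes the following operations:
Step S601: for each user in each of the current K classes, calculate the sum V of the distances from that user to the other users in the same class; the distance sum is calculated as:
Figure PCTCN2017080777-appb-000003
where $s_i$ denotes a set of user features; then proceed to step S602.
Step S602: for each of the K classes, select the user with the smallest distance sum as the new class center of that class.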
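Steps S601 and S602 can be sketched in Python as follows. Because the formula for V is available only as an image, this example assumes V is the sum of squared Euclidean distances from a user to the other members of the class; the name `new_center` and the sample class are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors (assumed measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def new_center(members):
    """S601: compute each member's distance sum V to the rest of the class.
    S602: return the member whose V is smallest."""
    return min(members, key=lambda u: sum(sq_dist(u, v) for v in members))


# Three nearby users and one outlier (hypothetical normalized features).
members = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (1.0, 1.0)]
print(new_center(members))
```

The distance from a member to itself is 0, so including it in the sum does not change which member wins. The cost of this step grows with the square of the class size, since every pair of members is compared.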
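Step S7: repeat steps S5 and S6 until each new class center equals the previous class center (the class center from before this iteration), that is, remains unchanged, or the amount of change is less than a specified threshold; then stop iterating. The K classes as currently divided are the required user level classification. In this embodiment the specified threshold is 1%, that is, the amount of change is less than 1%.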
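Putting steps S4 to S7 together gives a loop like the Python sketch below. It is an illustration under stated assumptions, not the patented implementation: the specification does not define how the "amount of change" is measured, so this sketch uses the largest squared shift of any class center and a hypothetical `tol` parameter; `max_iter` and `seed` are also additions for safety and repeatability.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_users(points, k, tol=0.0, max_iter=100, seed=0):
    """Iterate assignment (S5) and center update (S6) until centers settle (S7)."""
    centers = random.Random(seed).sample(points, k)           # S4
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]                     # S5: assign
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        new_centers = [                                       # S6: update
            min(m, key=lambda u: sum(sq_dist(u, v) for v in m)) if m else centers[i]
            for i, m in enumerate(clusters)
        ]
        # S7: assumed change measure, the largest squared center shift.
        shift = max(sq_dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, clusters


points = [(0.0,), (0.1,), (0.2,), (0.9,), (1.0,), (1.1,)]
centers, clusters = cluster_users(points, k=2)
print(sorted(centers), [len(c) for c in clusters])
```

On this toy data the two groups of three users are recovered, with the middle user of each group as its class center. In general the result depends on the random initial centers, so different seeds may give different divisions.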
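With reference to FIG. 2, an embodiment of the present invention further provides a system for automatically dividing user levels. The system includes a sample data selection module, a user feature selection module, an initial class center determination module, a classification module, a new class center calculation module, and an iterative operation module.
The sample data selection module is configured to select user behavior data within a specified time period as the original sample data;
the user feature selection module is configured to select at least one user feature from the user behavior data as a dimension for calculating distance;
the initial class center determination module is configured to determine the number of classes K, a positive integer, according to how many class levels the users are to be divided into, and to randomly select K users from the original sample data as the initial class centers;
the classification module is configured to measure, using the dimensions selected by the user feature selection module, the distance D from each remaining user in the original sample data to each current class center, calculated as:
$D = (x_j - \mu_i)^2$
where $x_j$ is the j-th user feature, j is a positive integer, $\mu_i$ is the class center of the i-th class, and i is a positive integer from 1 to K; each remaining user is assigned to the nearest class, completing the division into K classes;
the new class center calculation module is configured to recalculate the class center of each of the K classes just divided;
the iterative operation module is configured to repeatedly invoke the classification module and the new class center calculation module to perform the classification and new class center calculation until each new class center equals the previous class center or the amount of change is less than a specified threshold, then stop iterating; the K classes as currently divided are the required user level classification.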
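Likewise, to prevent differing units among the selected user features from affecting the classification result, the system also includes a feature value normalization module. This module normalizes each selected user feature by Y = (X - MinValue(X)) / (MaxValue(X) - MinValue(X)), where Y is the normalized feature value, X is a user feature value of a given user feature, MinValue(X) is the smallest user feature value of that feature, and MaxValue(X) is the largest; the normalized user feature values all fall within (0, 1].
It should be noted that the division into functional modules described above is only an example of how the system of this embodiment operates. In practice the functions may be allocated to different functional modules as needed, that is, the internal structure of the system may be divided into different functional modules to perform all or part of the functions described above.
The present invention is not limited to the above embodiments. A person of ordinary skill in the art may make improvements and refinements without departing from the principle of the present invention, and these are also regarded as falling within the scope of protection of the present invention.
Content not described in detail in this specification belongs to the prior art known to those skilled in the art.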

Claims (10)

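  1. A method for automatically dividing user levels, characterized in that the method comprises the following steps:
    Step S1, selecting sample data: select user behavior data within a specified time period as the original sample data, and proceed to step S2;
    Step S2, selecting user features: select at least one user feature from the user behavior data as a dimension for calculating distance, and proceed to step S3;
    Step S3, determining the number of classes K: determine K, a positive integer, according to how many class levels the users are to be divided into, and proceed to step S4;
    Step S4, determining the initial class centers: randomly select K users from the original sample data as the initial class centers, and proceed to step S5;
    Step S5, classification: using the dimensions selected in step S2, measure the distance D from each remaining user in the original sample data to each current class center; assign each remaining user to the nearest class, completing the division into K classes, and proceed to step S6;
    Step S6, calculating new class centers: within the K classes just divided, recalculate the class center of each class, and proceed to step S7;
    Step S7: repeat steps S5 and S6 until each new class center equals the previous class center or the amount of change is less than a specified threshold, then stop iterating; the K classes as currently divided are the required user level classification.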
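  2. The method for automatically dividing user levels according to claim 1, characterized in that the user features in step S2 include the user's viewing duration, number of views, number of bullet comments sent, number of free items sent, number of free items claimed online, number of paid items sent, number of rooms followed, and number of partitions followed.
  3. The method for automatically dividing user levels according to claim 1, characterized in that a feature value normalization operation is also included after step S2:
    each selected user feature is normalized using the formula:
    Y = (X - MinValue(X)) / (MaxValue(X) - MinValue(X)),
    where Y is the normalized feature value, X is a user feature value of a given user feature, MinValue(X) is the smallest user feature value of that feature, and MaxValue(X) is the largest; the normalized user feature values all fall within (0, 1].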
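  4. The method for automatically dividing user levels according to claim 1, characterized in that in step S5 the distance D is calculated as:
    $D = (x_j - \mu_i)^2$
    where $x_j$ is the j-th user feature, j is a positive integer, $\mu_i$ is the class center of the i-th class, and i is a positive integer from 1 to K.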
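  5. The method for automatically dividing user levels according to claim 1, characterized in that step S6 specifically includes the following operations:
    Step S601: for each user in each of the current K classes, calculate the sum V of the distances from that user to the other users in the same class; the distance sum is calculated as:
    Figure PCTCN2017080777-appb-100001
    where $x_j$ is the j-th user feature, j is a positive integer, $\mu_i$ is the class center of the i-th class, i is a positive integer from 1 to K, and $s_i$ denotes a set of user features; then proceed to step S602;
    Step S602: for each of the K classes, select the user with the smallest distance sum as the new class center of that class.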
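  6. A system for automatically dividing user levels, characterized in that the system comprises a sample data selection module, a user feature selection module, an initial class center determination module, a classification module, a new class center calculation module, and an iterative operation module;
    the sample data selection module is configured to select user behavior data within a specified time period as the original sample data;
    the user feature selection module is configured to select at least one user feature from the user behavior data as a dimension for calculating distance;
    the initial class center determination module is configured to determine the number of classes K, a positive integer, according to how many class levels the users are to be divided into, and to randomly select K users from the original sample data as the initial class centers;
    the classification module is configured to measure, using the dimensions selected by the user feature selection module, the distance D from each remaining user in the original sample data to each current class center, and to assign each remaining user to the nearest class, completing the division into K classes;
    the new class center calculation module is configured to recalculate the class center of each of the K classes just divided;
    the iterative operation module is configured to repeatedly invoke the classification module and the new class center calculation module to perform the classification and new class center calculation until each new class center equals the previous class center or the amount of change is less than a specified threshold, then stop iterating; the K classes as currently divided are the required user level classification.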
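  7. The system for automatically dividing user levels according to claim 6, characterized in that the user features include the user's viewing duration, number of views, number of bullet comments sent, number of free items sent, number of free items claimed online, number of paid items sent, number of rooms followed, and number of partitions followed.
  8. The system for automatically dividing user levels according to claim 6, characterized in that the system also includes a feature value normalization module, which normalizes each selected user feature using the formula:
    Y = (X - MinValue(X)) / (MaxValue(X) - MinValue(X)),
    where Y is the normalized feature value, X is a user feature value of a given user feature, MinValue(X) is the smallest user feature value of that feature, and MaxValue(X) is the largest; the normalized user feature values all fall within (0, 1].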
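  9. The system for automatically dividing user levels according to claim 6, characterized in that the classification module measures the distance D as:
    $D = (x_j - \mu_i)^2$
    where $x_j$ is the j-th user feature, j is a positive integer, $\mu_i$ is the class center of the i-th class, and i is a positive integer from 1 to K.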
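  10. The system for automatically dividing user levels according to claim 6, characterized in that the new class center calculation module recalculates the class center of each class as follows: for each user in each of the current K classes, it calculates the sum V of the distances from that user to the other users in the same class; the distance sum is calculated as:
    Figure PCTCN2017080777-appb-100002
    where $x_j$ is the j-th user feature, j is a positive integer, $\mu_i$ is the class center of the i-th class, i is a positive integer from 1 to K, and $s_i$ denotes a set of user features; for each of the K classes, the user with the smallest distance sum is selected as the new class center of that class.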
PCT/CN2017/080777 2016-07-08 2017-04-17 一种用户等级自动划分方法及系统 WO2018006631A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610537520.5 2016-07-08
CN201610537520.5A CN106202388B (zh) 2016-07-08 2016-07-08 一种用户等级自动划分方法及系统

Publications (1)

Publication Number Publication Date
WO2018006631A1 true WO2018006631A1 (zh) 2018-01-11

Family

ID=57473935

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/080777 WO2018006631A1 (zh) 2016-07-08 2017-04-17 一种用户等级自动划分方法及系统

Country Status (2)

Country Link
CN (1) CN106202388B (zh)
WO (1) WO2018006631A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202388B (zh) * 2016-07-08 2017-12-08 武汉斗鱼网络科技有限公司 一种用户等级自动划分方法及系统
CN110874609B (zh) * 2018-09-04 2022-08-16 武汉斗鱼网络科技有限公司 基于用户行为的用户聚类方法、存储介质、设备及系统
CN109413459B (zh) * 2018-09-30 2020-10-16 武汉斗鱼网络科技有限公司 一种直播平台中用户的推荐方法以及相关设备
CN111127056A (zh) * 2018-10-31 2020-05-08 北京国双科技有限公司 用户等级划分方法及装置
CN111966951A (zh) * 2020-07-06 2020-11-20 东南数字经济发展研究院 一种基于社交电商交易数据的用户群体阶层划分方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101477552A (zh) * 2009-02-03 2009-07-08 辽宁般若网络科技有限公司 网站用户等级划分方法
CN102222092A (zh) * 2011-06-03 2011-10-19 复旦大学 一种MapReduce平台上的海量高维数据聚类方法
US20140244664A1 (en) * 2013-02-25 2014-08-28 Telefonaktiebolaget L M Ericsson (Publ) Method and Apparatus For Determining Similarity Information For Users of a Network
CN104102649A (zh) * 2013-04-07 2014-10-15 阿里巴巴集团控股有限公司 一种对网站用户进行分级的方法和装置
CN106202388A (zh) * 2016-07-08 2016-12-07 武汉斗鱼网络科技有限公司 一种用户等级自动划分方法及系统

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8874727B2 (en) * 2010-05-31 2014-10-28 The Nielsen Company (Us), Llc Methods, apparatus, and articles of manufacture to rank users in an online social network
CN105281925B (zh) * 2014-06-30 2019-05-14 腾讯科技(深圳)有限公司 网络业务用户群组划分的方法和装置
CN104992182A (zh) * 2015-06-29 2015-10-21 北京京东尚科信息技术有限公司 一种判断用户级别的方法和装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SUN, YANHUA: "jilyu2 ju4lei4 de wang3luo4 yong4hu4 xing2wei2 fenlxil", CHINA MASTER'S THESES FULL-TEXT DATABASE (INFORMATION SCIENCE & TECHNOLOGY, 15 January 2012 (2012-01-15), ISSN: 1674-0246 *

Also Published As

Publication number Publication date
CN106202388B (zh) 2017-12-08
CN106202388A (zh) 2016-12-07

Similar Documents

Publication Publication Date Title
WO2018006631A1 (zh) 一种用户等级自动划分方法及系统
CN108921206B (zh) 一种图像分类方法、装置、电子设备及存储介质
TWI677852B (zh) 一種圖像特徵獲取方法及裝置、電子設備、電腦可讀存儲介質
CN105022761B (zh) 群组查找方法和装置
WO2020098606A1 (zh) 节点分类方法、模型训练方法、装置、设备及存储介质
WO2021169445A1 (zh) 信息推荐方法、装置、计算机设备及存储介质
CN110826618A (zh) 一种基于随机森林的个人信用风险评估方法
CN110879981A (zh) 人脸关键点质量评估方法、装置、计算机设备及存储介质
WO2021189830A1 (zh) 样本数据优化方法、装置、设备及存储介质
CN116596095B (zh) 基于机器学习的碳排放量预测模型的训练方法及装置
CN110990576A (zh) 基于主动学习的意图分类方法、计算机设备和存储介质
CN109214444B (zh) 基于孪生神经网络和gmm的游戏防沉迷判定系统及方法
CN111833175A (zh) 基于knn算法的互联网金融平台申请欺诈行为检测方法
CN110348516B (zh) 数据处理方法、装置、存储介质及电子设备
CN114417095A (zh) 一种数据集划分方法及装置
CN117478390A (zh) 一种基于改进密度峰值聚类算法的网络入侵检测方法
CN112508363A (zh) 基于深度学习的电力信息系统状态分析方法及装置
CN108984630B (zh) 复杂网络中节点重要性在垃圾网页检测中的应用方法
CN111683141A (zh) 一种面向用户需求的动态QoS服务选择方法及其系统
CN116468102A (zh) 刀具图像分类模型剪枝方法、装置、计算机设备
CN115292303A (zh) 数据处理方法及装置
WO2018040561A1 (zh) 数据处理方法、装置及系统
CN111652733B (zh) 基于云计算和区块链的金融信息管理系统
Mishra et al. Efficient intelligent framework for selection of initial cluster centers
CN107766373A (zh) 图片所属类目的确定方法及其系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17823444

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17823444

Country of ref document: EP

Kind code of ref document: A1