WO2017215346A1 - Service data classification method and apparatus - Google Patents

Service data classification method and apparatus

Info

Publication number
WO2017215346A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
service
data set
business
service data
Prior art date
Application number
PCT/CN2017/081387
Other languages
English (en)
French (fr)
Inventor
闫强
王晓
葛胜利
李爱华
Original Assignee
北京京东尚科信息技术有限公司
北京京东世纪贸易有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京京东尚科信息技术有限公司, 北京京东世纪贸易有限公司
Priority to US16/310,301 (patent US11023534B2)
Publication of WO2017215346A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23211 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with adaptive number of clusters
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection

Definitions

  • The present invention relates to the field of data processing, and in particular to a service data classification method and apparatus.
  • Data clustering is a conventional technique used in data mining and data classification.
  • Business data usually carries business attributes, but conventional clustering methods treat the problem purely at the data level and ignore the business meaning of the data. For example, records with particularly high or very low business indicators are excluded as outliers and no longer participate in clustering, even though these discarded outliers may be data with higher business value.
  • As a result, the data with higher business value cannot reflect its unique value, which reduces the accuracy of service data classification.
  • One technical problem to be solved by embodiments of the present invention is how to make the results of business data classification more accurate.
  • A service data classification method includes: acquiring service data, where the service data includes multiple service indicators; extracting from the service data, according to a set extraction rule, the part of the service data whose category attribute meets a preset condition to form a first data set, where the extraction rule is set according to some of the service indicators; taking the unextracted service data as a second data set and clustering the service data in the second data set; and determining the service data classification result according to the clustering result of the second data set and the first data set.
  • Extracting the part of the service data whose category attribute meets the preset condition to form the first data set includes: extracting, according to a set threshold, the service data whose value for a given service indicator is an outlier, to form the first data set; or extracting, according to the result of a logical operation on multiple service indicators of the service data, the service data for which that result is an outlier, to form the first data set; or extracting, according to the data distribution of a given service indicator, the service data whose value for that indicator is an outlier, to form the first data set.
  • Determining the service data classification result according to the clustering result of the second data set and the first data set includes: merging one or more classes of the first data set with the class of the second data set whose category attribute is closest, to obtain the service data classification result.
  • Merging the classes of the first data set with the classes of the second data set whose category attributes are closest includes: determining, for the classes in the first data set and the second data set, the average feature of the service indicators involved in the extraction rule, and merging each class of the first data set with the class of the second data set whose average feature is closest, to obtain the service data classification result, where the average feature is the mean or center point, per class, of the service indicators involved in the extraction rule; or merging each class of the first data set with the class of the second data set whose category attribute is closest, such that the difference in the number of service data records between the classes in the classification result falls within a preset range.
  • Extracting the part of the service data whose category attribute meets the preset condition to form the first data set includes: extracting parts of the service data according to different set extraction rules to form different classes, which together form the first data set.
  • Before clustering the service data in the second data set, the method further includes: filtering the outlier service indicator data out of the second data set and assigning to it the service indicator boundary value used to determine whether service indicator data is an outlier; or calculating the mean of all non-null data for a given service indicator in the second data set and assigning that mean to the null values of that indicator in the second data set.
  • Clustering the service data in the second data set includes: pre-clustering the service data in the second data set for each predetermined number of clusters and calculating the silhouette coefficient (rendered "contour coefficient" in this translation) of the pre-clustering result for each predetermined number of clusters; arranging the silhouette coefficients in order of increasing number of clusters, obtaining the local maxima among them, and determining the global maximum; taking as the actual number of clusters the number of pre-clusters corresponding to the first local maximum that meets a preset condition, where the preset condition is that the difference between the global maximum and the local maximum is smaller than a preset value; and clustering the service data in the second data set with the actual number of clusters.
  • A service data classification apparatus includes: a service data acquisition module, configured to acquire service data, where the service data includes multiple service indicators; a service data extraction module, configured to extract from the service data, according to a set extraction rule, the part of the service data whose category attribute meets a preset condition to form a first data set, where the extraction rule is set according to some of the service indicators; a service data clustering module, configured to take the unextracted service data as a second data set and cluster the service data in the second data set; and a service data classification module, configured to determine the service data classification result according to the clustering result of the second data set and the first data set.
  • The service data extraction module includes at least one of the following units: a first extraction unit, configured to extract, according to a set threshold, the service data whose value for a given service indicator is an outlier, to form the first data set; a second extraction unit, configured to extract, according to the result of a logical operation on multiple service indicators of the service data, the service data for which that result is an outlier, to form the first data set; and a third extraction unit, configured to extract, according to the data distribution of a given service indicator, the service data whose value for that indicator is an outlier, to form the first data set.
  • The service data classification module is configured to merge one or more classes in the first data set with the class in the second data set whose category attribute is closest, to obtain the service data classification result.
  • The service data classification module includes an average feature acquisition unit and a merging unit. The average feature acquisition unit is configured to determine, for the classes in the first data set and the second data set, the average feature of the service indicators involved in the extraction rule, where the average feature is the mean or center point, per class, of those indicators; the merging unit is configured to merge each class of the first data set with the class of the second data set whose average feature is closest, to obtain the service data classification result. Alternatively, the service data classification module is configured to merge each class in the first data set with the class in the second data set whose category attribute is closest, such that the difference in the number of service data records between the classes in the classification result falls within a preset range.
  • The service data extraction module is configured to extract parts of the service data according to different set extraction rules to form different classes, which together form the first data set.
  • The apparatus further includes a pre-processing module. The pre-processing module includes: an outlier processing unit, configured to filter outlier service indicator data out of the second data set and assign to it the service indicator boundary value used to determine whether service indicator data is an outlier; and/or a null value processing unit, configured to calculate the mean of all non-null data for a given service indicator in the second data set and assign that mean to the null values of that indicator in the second data set.
  • The service data clustering module includes: a pre-clustering unit, configured to pre-cluster the service data in the second data set for each predetermined number of clusters; a silhouette coefficient (rendered "contour coefficient" in this translation) calculation unit, configured to calculate the silhouette coefficient of the pre-clustering result for each predetermined number of clusters; an actual cluster number determining unit, configured to arrange the silhouette coefficients in order of increasing number of clusters, obtain the local maxima among them, determine the global maximum, and take as the actual number of clusters the number of pre-clusters corresponding to the first local maximum that meets a preset condition, where the preset condition is that the difference between the global maximum and the local maximum is smaller than a preset value; and an actual clustering unit, configured to cluster the service data in the second data set with the actual number of clusters.
  • A service data classification apparatus includes: a memory; and a processor coupled to the memory, the processor being configured to perform any one of the foregoing service data classification methods based on instructions stored in the memory.
  • A non-transitory computer-readable storage medium stores a computer program that, when executed by a processor, implements any one of the foregoing service data classification methods.
  • The invention sets an extraction rule according to some of the service indicators of the service data, extracts the service data with a clear category attribute according to that rule, and then determines the service data classification result jointly from the clustering result of the unextracted service data and the service data with a clear category attribute, which improves the accuracy of service data classification.
  • FIG. 1 is a flow chart of an embodiment of a service data classification method according to the present invention.
  • FIG. 2 is a flow chart of an embodiment of a service data clustering method of the present invention.
  • FIG. 3 is a structural diagram of an embodiment of a service data classification apparatus according to the present invention.
  • FIG. 4 is a structural diagram of another embodiment of a service data classification device of the present invention.
  • FIG. 5 is a structural diagram of still another embodiment of a service data classification apparatus according to the present invention.
  • Figure 6 is a structural diagram of still another embodiment of the service data classification device of the present invention.
  • The present invention addresses the problem that conventional clustering methods treat the task purely at the data level and ignore the business meaning of the data, so that data with higher business value cannot reflect its unique value and the accuracy of service data classification suffers.
  • FIG. 1 is a flow chart of an embodiment of a service data classification method according to the present invention. As shown in FIG. 1, the method of this embodiment includes:
  • Step S102: acquire service data, where the service data includes multiple service indicators.
  • The relevant service indicators may be set according to the purpose of the service classification, and the corresponding service data is then acquired.
  • The service data has multiple dimensions, and each dimension is a service indicator related to the purpose of the service classification.
  • For example, if the purpose of a service classification is to divide users into levels according to their activity, that is, to classify users by activity level, the service data may include service indicators that reflect user activity over a recent period, such as PV (page views), total order amount, total order quantity, continuous PV time, number of items collected, and registration time.
  • Data extraction and indicator calculation can be performed with ETL (Extract-Transform-Load) technology.
  • The conditions on the service indicators to be acquired can be defined in the WHERE clause of an SQL statement, and the calculation result is inserted into the target table structure.
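The ETL step can be illustrated with a minimal sketch that uses Python's built-in `sqlite3` module. The table names `page_views` and `user_indicators`, the column names, and the date cutoff are all hypothetical and chosen only for illustration; the text does not specify a schema or a database.

```python
import sqlite3

# In-memory database with a hypothetical source table and target table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page_views (user_id INTEGER, view_date TEXT);
    CREATE TABLE user_indicators (user_id INTEGER PRIMARY KEY, pv INTEGER);
    INSERT INTO page_views VALUES
        (1, '2017-03-01'), (1, '2017-03-02'), (1, '2016-01-15'),
        (2, '2017-03-05');
""")

# The WHERE clause restricts the indicator to a recent period; the
# aggregated result is inserted into the target table.
conn.execute("""
    INSERT INTO user_indicators (user_id, pv)
    SELECT user_id, COUNT(*)
    FROM page_views
    WHERE view_date >= '2017-01-01'
    GROUP BY user_id
""")

# Read the PV indicator back as {user_id: pv}.
indicators = dict(conn.execute("SELECT user_id, pv FROM user_indicators"))
```

Here the 2016 page view of user 1 is filtered out by the WHERE condition, so only recent activity contributes to the PV indicator.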
  • Identification information may optionally be added to a service indicator to mark whether the indicator value is positively or negatively correlated with the service purpose. For example, when users are classified by activity level, a larger PV indicates higher activity, whereas a longer interval between adjacent logins indicates lower activity. Adding identification information helps when setting and applying the extraction rules in subsequent steps.
  • Step S104: extract from the service data, according to a set extraction rule, the part of the service data whose category attribute meets a preset condition to form a first data set; the extraction rule is set according to some of the service indicators.
  • The service data contains records whose category attribute is determinate, that is, the values of some of their service indicators directly determine the business category to which they belong.
  • Such data is often so large or so small on one or more indicators that it would be judged an outlier.
  • For example, one user's PV is small, but the total order quantity is very high and the total order amount is relatively low; that is, the user purchases frequently, but the purchased goods are low-priced fast-moving consumer goods. Another user has a very high value on one indicator and a very long continuous PV time, but a total order amount that is average or small compared with other users. From the data point of view, both users have indicators that are extremely large or very small.
  • The present invention departs from this convention: it extracts the part of the service data whose category attribute is determinate and uses it in the subsequent service data classification.
  • Step S106: take the unextracted service data as the second data set and cluster the service data in the second data set.
  • One implementation of forming the second data set is as follows. Let all the original service data form the data set O and let the first data set be D; the second data set can then be obtained by the following SQL statement:
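A minimal sketch of such a query, run through Python's built-in `sqlite3` module. It assumes that O and D are tables sharing a hypothetical key column `id`; the column names and sample rows are illustrative and the query is one possible formulation of "O minus D", not necessarily the statement used in the filing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE O (id INTEGER PRIMARY KEY, pv INTEGER);  -- all service data
    CREATE TABLE D (id INTEGER PRIMARY KEY, pv INTEGER);  -- first data set
    INSERT INTO O VALUES (1, 10), (2, 5000), (3, 12), (4, 1);
    INSERT INTO D VALUES (2, 5000), (4, 1);
""")

# Second data set: every record of O that was not extracted into D.
second_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM O WHERE id NOT IN (SELECT id FROM D) ORDER BY id"
    )
]
```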
  • The second data set may be clustered with, for example, a clustering algorithm such as K-means, BIRCH, or OPTICS. Taking the K-means algorithm as an example, the clustering process is as follows:
  • Step 1: randomly select the initial centers of k classes, where the value of k is the determined actual number of clusters.
  • Step 4: determine whether the set convergence condition (or stop condition) is reached. If not, return to steps 2-3 and continue iterating; if so, stop iterating. The cluster centers are then the optimal cluster centers, and the resulting classes are the final clustering result.
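The K-means loop can be sketched in plain Python as follows. Steps 2 and 3 are not listed above; the sketch assumes they are the standard K-means assignment step (each record goes to its nearest center) and update step (each center moves to the mean of its members). The function name, the squared Euclidean distance, and the stop condition (assignments no longer change) are likewise assumptions made for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 1: random initial centers
    labels = None
    for _ in range(max_iter):
        # step 2 (assumed): assign each record to its nearest center
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points
        ]
        # step 4: stop once the assignments no longer change
        if new_labels == labels:
            break
        labels = new_labels
        # step 3 (assumed): move each center to the mean of its members
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(points, k=2)
```

In practice a library implementation would normally be preferred; the sketch only shows how the listed steps fit together.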
  • Step S108: determine the service data classification result according to the clustering result of the second data set and the first data set.
  • The first data set can include one or more classes.
  • If the first data set includes one class, the service data in it shares the same category attribute, for example service data with high activity or service data with low activity.
  • If the first data set includes multiple classes, the service data within each class shares the same category attribute.
  • The classes of the first data set can be divided and formed in different ways.
  • The classes in the first data set may be divided according to service requirements, or parts of the service data may be extracted according to different set extraction rules to form different classes, which together form the first data set.
  • For example, the first data set may include a class A, a class B, and a class C, each extracted according to a different extraction rule.
  • Class A is the top 5% of the service data when sorted by order quantity from largest to smallest, class B is the bottom 5% of the service data in the same ordering, and class C is the data whose number of collected items is greater than 200 and whose number of collected stores is greater than 150.
  • The category attribute of class A and class C is high activity, and the category attribute of class B is low activity. In this way, the characteristics of the classes obtained under each extraction rule are retained, and targeted merging can be performed in subsequent steps.
  • the extraction rules are not limited to the above several rules. Other extraction rules can be used as needed, and will not be described here.
  • The classes in the second data set and the first data set are merged to determine the service data classification result according to how close their category attributes are and/or how much the number of service data records differs between classes. For example, one or more classes in the first data set are merged with the class in the second data set whose category attribute is closest. Alternatively, each class in the first data set is merged with the class in the second data set whose category attribute is closest, such that the difference in record counts between the classes in the classification result falls within a preset range. Alternatively, if the record counts of the classes in the two data sets differ only slightly, no merging is needed, and the classes of the two data sets are used directly as the final service data classification result.
  • The extraction rule is set according to some of the service indicators of the service data, the service data with a clear category attribute is extracted according to that rule, and the classification result is determined jointly from the clustering result of the unextracted service data and the service data with a clear category attribute, which improves the accuracy of service data classification.
  • In step S104, several methods may be used to extract the part of the service data whose category attribute meets the preset condition. Three exemplary extraction methods are described below.
  • The first method extracts, according to a set threshold, the service data whose value for a given service indicator is an outlier.
  • One application of the method is to extract the service data whose service indicator exceeds a preset upper threshold and/or falls below a preset lower threshold.
  • For example, the number of consecutive login days is a very intuitive service indicator of user activity, so the upper and lower thresholds can be set according to business requirements.
  • If most users' consecutive login days fall within one month, the users whose consecutive login days exceed 90 days can be extracted in light of the current situation and business requirements; the extracted users are clearly highly active.
  • With this method, the most intuitive high-value service data can be extracted based on the service indicators most relevant to the business objective.
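A minimal sketch of threshold-based extraction, using the 90-day example from the text. The function name, the record layout (a list of dictionaries), and the field name `consecutive_login_days` are assumptions made for illustration.

```python
def extract_by_threshold(records, indicator, lower=None, upper=None):
    """Split records into (first data set, remaining data) by fixed thresholds."""
    first, rest = [], []
    for r in records:
        value = r[indicator]
        is_outlier = (upper is not None and value > upper) or (
            lower is not None and value < lower
        )
        (first if is_outlier else rest).append(r)
    return first, rest


users = [
    {"user": "u1", "consecutive_login_days": 12},
    {"user": "u2", "consecutive_login_days": 120},
    {"user": "u3", "consecutive_login_days": 30},
]
# Users with more than 90 consecutive login days go to the first data set.
first_set, second_set = extract_by_threshold(
    users, "consecutive_login_days", upper=90
)
```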
  • The second method extracts the service data for which the result of a logical operation on multiple service indicators is an outlier.
  • One application of the method is to extract the service data whose logical operation result exceeds a preset upper threshold and/or falls below a preset lower threshold.
  • For example, the total order amount and the order quantity each reflect user activity, and the relationship between them can also reflect user activity. The ratio of the total order amount to the order quantity, that is, each user's average order price, can be calculated for each service data record. If the average order price is very high, for example more than 50,000 yuan, the user can be classified as highly active. In other words, whether the ratio of the total order amount to the order quantity is an outlier among all the service data decides whether the corresponding service data is extracted.
  • This method takes the computational relationship between indicators into account, which allows more flexible extraction and extends the scope of the extraction rules.
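The ratio rule can be sketched by applying the threshold to a value derived from several indicators. The helper names, the field names `order_amount` and `order_count`, and the treatment of users with no orders (ratio taken as 0) are assumptions for illustration only.

```python
def extract_by_derived_indicator(records, derive, upper=None, lower=None):
    """Split records by thresholds applied to a value computed from several indicators."""
    first, rest = [], []
    for r in records:
        value = derive(r)
        is_outlier = (upper is not None and value > upper) or (
            lower is not None and value < lower
        )
        (first if is_outlier else rest).append(r)
    return first, rest


def average_order_price(r):
    # Total order amount divided by order quantity; 0.0 when there are no orders.
    return r["order_amount"] / r["order_count"] if r["order_count"] else 0.0


orders = [
    {"user": "u1", "order_amount": 600000.0, "order_count": 10},  # 60,000 per order
    {"user": "u2", "order_amount": 3000.0, "order_count": 30},    # 100 per order
    {"user": "u3", "order_amount": 0.0, "order_count": 0},
]
high_value, remaining = extract_by_derived_indicator(
    orders, average_order_price, upper=50000
)
```

Passing the derived value as a function keeps the extraction logic independent of which combination of indicators a particular rule uses.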
  • The third method extracts the service data whose value for a given service indicator is an outlier according to the data distribution of that indicator.
  • For example, the mean and variance of all the data for the same service indicator can be calculated; service indicator data lying outside a preset floating range around the mean is determined to be outlier data, and the service data containing it is extracted. The preset floating range can be determined from a preset multiple of the variance. For example, calculate the mean and variance of all the data for the consecutive-login-days indicator, and extract the data whose consecutive login days are greater than mean + 2 × variance or less than mean - 2 × variance.
  • Alternatively, all the data for the same service indicator may be sorted by size, and the service data whose indicator value lies above a preset upper quantile and/or below a preset lower quantile is extracted.
  • For example, data above the 95th percentile or below the 5th percentile is extracted; that is, the smallest 5% and the largest 5% of all data for the same service indicator are extracted.
  • This method uses the distribution of the service data to filter out service indicators with very large or very small values, and suits application scenarios in which it is difficult to set specific numerical thresholds from business conditions.
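Both distribution-based variants can be sketched with the standard `statistics` module. The function names and record layout are hypothetical. The spread follows the text literally and uses the variance (here the population variance); using the standard deviation instead is a common alternative. The quantile variant is implemented by rank, taking the smallest and largest fraction of records, which is an assumption about how the percentiles are applied.

```python
from statistics import mean, pvariance


def extract_by_spread(records, indicator, multiple=2.0):
    """Extract records outside mean +/- multiple * variance of the indicator."""
    values = [r[indicator] for r in records]
    mu, var = mean(values), pvariance(values)
    low, high = mu - multiple * var, mu + multiple * var
    first = [r for r in records if r[indicator] < low or r[indicator] > high]
    rest = [r for r in records if low <= r[indicator] <= high]
    return first, rest


def extract_by_quantile(records, indicator, tail=0.05):
    """Extract the smallest and largest `tail` fraction of records by rank."""
    ordered = sorted(records, key=lambda r: r[indicator])
    cut = int(len(ordered) * tail)
    if cut == 0:
        return [], ordered
    return ordered[:cut] + ordered[-cut:], ordered[cut:-cut]


# Nine users with 3 consecutive login days and one with 5.
logins = [{"days": d} for d in [3] * 9 + [5]]
spread_first, spread_rest = extract_by_spread(logins, "days", multiple=2.0)

# Twenty users with 1..20 days; 5% tails give one record on each side.
ranked = [{"days": d} for d in range(1, 21)]
quant_first, quant_rest = extract_by_quantile(ranked, "days", tail=0.05)
```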
  • In step S108, for example, the following methods can be used to determine the service data classification result.
  • The following is an application example of how to choose, according to the service indicators, which class in the second data set to merge with a class in the first data set: first, determine, for the classes in the first data set and the second data set, the average feature of the service indicators involved in the extraction rule, where the average feature is the mean or center point, per class, of those indicators; then merge each class of the first data set with the class of the second data set whose average feature is closest, to obtain the service data classification result.
  • For example, class D in the first data set is extracted on the basis that its total order amount exceeds 300,000 yuan. When merging, the classes in the second data set are therefore sorted by the mean or center point of their total order amount, and the class ranked highest is the one that can be merged with class D.
  • The method also applies to merging extracted classes whose rules involve multiple indicators.
  • For example, if class E in the first data set is determined by the total order amount divided by the order quantity being greater than 50,000, then the average feature of a class in the second data set is the mean or center point of the total order amount divided by the order quantity over the service data records in that class.
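Choosing the merge target by average feature can be sketched as follows, using the class E example. The function name, the class labels `k0` and `k1`, and the field names are hypothetical; the average feature is taken as the arithmetic mean of the rule's indicator over each class.

```python
from statistics import mean


def closest_class(first_class, second_classes, feature):
    """Name of the second-set class whose mean feature is nearest to first_class's."""
    target = mean(feature(r) for r in first_class)
    averages = {
        name: mean(feature(r) for r in members)
        for name, members in second_classes.items()
    }
    return min(averages, key=lambda name: abs(averages[name] - target))


def price_per_order(r):
    # The indicator used by class E's rule: total order amount / order quantity.
    return r["amount"] / r["count"]


class_e = [{"amount": 120000.0, "count": 2}]  # 60,000 per order
second_classes = {
    "k0": [{"amount": 1000.0, "count": 10}, {"amount": 3000.0, "count": 10}],
    "k1": [{"amount": 90000.0, "count": 3}, {"amount": 80000.0, "count": 2}],
}
merge_target = closest_class(class_e, second_classes, price_per_order)
```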
  • The following is an application example of deciding whether to merge classes: each class in the first data set is merged with the class in the second data set whose category attribute is closest, such that the difference in record counts between the classes in the classification result falls within the preset range. That is, if the difference between the record count of each class in the first data set and that of each class in the second data set does not exceed the preset range, no merging is required; if it exceeds the preset range, the classes with the closest category attributes are merged.
  • This merge condition applies not only between classes of the first data set and classes of the second data set, but also among the classes of the first data set itself: if each class in the first data set holds far fewer records than the classes in the second data set, the classes of the first data set with the same category attribute can be merged with one another so that the resulting classes are more uniform in size.
  • With this method, the service classification result is more uniform and easier to apply.
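The merge decision can be sketched as a simple size comparison. The function name, the class labels, and the reading of "preset range" as a maximum allowed difference in record counts are assumptions made for illustration.

```python
def should_merge(first_sizes, second_sizes, max_diff):
    """True when any first-set class differs in size from any second-set
    class by more than max_diff records."""
    return any(
        abs(a - b) > max_diff
        for a in first_sizes.values()
        for b in second_sizes.values()
    )


second_sizes = {"k0": 900, "k1": 1100}  # hypothetical record counts per class
small_first = {"A": 40, "C": 35}
large_first = {"A": 800}

merge_small = should_merge(small_first, second_sizes, max_diff=500)
merge_large = should_merge(large_first, second_sizes, max_diff=500)
```

With these counts the small extracted classes would be merged, while a class of comparable size would be kept as its own category.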
  • the present invention also provides a method of clustering business data in a second data set.
  • FIG. 2 is a flow chart of one embodiment of a business data clustering method. As shown in FIG. 2, the method of this embodiment includes:
  • Step S2062: preprocess the service data in the second data set.
  • The pre-processing may include one or more of outlier processing, null processing, and normalization processing.
  • One application of outlier processing is as follows: filter the outlier service indicator data out of the second data set and assign to it the service indicator boundary value used to determine whether service indicator data is an outlier.
  • For example, service indicator data greater than the mean + variance of all the data for the same service indicator can be assigned the value mean + variance, and data less than mean - variance can be assigned the value mean - variance. Likewise, data greater than the upper quartile of all the data for the same service indicator can be assigned the upper quartile, and data less than the lower quartile can be assigned the lower quartile.
  • After this pre-processing the data values are relatively uniform, so outlier processing produces a better clustering effect without affecting the accuracy of service data classification.
  • One application of null processing is as follows: calculate the mean of all non-null data for a given service indicator in the second data set and assign that mean to the null values of that indicator in the second data set.
  • Replacing null values with the mean of the same indicator improves the accuracy of clustering.
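The two pre-processing steps can be sketched for a single indicator column. The function names are hypothetical, nulls are represented as `None`, and the outlier boundary follows the mean ± variance example in the text (population variance assumed).

```python
from statistics import mean, pvariance


def cap_outliers(values):
    """Clamp values to [mean - variance, mean + variance]."""
    mu, var = mean(values), pvariance(values)
    low, high = mu - var, mu + var
    return [min(max(v, low), high) for v in values]


def fill_nulls(values):
    """Replace None with the mean of the non-null values."""
    present = [v for v in values if v is not None]
    fill = mean(present)
    return [fill if v is None else v for v in values]


# Mean 3.2, variance 0.36: the value 5 is capped at about 3.56.
capped = cap_outliers([3] * 9 + [5])

# The missing value is replaced by the mean of 1 and 3.
filled = fill_nulls([1, None, 3])
```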
  • Step S2064: determine the actual number of clusters.
  • The actual number of clusters may be specified manually according to business requirements, or determined by the following pre-clustering method.
  • One application of determining the number of clusters by the pre-clustering method is as follows:
  • Let the silhouette coefficient (rendered "contour coefficient" in this translation) of the clustering result for a given number of clusters n be f(n), and let the silhouette coefficient of the i-th data point in the clustering result be S_i. Both are calculated from the quantities a_i and b_i, where:
  • a_i is the average distance from the i-th service data record to the other service data records in its own class; for b_i, first compute, for each class that does not contain the i-th record, the average distance from the record to the service data records in that class, then take the minimum of these averages as b_i.
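For reference, the standard definition of the silhouette coefficient in terms of these quantities is given below. It is assumed here that the filing uses this usual form, with f(n) taken as the mean over all N records clustered into n classes; N is a symbol introduced only for this illustration.

```latex
S_i = \frac{b_i - a_i}{\max(a_i,\, b_i)}, \qquad
f(n) = \frac{1}{N} \sum_{i=1}^{N} S_i
```

Under this definition S_i lies between -1 and 1, and values close to 1 indicate that a record is much closer to its own class than to the nearest other class.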
  • The number of pre-clusters corresponding to the first local maximum that satisfies a preset condition is taken as the actual number of clusters; the preset condition is that the difference between the local maximum and the maximum (the largest of the local maxima) is smaller than a preset value.
  • The silhouette coefficient measures the cohesion within each class of the clustering result and the separation between different classes, so the number of clusters corresponding to the maximum silhouette coefficient is usually taken as the actual number of clusters. For service classification, however, once the silhouette coefficient is sufficiently large the number of clusters should also be kept small, because too many clusters make the classification result hard to present. The present invention therefore selects the local maxima above a certain threshold, for example those greater than the maximum minus 0.1, and takes the smallest number of clusters among the local maxima that satisfy this condition as the actual number of clusters.
  • An application example of determining the actual number of clusters is as follows: first determine whether the first local maximum is the maximum; if so, take the number of clusters corresponding to it as the actual number of clusters; if not, take the number of clusters corresponding to the first local maximum whose difference from the maximum is smaller than the preset value.
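The selection rule can be sketched in Python as follows. This is an illustrative reading of the text, not a definitive implementation: the function name `choose_cluster_count`, the dictionary input, the default tolerance of 0.1 (taken from the example in the text), and the fallback used when the curve has no interior local maximum are all assumptions.

```python
def choose_cluster_count(scores, tolerance=0.1):
    """Pick the actual number of clusters from {cluster_count: silhouette_coefficient}.

    Returns the smallest cluster count whose silhouette coefficient is a local
    maximum lying within `tolerance` of the largest local maximum.
    """
    counts = sorted(scores)
    # A local maximum is a point strictly higher than both of its neighbours.
    peaks = [
        k
        for prev, k, nxt in zip(counts, counts[1:], counts[2:])
        if scores[k] > scores[prev] and scores[k] > scores[nxt]
    ]
    if not peaks:
        # Assumption: with no interior peak, fall back to the best-scoring count.
        return max(counts, key=scores.get)
    best = max(scores[k] for k in peaks)
    for k in peaks:  # peaks are already in increasing order of cluster count
        if best - scores[k] < tolerance:
            return k


# Hypothetical silhouette coefficients for pre-clustering with 2..7 clusters.
silhouettes = {2: 0.40, 3: 0.55, 4: 0.50, 5: 0.62, 6: 0.58, 7: 0.45}
print(choose_cluster_count(silhouettes))
```

In the sample data the peak at 5 clusters is the highest, but the earlier peak at 3 clusters is within 0.1 of it, so the smaller count is preferred.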
  • In step S2066, the second data set is clustered using the actual number of clusters.
  • With the above method, both the service data participating in clustering and the clustering method are optimized, which makes the clustering result more accurate and thereby improves the accuracy of the service data classification.
  • FIG. 3 is a structural diagram of an embodiment of a service data classification apparatus according to the present invention.
  • As shown in FIG. 3, the apparatus of this embodiment includes: a service data acquisition module 32, configured to acquire service data, where the service data includes multiple service indicators;
  • a service data extraction module 34, configured to extract from the service data, according to a set extraction rule, the part of the service data whose category attribute meets a preset condition to form a first data set, where the extraction rule is set according to some of the service indicators;
  • a service data clustering module 36, configured to take the service data that was not extracted as a second data set and cluster the service data in the second data set;
  • and a service data classification module 38, configured to determine a service data classification result according to the clustering result of the second data set and the first data set.
  • The extraction rule is set according to some of the service indicators of the service data, service data with a definite category attribute is extracted according to that rule, and the classification result is then determined jointly from the clustering result of the unextracted service data and the service data with a definite category attribute, which improves the accuracy of the service data classification.
  • The first data set may include one or more classes, and the service data classification module 38 is configured to merge the one or more classes included in the first data set with the classes in the second data set whose category attributes are closest, to obtain the service data classification result. This improves the accuracy of the classification result.
  • The service data extraction module 34 may be configured to extract part of the service data according to each of several different extraction rules to form different classes, and the different classes form the first data set. The characteristics of the classes obtained under each extraction rule are thereby retained, so that targeted merging can be performed in subsequent steps.
  • the service data extraction module 34 of this embodiment may include at least one of a first extraction unit 442, a second extraction unit 444, and a third extraction unit 446.
  • The first extraction unit 442 is configured to extract, according to a set threshold, the service data whose value for a certain service indicator is an outlier, to form the first data set, so that the most intuitive high-value service data can be extracted according to the service indicator most relevant to the business objective.
  • The second extraction unit 444 is configured to extract, according to the result of a logical operation on multiple service indicators of the service data, the service data for which that result is an outlier, to form the first data set. Because it considers the operational relationships between indicators, it can extract data more flexibly and extends the range of extraction rules that can be set.
  • The third extraction unit 446 is configured to extract, according to the data distribution of a certain service indicator, the service data whose value for that indicator is an outlier, to form the first data set. This is suitable for application scenarios in which it is difficult to set a specific numerical threshold according to business conditions.
  • the business data classification module 38 can include an average feature acquisition unit 482 and a merge unit 484.
  • The average feature acquisition unit 482 is configured to determine, for the first data set and the second data set, the average feature of the service indicators involved in the extraction rule, where the average feature is the average value or center point, for each class, of the service indicators involved in the extraction rule.
  • The merging unit 484 is configured to merge the classes of the first data set and the second data set whose average features are closest, to obtain the service data classification result.
  • Alternatively, the service data classification module 38 may be configured to merge each class in the first data set with the class in the second data set whose category attribute is closest to obtain the service data classification result, such that the difference in the amount of service data between the resulting service data categories falls within a preset range.
  • the apparatus may also include a pre-processing module 45 that includes an outlier processing unit 452 and/or a null processing unit 454.
  • The outlier processing unit 452 is configured to screen outlying service indicator data out of the second data set and to assign to it the service indicator boundary value used to decide whether service indicator data is an outlier. Because the data whose service category attribute is already determined has been extracted before clustering, the data pre-processed here is relatively uniform, so outlier processing improves the clustering effect without affecting the accuracy of the service data classification.
  • The null value processing unit 454 is configured to calculate the mean of all non-null data for a given service indicator in the second data set and to assign that mean to the null values of the same service indicator in the second data set.
  • In this way, data items with null values sit at the average level among the data for the same indicator, which improves the accuracy of clustering.
  • the service data clustering module 36 may include a pre-clustering unit 462, a contour coefficient calculating unit 464, an actual clustering number determining unit 466, and an actual clustering unit 468.
  • The pre-clustering unit 462 is configured to pre-cluster the service data in the second data set according to each predetermined number of clusters.
  • The silhouette coefficient calculation unit 464 is configured to calculate, from the pre-clustering results, the silhouette coefficient of the pre-clustering result corresponding to each predetermined number of clusters.
  • The actual cluster number determination unit 466 is configured to arrange the silhouette coefficients in order of increasing predetermined number of clusters, obtain the local maxima among the silhouette coefficients, determine the maximum among them, and take as the actual number of clusters the number of pre-clusters corresponding to the first local maximum that satisfies a preset condition, the preset condition being that the difference between the local maximum and the maximum is smaller than a preset value.
  • The actual clustering unit 468 is configured to cluster the service data in the second data set using the actual number of clusters.
  • With the above method, the clustering result has both good mathematical properties and good usability.
  • FIG. 5 is a structural diagram of still another embodiment of the service data classification apparatus of the present invention.
  • As shown in FIG. 5, the apparatus 500 of this embodiment includes a memory 510 and a processor 520 coupled to the memory 510. The processor 520 is configured to perform the service data classification method of any of the foregoing embodiments based on instructions stored in the memory 510.
  • the memory 510 may include, for example, a system memory, a fixed non-volatile storage medium, or the like.
  • the system memory stores, for example, an operating system, an application, a boot loader, and other programs.
  • FIG. 6 is a structural diagram of still another embodiment of the service data classification device of the present invention.
  • the apparatus 500 of this embodiment includes a memory 510 and a processor 520, and may further include an input/output interface 630, a network interface 640, a storage interface 650, and the like. These interfaces 630, 640, 650 and the memory 510 and the processor 520 can be connected, for example, via a bus 660.
  • the input/output interface 630 provides a connection interface for input and output devices such as a display, a mouse, a keyboard, and a touch screen.
  • Network interface 640 provides a connection interface for various networked devices.
  • the storage interface 650 provides a connection interface for an external storage device such as an SD card or a USB flash drive.
  • Embodiments of the present invention also provide a non-transitory computer-readable storage medium storing a computer program that, when executed by a processor, implements any of the foregoing service data classification methods.
  • Those skilled in the art will appreciate that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the invention may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
  • The computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

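A service data classification method and apparatus, relating to the field of data processing. The service data classification method includes: acquiring service data, the service data including multiple service indicators (S102); extracting from the service data, according to a set extraction rule, the part of the service data whose category attribute meets a preset condition to form a first data set, the extraction rule being set according to some of the service indicators (S104); taking the service data that was not extracted as a second data set and clustering the service data in the second data set (S106); and determining a service data classification result according to the clustering result of the second data set and the first data set (S108). An extraction rule is set according to some of the service indicators of the service data, service data with a definite category attribute is extracted according to that rule, and the classification result is determined jointly from the clustering result of the unextracted service data and the service data with a definite category attribute, which improves the accuracy of service data classification.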

Description

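Service data classification method and apparatus
Technical Field
The present invention relates to the field of data processing, and in particular to a service data classification method and apparatus.
Background
Data clustering is a conventional technique used in data mining and data classification. When classifying service data, however, the data usually carries business attributes, whereas conventional clustering methods approach the task purely at the data level and ignore the business meaning of the data. For example, data with especially high or especially low values for some service indicators is removed as outliers and no longer participates in clustering, even though the removed outliers may be data of high business value.
As a result, when classification follows conventional clustering methods, data of high business value cannot reflect its particular value, which impairs the accuracy of service data classification.
Summary of the Invention
One technical problem to be solved by the embodiments of the present invention is how to make the result of service data classification more accurate.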
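According to a first aspect of the embodiments of the present invention, a service data classification method is provided, including: acquiring service data, the service data including multiple service indicators; extracting from the service data, according to a set extraction rule, the part of the service data whose category attribute meets a preset condition to form a first data set, the extraction rule being set according to some of the service indicators; taking the service data that was not extracted as a second data set, and clustering the service data in the second data set; and determining a service data classification result according to the clustering result of the second data set and the first data set.
In one embodiment, extracting from the service data, according to the set extraction rule, the part of the service data whose category attribute meets the preset condition to form the first data set includes: extracting, according to a set threshold, the service data whose value for a certain service indicator is an outlier, to form the first data set; or extracting, according to the result of a logical operation on multiple service indicators of the service data, the service data for which that result is an outlier, to form the first data set; or extracting, according to the data distribution of a certain service indicator, the service data whose value for that indicator is an outlier, to form the first data set.
In one embodiment, determining the service data classification result according to the clustering result of the second data set and the first data set includes: merging the one or more classes included in the first data set with the classes in the second data set whose category attributes are closest, to obtain the service data classification result.
In one embodiment, merging each class in the first data set with the class in the second data set whose category attribute is closest includes: determining, for the first data set and the second data set, the average feature of the service indicators involved in the extraction rule, and merging the classes of the two data sets whose average features are closest to obtain the service data classification result, where the average feature is the average value or center point, for each class, of the service indicators involved in the extraction rule; or merging each class in the first data set with the class in the second data set whose category attribute is closest to obtain the service data classification result, such that the difference in the amount of service data between the resulting service data categories falls within a preset range.
In one embodiment, extracting from the service data, according to the set extraction rule, the part of the service data whose category attribute meets the preset condition to form the first data set includes: extracting part of the service data according to each of several different set extraction rules to form different classes, the different classes forming the first data set.
In one embodiment, before the service data in the second data set is clustered, the method further includes: screening outlying service indicator data out of the second data set, and assigning to the outlying service indicator data the service indicator boundary value used to decide whether service indicator data is an outlier; or calculating the mean of all non-null data for a given service indicator, and assigning the mean to the null values of that service indicator in the second data set.
In one embodiment, clustering the service data in the second data set includes: pre-clustering the service data in the second data set according to each predetermined number of clusters, and calculating the silhouette coefficient of the pre-clustering result corresponding to each predetermined number of clusters; arranging the silhouette coefficients in order of increasing predetermined number of clusters, obtaining the local maxima among the silhouette coefficients, and determining the maximum among them; taking as the actual number of clusters the number of pre-clusters corresponding to the first local maximum that satisfies a preset condition, the preset condition being that the difference between the local maximum and the maximum is smaller than a preset value; and clustering the service data in the second data set using the actual number of clusters.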
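According to a second aspect of the embodiments of the present invention, a service data classification apparatus is provided, including: a service data acquisition module, configured to acquire service data, the service data including multiple service indicators; a service data extraction module, configured to extract from the service data, according to a set extraction rule, the part of the service data whose category attribute meets a preset condition to form a first data set, the extraction rule being set according to some of the service indicators; a service data clustering module, configured to take the service data that was not extracted as a second data set and cluster the service data in the second data set; and a service data classification module, configured to determine a service data classification result according to the clustering result of the second data set and the first data set.
In one embodiment, the service data extraction module includes at least one of the following units: a first extraction unit, configured to extract, according to a set threshold, the service data whose value for a certain service indicator is an outlier, to form the first data set; a second extraction unit, configured to extract, according to the result of a logical operation on multiple service indicators of the service data, the service data for which that result is an outlier, to form the first data set; and a third extraction unit, configured to extract, according to the data distribution of a certain service indicator, the service data whose value for that indicator is an outlier, to form the first data set.
In one embodiment, the service data classification module is configured to merge the one or more classes included in the first data set with the classes in the second data set whose category attributes are closest, to obtain the service data classification result.
In one embodiment, the service data classification module includes an average feature acquisition unit and a merging unit. The average feature acquisition unit is configured to determine, for the first data set and the second data set, the average feature of the service indicators involved in the extraction rule, where the average feature is the average value or center point, for each class, of the service indicators involved in the extraction rule. The merging unit is configured to merge the classes of the first data set and the second data set whose average features are closest, to obtain the service data classification result. Alternatively, the service data classification module is configured to merge each class in the first data set with the class in the second data set whose category attribute is closest to obtain the service data classification result, such that the difference in the amount of service data between the resulting service data categories falls within a preset range.
In one embodiment, the service data extraction module is configured to extract part of the service data according to each of several different set extraction rules to form different classes, the different classes forming the first data set.
In one embodiment, the apparatus further includes a pre-processing module, which includes: an outlier processing unit, configured to screen outlying service indicator data out of the second data set and assign to it the service indicator boundary value used to decide whether service indicator data is an outlier; and/or a null value processing unit, configured to calculate the mean of all non-null data for a given service indicator in the second data set and assign the mean to the null values of that service indicator in the second data set.
In one embodiment, the service data clustering module includes: a pre-clustering unit, configured to pre-cluster the service data in the second data set according to each predetermined number of clusters; a silhouette coefficient calculation unit, configured to calculate, from the pre-clustering results, the silhouette coefficient of the pre-clustering result corresponding to each predetermined number of clusters; an actual cluster number determination unit, configured to arrange the silhouette coefficients in order of increasing predetermined number of clusters, obtain the local maxima among the silhouette coefficients, determine the maximum among them, and take as the actual number of clusters the number of pre-clusters corresponding to the first local maximum that satisfies a preset condition, the preset condition being that the difference between the local maximum and the maximum is smaller than a preset value; and an actual clustering unit, configured to cluster the service data in the second data set using the actual number of clusters.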
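According to a third aspect of the embodiments of the present invention, a service data classification apparatus is provided, including: a memory; and a processor coupled to the memory, the processor being configured to perform any of the foregoing service data classification methods based on instructions stored in the memory.
According to a fourth aspect of the embodiments of the present invention, a non-transitory computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the program implements any of the foregoing service data classification methods.
The present invention sets an extraction rule according to some of the service indicators of the service data, extracts service data with a definite category attribute according to that rule, and then determines the classification result jointly from the clustering result of the unextracted service data and the service data with a definite category attribute, which improves the accuracy of service data classification.
Other features and advantages of the present invention will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.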
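Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are merely some embodiments of the present invention, and those of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a flowchart of an embodiment of the service data classification method of the present invention.
FIG. 2 is a flowchart of an embodiment of the service data clustering method of the present invention.
FIG. 3 is a structural diagram of an embodiment of the service data classification apparatus of the present invention.
FIG. 4 is a structural diagram of another embodiment of the service data classification apparatus of the present invention.
FIG. 5 is a structural diagram of yet another embodiment of the service data classification apparatus of the present invention.
FIG. 6 is a structural diagram of still another embodiment of the service data classification apparatus of the present invention.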
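Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. The following description of at least one exemplary embodiment is merely illustrative and in no way limits the invention or its application or use. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Unless specifically stated otherwise, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the invention.
It should also be understood that, for ease of description, the dimensions of the parts shown in the drawings are not drawn to actual scale.
Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate they should be regarded as part of the granted specification.
In all examples shown and discussed here, any specific value should be interpreted as merely illustrative and not as a limitation. Other examples of the exemplary embodiments may therefore have different values.
It should be noted that similar reference numerals and letters denote similar items in the following drawings, so once an item is defined in one drawing it need not be discussed further in subsequent drawings.
The present invention is proposed to address the problem that conventional clustering methods approach the task purely at the data level and ignore the business meaning of the data, so that data of high business value cannot reflect its particular value, which impairs the accuracy of service data classification.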
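FIG. 1 is a flowchart of an embodiment of the service data classification method of the present invention. As shown in FIG. 1, the method of this embodiment includes:
Step S102: acquire service data, the service data including multiple service indicators.
In one embodiment, relevant service indicators may be set according to the purpose of the classification and the relevant service data acquired. The service data has multiple dimensions, each of which is a service indicator related to the purpose of the classification.
For example, if the purpose of the classification is to divide users into levels by activity, that is, to classify users as more or less active, the service data may include service indicators that reflect user activity over a recent period, such as PV (Page View), total number of orders, total order amount, continuous PV time, number of favorited products, and registration time.
Data extraction and indicator calculation may be performed with ETL (Extract-Transform-Load) techniques. For example, a WHERE clause in an SQL statement may restrict the conditions on the service indicators to be acquired, and the calculation results are then inserted into the target table structure.
Identification information may also optionally be added to a service indicator to mark whether the magnitude of the indicator value is positively correlated with the business purpose. For example, when classifying users by activity, a larger PV indicates higher user activity, whereas a larger interval between adjacent logins indicates lower activity. Adding identification information helps with setting and using the extraction rules in subsequent steps.
Step S104: extract from the service data, according to a set extraction rule, the part of the service data whose category attribute meets a preset condition to form a first data set, the extraction rule being set according to some of the service indicators.
The service data contains items whose category attribute is already determined, that is, items whose service category can be determined directly from the values of some of their service indicators. Such items often have values that are too large or too small on one or more indicators and are therefore judged to be outliers. For example, one user has a very small PV but a very high total number of orders and a relatively low total order amount; that is, the user purchases frequently but buys low-priced fast-moving consumer goods. Another user has a high PV and a very long continuous PV time, but a total number of orders that is average or below average compared with other users. In terms of the data, both users have some indicator that is extremely large or extremely small. With conventional clustering, clustering the service data that contains these two users directly gives a poor result because of the extreme values; removing the outliers before clustering gives a better clustering result, but the removed data cannot reflect its particular value, which impairs the accuracy of the classification. The present invention therefore departs from convention by extracting the part of the service data whose category attribute is determined and using it in the subsequent service data classification.
Step S106: take the service data that was not extracted as a second data set, and cluster the service data in the second data set.
One way to form the second data set is as follows. Let the data set O consist of all the original service data and let D be the first data set; the second data set can then be obtained with the following SQL statement: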
SELECT *
FROM O
WHERE NOT EXISTS(SELECT NULL FROM D WHERE O.ID=D.ID)
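The second data set may be clustered with a clustering algorithm such as K-means, BIRCH, or OPTICS. Taking K-means as an example, the clustering process is as follows:
1. Randomly select initial centers for k classes, where k is the determined actual number of clusters.
2. For every data point, calculate its distance to each of the k centers and assign the data point to the class of the nearest center.
3. Update the center point of each class.
4. Determine whether the set convergence condition (also called the stopping condition) has been reached. If not, return to steps 2 and 3 and continue iterating; if so, stop iterating. The cluster centers are then the optimal cluster centers and the clustering result is the final clustering result.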
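The four K-means steps can be sketched in plain Python. This is a teaching sketch under assumptions, not the implementation used by the applicants: the function names `kmeans` and `squared_distance`, the squared Euclidean distance, the fixed random seed, and the choice of "assignments no longer change" as the convergence condition are all illustrative.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` into k classes; returns (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 1: random initial centers
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to the class of its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Step 4: stop when the assignment no longer changes (assumed condition).
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centers


# Hypothetical two-indicator service data forming two obvious groups.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centers = kmeans(data, k=2)
print(labels, centers)
```

In practice the indicators would be standardized first so that no single indicator dominates the distance.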
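Step S108: determine the service data classification result according to the clustering result of the second data set and the first data set.
The first data set may include one or more classes.
When the first data set includes one class, all the service data in the first data set is treated as a whole with no further division. In this case the service data in the first data set shares the same category attribute; for example, all of it is high-activity service data or all of it is low-activity service data.
When the first data set includes multiple classes, the service data within one class shares the same category attribute. The classes of the first data set can be divided and formed in different ways. For example, the classes in the first data set may be divided according to business needs, or part of the service data may be extracted according to each of several different extraction rules to form different classes, which together form the first data set.
For example, the first data set may include class A, class B, and class C, each extracted according to a different extraction rule. Class A is the service data in the top 5% when sorted by number of orders in descending order, class B is the service data in the bottom 5% under the same ordering, and class C is the data with more than 200 favorited products and more than 150 favorited shops. Clearly, the category attribute of classes A and C is high activity, and that of class B is low activity. The characteristics of the classes obtained under each extraction rule are thereby retained, so that targeted merging can be performed in subsequent steps. Those skilled in the art will understand that the extraction rules are not limited to those above; other extraction rules may be used as needed and are not described here.
In one embodiment, the classes in the second data set and the first data set may be merged according to how similar their category attributes are and/or how much the amount of service data differs between service data categories, to determine the service data classification result. For example, the one or more classes included in the first data set are merged with the classes in the second data set whose category attributes are closest, to obtain the classification result. Alternatively, each class in the first data set is merged with the class in the second data set whose category attribute is closest, such that the difference in the amount of service data between the resulting categories falls within a preset range. Alternatively, if the amounts of service data in the classes of the two sets differ only slightly, no merging need be performed and the classes of both sets are used directly as the final classification result.
In the above embodiment, an extraction rule is set according to some of the service indicators of the service data, service data with a definite category attribute is extracted according to that rule, and the classification result is then determined jointly from the clustering result of the unextracted service data and the service data with a definite category attribute, which improves the accuracy of service data classification.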
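In step S104, several methods may be used to extract the part of the service data whose category attribute meets the preset condition. Three exemplary extraction methods are described below.
The first method extracts, according to a set threshold, the service data whose value for a certain service indicator is an outlier. In one application example, service data whose service indicator exceeds a preset upper threshold and/or falls below a preset lower threshold is extracted.
For example, the number of consecutive login days is a very intuitive service indicator of user activity, so an upper threshold and a lower threshold can be set according to business requirements. When most users have consecutive login days within one month, this situation can be combined with business requirements to extract the users with more than 90 consecutive login days; these users are clearly highly active.
In this way, the most intuitive high-value service data can be extracted according to the service indicator most relevant to the business objective.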
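The threshold rule can be illustrated with a short Python sketch. The function name `extract_by_threshold`, the dictionary record layout, and the field name `login_days` are hypothetical; only the 90-day upper threshold comes from the example in the text.

```python
def extract_by_threshold(records, indicator, upper=None, lower=None):
    """Split records into (extracted, remaining) using optional upper/lower thresholds."""
    extracted, remaining = [], []
    for record in records:
        value = record[indicator]
        is_outlier = (upper is not None and value > upper) or (
            lower is not None and value < lower
        )
        (extracted if is_outlier else remaining).append(record)
    return extracted, remaining


# Hypothetical users with their consecutive login days.
users = [
    {"id": 1, "login_days": 5},
    {"id": 2, "login_days": 120},
    {"id": 3, "login_days": 20},
]
first_set, second_set = extract_by_threshold(users, "login_days", upper=90)
print([u["id"] for u in first_set], [u["id"] for u in second_set])
```

Returning both parts keeps the extracted records for the later merge step while the remainder goes on to clustering.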
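The second method extracts, according to the result of a logical operation on multiple service indicators of the service data, the service data for which that result is an outlier. In one application example, service data whose logical operation result exceeds a preset upper threshold and/or falls below a preset lower threshold is extracted.
Take the total order amount and the number of orders as an example. Each of them can reflect user activity on its own, but the relationship between them can also reflect it. For example, the ratio of total order amount to number of orders can be calculated for each service data item, giving each user's average price per order; if the average price per order is very high, for example above 50,000 yuan, the user can be placed in the high-activity category. In other words, whether the ratio of total order amount to number of orders is an outlier among all the service data determines whether the corresponding service data is extracted.
This method considers the operational relationships between indicators, allows data to be extracted more flexibly, and extends the range of extraction rules that can be set.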
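A derived-indicator rule of this kind can be sketched as follows. The function name `extract_by_ratio`, the field names, and the treatment of users with zero orders (ratio taken as 0) are assumptions made for the sketch; the 50,000 threshold mirrors the example in the text.

```python
def extract_by_ratio(records, numerator, denominator, upper):
    """Extract records whose numerator/denominator ratio exceeds `upper`."""
    extracted, remaining = [], []
    for record in records:
        # Assumption: a record with a zero denominator gets a ratio of 0.
        ratio = record[numerator] / record[denominator] if record[denominator] else 0.0
        (extracted if ratio > upper else remaining).append(record)
    return extracted, remaining


# Hypothetical users with total order amount (yuan) and number of orders.
users = [
    {"id": 1, "order_total": 600000, "order_count": 10},
    {"id": 2, "order_total": 3000, "order_count": 30},
    {"id": 3, "order_total": 0, "order_count": 0},
]
high_value, rest = extract_by_ratio(users, "order_total", "order_count", upper=50000)
print([u["id"] for u in high_value])
```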
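The third method extracts, according to the data distribution of a certain service indicator, the service data whose value for that indicator is an outlier.
For example, the mean and variance of all data for the same service indicator can be calculated; service indicator data outside a preset floating range centered on the mean is determined to be outlying, and the service data with outlying indicator data is extracted, where the preset floating range may be determined as a preset multiple of the variance. For example, the mean and variance of all data for the consecutive login days indicator are calculated, and the data whose consecutive login days is greater than the mean + 2 × variance or less than the mean - 2 × variance is extracted.
As another example, all data for the same service indicator can be arranged in order of magnitude, and the service data whose indicator value lies above a preset upper quantile and/or below a preset lower quantile is extracted. For example, data greater than the 95th percentile or less than the 5th percentile is extracted, that is, the smallest 5% and the largest 5% of all data for the same service indicator.
This method uses the distribution of the service data to screen out service indicators with extremely large or extremely small values, and is suitable for application scenarios in which it is difficult to set a specific numerical threshold according to business conditions.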
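Both distribution-based variants can be sketched in Python. The names `extract_by_spread` and `extract_by_rank` are hypothetical. The first function uses the variance exactly as the text words it; many practitioners would use the standard deviation instead, which is a different rule and is not what the text states. The second function approximates the percentile rule by rank, which is an assumption about how ties and small samples are handled.

```python
from statistics import mean, pvariance


def extract_by_spread(records, indicator, multiple=2):
    """Extract records outside mean +/- multiple * variance (variance as worded in the text)."""
    values = [r[indicator] for r in records]
    m, var = mean(values), pvariance(values)
    low, high = m - multiple * var, m + multiple * var
    return [r for r in records if r[indicator] < low or r[indicator] > high]


def extract_by_rank(records, indicator, percent=5):
    """Extract the smallest and largest `percent` percent of records by rank."""
    ordered = sorted(records, key=lambda r: r[indicator])
    n = len(ordered) * percent // 100
    if n == 0:
        return []
    return ordered[:n] + ordered[-n:]


# Hypothetical data: twenty users whose login days are 0..19.
users = [{"id": i, "days": i} for i in range(20)]
print([u["id"] for u in extract_by_rank(users, "days")])
```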
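In step S108, the service data classification result may be determined, for example, by the following methods.
The following is an application example of how to select, according to service indicators, the class in the second data set to be merged with a class in the first data set. First, determine, for the first data set and the second data set, the average feature of the service indicators involved in the extraction rule, where the average feature is the average value or center point, for each class, of the service indicators involved in the extraction rule. Then merge the classes of the first data set and the second data set whose average features are closest to obtain the service data classification result.
For example, class D in the first data set was extracted because its total order amount is greater than 300,000 yuan. When merging, the classes in the second data set are therefore sorted in descending order by the average value or center point of the total order amount, and the first class in the sorted result is the one that can be merged with class D. The method also applies to merging classes extracted by a rule that involves multiple indicators. For example, if class E in the first data set was determined by the total order amount divided by the number of orders being greater than 50,000, the average feature of a class in the second data set is the average value or center point of the total order amount divided by the number of orders over the service data in that class.
Calculating the average feature of each class makes it possible to determine objectively which class is closest in category attribute to the class to be merged, which improves the accuracy of the merge.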
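The merge by average feature can be sketched as below. The function names, the use of a callable `feature` to compute the indicator (or derived indicator) named in the extraction rule, and the absolute difference as the closeness measure are assumptions for illustration.

```python
def average_feature(cluster, feature):
    """Mean of the rule's indicator (or derived indicator) over one class."""
    return sum(feature(record) for record in cluster) / len(cluster)


def closest_cluster(extracted_class, clusters, feature):
    """Index of the cluster whose average feature is nearest to the extracted class."""
    target = average_feature(extracted_class, feature)
    return min(
        range(len(clusters)),
        key=lambda i: abs(average_feature(clusters[i], feature) - target),
    )


def merge_into_closest(extracted_class, clusters, feature):
    """Return new clusters with the extracted class appended to its closest cluster."""
    index = closest_cluster(extracted_class, clusters, feature)
    merged = [list(cluster) for cluster in clusters]  # leave the inputs untouched
    merged[index].extend(extracted_class)
    return merged


# Hypothetical class D (order total above 300,000) and three clusters.
class_d = [{"total": 400000}, {"total": 350000}]
clusters = [
    [{"total": 100}, {"total": 200}],
    [{"total": 5000}, {"total": 8000}],
    [{"total": 90000}, {"total": 120000}],
]
result = merge_into_closest(class_d, clusters, lambda r: r["total"])
print([len(c) for c in result])
```

For a rule on a derived indicator, such as order total divided by number of orders, only the `feature` callable changes.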
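The following method is an application example of deciding whether to merge classes: each class in the first data set is merged with the class in the second data set whose category attribute is closest to obtain the classification result, such that the difference in the amount of service data between the resulting categories falls within a preset range. That is, if the differences between the amounts of service data in the classes of the first data set and in the classes of the second data set do not exceed the preset range, no merging is needed; if they exceed the preset range, the classes with the closest category attributes are merged.
The condition for the merging operation applies not only between classes of the first data set and classes of the second data set, but also among the classes of the first data set itself. That is, if every class in the first data set is much smaller than the classes in the second data set, the classes in the first data set that have the same category attribute may be merged with one another so that the resulting data is evenly distributed.
Deciding whether to merge according to the difference in data volume between classes makes the classification result more even and more practical.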
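One way to express the size check is sketched below. The text does not define the "preset range", so expressing it as a maximum size ratio (`max_ratio`) relative to the largest cluster of the second data set is purely an assumption, as is the function name `classes_to_merge`.

```python
def classes_to_merge(first_set, second_set, max_ratio=10):
    """Indices of first-set classes that are too small compared with the largest cluster.

    A class is flagged when its size times `max_ratio` is still below the size
    of the largest cluster in the second data set (assumed form of the preset range).
    """
    largest = max(len(cluster) for cluster in second_set)
    return [i for i, cls in enumerate(first_set) if len(cls) * max_ratio < largest]


# Hypothetical class sizes: 5 and 50 extracted records, clusters of 100 and 120.
first_set = [[0] * 5, [0] * 50]
second_set = [[0] * 100, [0] * 120]
print(classes_to_merge(first_set, second_set))
```

An empty result means the class sizes are already balanced enough and no merging is needed.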
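The present invention also provides a method for clustering the service data in the second data set.
FIG. 2 is a flowchart of an embodiment of the service data clustering method. As shown in FIG. 2, the method of this embodiment includes:
Step S2062: pre-process the service data in the second data set.
The pre-processing may include one or more of outlier processing, null-value processing, and standardization.
An application example of outlier processing is as follows: screen outlying service indicator data out of the second data set, and assign to it the service indicator boundary value used to decide whether service indicator data is an outlier. For example, service indicator values greater than the mean + variance of all data for the same service indicator may be assigned the mean + variance, and values less than the mean - variance may be assigned the mean - variance. Alternatively, values greater than the upper quartile of all data for the same service indicator may be assigned the upper quartile, and values less than the lower quartile may be assigned the lower quartile. In addition, values greater than the mean + variance may be assigned the upper quartile, and values less than the mean - variance may be assigned the lower quartile.
Because the data whose service category attribute is determined has already been extracted before clustering, the data pre-processed here is relatively uniform; outlier processing therefore improves the clustering effect without affecting the accuracy of service data classification.
An application example of null-value processing is as follows: calculate the mean of all non-null data for a given service indicator in the second data set, and assign that mean to the null values of the same service indicator in the second data set. Data items with null values then sit at the average level among the data for the same indicator, which improves the accuracy of clustering.
Data that has undergone outlier processing, null-value processing, and similar operations may additionally be standardized or normalized. In one application example of standardization, for a given service indicator, the mean and variance of all its data are first calculated, and each original value of the indicator is then replaced by (original value - mean) / variance, which equalizes the weights of the service indicators of the service data participating in clustering.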
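The standardization step can be sketched as below. The function name `standardize` and the handling of a constant column (all zeros) are assumptions. The divisor is the variance, exactly as the text states; the more common z-score divides by the standard deviation, which would be a different transformation from the one described.

```python
from statistics import mean, pvariance


def standardize(values):
    """Replace each value by (value - mean) / variance, as worded in the text."""
    m, var = mean(values), pvariance(values)
    if var == 0:
        # Assumption: a constant indicator carries no information, so map it to zeros.
        return [0.0 for _ in values]
    return [(v - m) / var for v in values]


# Hypothetical indicator column.
page_views = [1, 3]
print(standardize(page_views))
```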
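Those skilled in the art may also use other data pre-processing methods as needed, which are not described here.
Step S2064: determine the actual number of clusters.
Before clustering, the actual number of clusters may be specified manually according to business requirements, or determined by the following pre-clustering method. An application example of determining the number of clusters by pre-clustering is as follows:
1. Pre-cluster the service data in the second data set according to each predetermined number of clusters, and calculate the silhouette coefficient of the pre-clustering result corresponding to each predetermined number of clusters.
Let N be the total number of service data items participating in clustering. When n is used as the predetermined number of clusters, the silhouette coefficient of the clustering result is f(n) and the silhouette coefficient of the i-th data point in the clustering result is $S_i$. The silhouette coefficients of the clustering result and of a data point are calculated according to formulas (1) and (2), respectively:
$$f(n)=\frac{1}{N}\sum_{i=1}^{N} S_i \tag{1}$$
$$S_i=\frac{b_i-a_i}{\max(a_i,\,b_i)} \tag{2}$$
where $a_i$ is the average distance from the i-th service data item to every service data item in its own class; for $b_i$, first compute, for each class that does not contain the i-th item, the average distance from the i-th item to every item in that class, and then take the minimum of these averages as $b_i$.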
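Formulas (1) and (2) can be evaluated with a few lines of Python. This is a sketch under assumptions: the function name `silhouette` is hypothetical, distances are Euclidean, $a_i$ is averaged over the other members of the point's own class (the usual convention, which the text does not spell out), a point that is alone in its class contributes 0, and at least two classes must be present.

```python
import math


def silhouette(points, labels):
    """Return f(n): the mean of S_i = (b_i - a_i) / max(a_i, b_i) over all points."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # assumption: a singleton contributes S_i = 0
        # a_i: mean distance to the other members of the same class.
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        # b_i: smallest mean distance to the members of any other class.
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        denominator = max(a, b)
        if denominator > 0:
            total += (b - a) / denominator
    return total / len(points)


# Hypothetical data: two tight groups far apart.
data = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette(data, [0, 0, 1, 1]))
```

Running this for each predetermined number of clusters gives the curve from which the actual number of clusters is chosen.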
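2. Arrange the silhouette coefficients in order of increasing predetermined number of clusters, obtain the local maxima among the silhouette coefficients, and determine the maximum among them.
This step can be described intuitively with a coordinate system. With the predetermined number of clusters on the horizontal axis and the silhouette coefficient on the vertical axis, connect the data points of the clustering results in order; the silhouette coefficient of any data point whose ordinate is greater than those of its two neighbors is a local maximum, and the largest of the local maxima is also the maximum over all data points.
3. Take as the actual number of clusters the number of pre-clusters corresponding to the first local maximum that satisfies a preset condition, the preset condition being that the difference between the local maximum and the maximum is smaller than a preset value.
The silhouette coefficient measures the cohesion within each class of the clustering result and the separation between different classes, so the number of clusters corresponding to the maximum silhouette coefficient is usually taken as the actual number of clusters. For service classification, however, once the silhouette coefficient is sufficiently large the number of clusters should also be kept small, because too many clusters make the classification result hard to present. The present invention therefore selects the local maxima above a specific threshold, for example those greater than the maximum minus 0.1, and takes the smallest number of clusters among the local maxima that satisfy this condition as the actual number of clusters.
An application example of determining the actual number of clusters is as follows: first determine whether the first local maximum is the maximum; if so, take the number of clusters corresponding to the maximum as the actual number of clusters; if not, take the number of clusters corresponding to the first local maximum whose difference from the maximum is smaller than the preset value.
Step S2066: cluster the second data set using the actual number of clusters.
With the above method, both the service data participating in clustering and the clustering method are optimized, which makes the clustering result more accurate and thereby improves the accuracy of service data classification.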
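FIG. 3 is a structural diagram of an embodiment of the service data classification apparatus of the present invention. As shown in FIG. 3, the apparatus of this embodiment includes: a service data acquisition module 32, configured to acquire service data, the service data including multiple service indicators; a service data extraction module 34, configured to extract from the service data, according to a set extraction rule, the part of the service data whose category attribute meets a preset condition to form a first data set, the extraction rule being set according to some of the service indicators; a service data clustering module 36, configured to take the service data that was not extracted as a second data set and cluster the service data in the second data set; and a service data classification module 38, configured to determine a service data classification result according to the clustering result of the second data set and the first data set.
An extraction rule is set according to some of the service indicators of the service data, service data with a definite category attribute is extracted according to that rule, and the classification result is then determined jointly from the clustering result of the unextracted service data and the service data with a definite category attribute, which improves the accuracy of service data classification.
The first data set may include one or more classes, and the service data classification module 38 is configured to merge the one or more classes included in the first data set with the classes in the second data set whose category attributes are closest, to obtain the service data classification result. This improves the accuracy of the classification result.
The service data extraction module 34 may be configured to extract part of the service data according to each of several different set extraction rules to form different classes, the different classes forming the first data set. The characteristics of the classes obtained under each extraction rule are thereby retained, so that targeted merging can be performed in subsequent steps.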
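FIG. 4 is a structural diagram of another embodiment of the service data classification apparatus of the present invention. As shown in FIG. 4, the service data extraction module 34 of this embodiment may include at least one of a first extraction unit 442, a second extraction unit 444, and a third extraction unit 446.
The first extraction unit 442 is configured to extract, according to a set threshold, the service data whose value for a certain service indicator is an outlier, to form the first data set, so that the most intuitive high-value service data can be extracted according to the service indicator most relevant to the business objective.
The second extraction unit 444 is configured to extract, according to the result of a logical operation on multiple service indicators of the service data, the service data for which that result is an outlier, to form the first data set. The second extraction unit 444 considers the operational relationships between indicators, allows data to be extracted more flexibly, and extends the range of extraction rules that can be set.
The third extraction unit 446 is configured to extract, according to the data distribution of a certain service indicator, the service data whose value for that indicator is an outlier, to form the first data set. This is suitable for application scenarios in which it is difficult to set a specific numerical threshold according to business conditions.
The service data classification module 38 may include an average feature acquisition unit 482 and a merging unit 484. The average feature acquisition unit 482 is configured to determine, for the first data set and the second data set, the average feature of the service indicators involved in the extraction rule, where the average feature is the average value or center point, for each class, of the service indicators involved in the extraction rule. The merging unit 484 is configured to merge the classes of the first data set and the second data set whose average features are closest, to obtain the service data classification result. Calculating the average feature of each class makes it possible to determine objectively which class is closest in category attribute to the class to be merged, which improves the accuracy of the merge.
Alternatively, the service data classification module 38 may be configured to merge each class in the first data set with the class in the second data set whose category attribute is closest to obtain the service data classification result, such that the difference in the amount of service data between the resulting service data categories falls within a preset range. Deciding whether to merge according to the difference in data volume between classes makes the classification result more even and more practical.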
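The apparatus may further include a pre-processing module 45, which includes an outlier processing unit 452 and/or a null value processing unit 454.
The outlier processing unit 452 is configured to screen outlying service indicator data out of the second data set and assign to it the service indicator boundary value used to decide whether service indicator data is an outlier. Because the data whose service category attribute is determined has already been extracted before clustering, the data pre-processed here is relatively uniform; outlier processing therefore improves the clustering effect without affecting the accuracy of service data classification.
The null value processing unit 454 is configured to calculate the mean of all non-null data for a given service indicator in the second data set and assign the mean to the null values of that service indicator in the second data set. Data items with null values then sit at the average level among the data for the same indicator, which improves the accuracy of clustering.
The service data clustering module 36 may include a pre-clustering unit 462, a silhouette coefficient calculation unit 464, an actual cluster number determination unit 466, and an actual clustering unit 468. The pre-clustering unit 462 is configured to pre-cluster the service data in the second data set according to each predetermined number of clusters. The silhouette coefficient calculation unit 464 is configured to calculate, from the pre-clustering results, the silhouette coefficient of the pre-clustering result corresponding to each predetermined number of clusters. The actual cluster number determination unit 466 is configured to arrange the silhouette coefficients in order of increasing predetermined number of clusters, obtain the local maxima among the silhouette coefficients, determine the maximum among them, and take as the actual number of clusters the number of pre-clusters corresponding to the first local maximum that satisfies a preset condition, the preset condition being that the difference between the local maximum and the maximum is smaller than a preset value. The actual clustering unit 468 is configured to cluster the service data in the second data set using the actual number of clusters.
With the above method, the clustering result has both good mathematical properties and good usability.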
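FIG. 5 is a structural diagram of yet another embodiment of the service data classification apparatus of the present invention. As shown in FIG. 5, the apparatus 500 of this embodiment includes a memory 510 and a processor 520 coupled to the memory 510. The processor 520 is configured to perform the service data classification method of any of the foregoing embodiments based on instructions stored in the memory 510.
The memory 510 may include, for example, a system memory and a fixed non-volatile storage medium. The system memory stores, for example, an operating system, application programs, a boot loader, and other programs.
FIG. 6 is a structural diagram of still another embodiment of the service data classification apparatus of the present invention. As shown in FIG. 6, the apparatus 500 of this embodiment includes a memory 510 and a processor 520, and may further include an input/output interface 630, a network interface 640, a storage interface 650, and the like. The interfaces 630, 640, and 650, the memory 510, and the processor 520 may be connected, for example, via a bus 660. The input/output interface 630 provides a connection interface for input and output devices such as a display, a mouse, a keyboard, and a touch screen. The network interface 640 provides a connection interface for various networked devices. The storage interface 650 provides a connection interface for external storage devices such as SD cards and USB flash drives.
Embodiments of the present invention also provide a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements any of the foregoing service data classification methods.
Those skilled in the art will appreciate that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the invention may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the invention. It should be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.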

Claims (16)

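  1. A service data classification method, characterized by comprising:
    acquiring service data, the service data comprising multiple service indicators;
    extracting from the service data, according to a set extraction rule, the part of the service data whose category attribute meets a preset condition to form a first data set, the extraction rule being set according to some of the service indicators;
    taking the service data that was not extracted as a second data set, and clustering the service data in the second data set;
    determining a service data classification result according to the clustering result of the second data set and the first data set.
  2. The method according to claim 1, characterized in that extracting from the service data, according to the set extraction rule, the part of the service data whose category attribute meets the preset condition to form the first data set comprises:
    extracting, according to a set threshold, the service data whose value for a certain service indicator is an outlier, to form the first data set;
    or,
    extracting, according to the result of a logical operation on multiple service indicators of the service data, the service data for which the logical operation result is an outlier, to form the first data set;
    or,
    extracting, according to the data distribution of a certain service indicator, the service data whose value for that service indicator is an outlier, to form the first data set.
  3. The method according to claim 1, characterized in that determining the service data classification result according to the clustering result of the second data set and the first data set comprises:
    merging the one or more classes included in the first data set with the classes in the second data set whose category attributes are closest, to obtain the service data classification result.
  4. The method according to claim 3, characterized in that merging each class in the first data set with the class in the second data set whose category attribute is closest comprises:
    determining, for the first data set and the second data set, the average feature of the service indicators involved in the extraction rule, and merging the classes of the first data set and the second data set whose average features are closest to obtain the service data classification result, wherein the average feature is the average value or center point, for each class, of the service indicators involved in the extraction rule;
    or,
    merging each class in the first data set with the class in the second data set whose category attribute is closest to obtain the service data classification result, such that the difference in the amount of service data between the service data categories in the classification result falls within a preset range.
  5. The method according to claim 1, characterized in that extracting from the service data, according to the set extraction rule, the part of the service data whose category attribute meets the preset condition to form the first data set comprises:
    extracting part of the service data from the service data according to each of several different set extraction rules to form different classes, the different classes forming the first data set.
  6. The method according to claim 1, characterized in that, before clustering the service data in the second data set, the method further comprises:
    screening outlying service indicator data out of the second data set, and assigning to the outlying service indicator data the service indicator boundary value used to decide whether service indicator data is an outlier;
    or,
    calculating the mean of all non-null data for a given service indicator in the second data set, and assigning the mean to the null values of that service indicator in the second data set.
  7. The method according to claim 1, characterized in that clustering the service data in the second data set comprises:
    pre-clustering the service data in the second data set according to each predetermined number of clusters, and calculating the silhouette coefficient of the pre-clustering result corresponding to each predetermined number of clusters;
    arranging the silhouette coefficients in order of increasing predetermined number of clusters, obtaining the local maxima among the silhouette coefficients, and determining the maximum among them;
    taking as the actual number of clusters the number of pre-clusters corresponding to the first local maximum that satisfies a preset condition, the preset condition being that the difference between the local maximum and the maximum is smaller than a preset value;
    clustering the service data in the second data set using the actual number of clusters.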
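  8. A service data classification apparatus, characterized by comprising:
    a service data acquisition module, configured to acquire service data, the service data comprising multiple service indicators;
    a service data extraction module, configured to extract from the service data, according to a set extraction rule, the part of the service data whose category attribute meets a preset condition to form a first data set, the extraction rule being set according to some of the service indicators;
    a service data clustering module, configured to take the service data that was not extracted as a second data set and cluster the service data in the second data set;
    a service data classification module, configured to determine a service data classification result according to the clustering result of the second data set and the first data set.
  9. The apparatus according to claim 8, characterized in that the service data extraction module comprises at least one of the following units:
    a first extraction unit, configured to extract, according to a set threshold, the service data whose value for a certain service indicator is an outlier, to form the first data set;
    a second extraction unit, configured to extract, according to the result of a logical operation on multiple service indicators of the service data, the service data for which the logical operation result is an outlier, to form the first data set;
    a third extraction unit, configured to extract, according to the data distribution of a certain service indicator, the service data whose value for that service indicator is an outlier, to form the first data set.
  10. The apparatus according to claim 8, characterized in that the service data classification module is configured to merge the one or more classes included in the first data set with the classes in the second data set whose category attributes are closest, to obtain the service data classification result.
  11. The apparatus according to claim 10, characterized in that
    the service data classification module comprises an average feature acquisition unit and a merging unit; the average feature acquisition unit is configured to determine, for the first data set and the second data set, the average feature of the service indicators involved in the extraction rule, wherein the average feature is the average value or center point, for each class, of the service indicators involved in the extraction rule; the merging unit is configured to merge the classes of the first data set and the second data set whose average features are closest, to obtain the service data classification result;
    or, the service data classification module is configured to merge each class in the first data set with the class in the second data set whose category attribute is closest to obtain the service data classification result, such that the difference in the amount of service data between the service data categories in the classification result falls within a preset range.
  12. The apparatus according to claim 8, characterized in that the service data extraction module is configured to extract part of the service data from the service data according to each of several different set extraction rules to form different classes, the different classes forming the first data set.
  13. The apparatus according to claim 8, characterized by further comprising a pre-processing module, the pre-processing module comprising:
    an outlier processing unit, configured to screen outlying service indicator data out of the second data set and assign to the outlying service indicator data the service indicator boundary value used to decide whether service indicator data is an outlier; and/or,
    a null value processing unit, configured to calculate the mean of all non-null data for a given service indicator in the second data set and assign the mean to the null values of that service indicator in the second data set.
  14. The apparatus according to claim 8, characterized in that the service data clustering module comprises:
    a pre-clustering unit, configured to pre-cluster the service data in the second data set according to each predetermined number of clusters;
    a silhouette coefficient calculation unit, configured to calculate, from the pre-clustering results, the silhouette coefficient of the pre-clustering result corresponding to each predetermined number of clusters;
    an actual cluster number determination unit, configured to arrange the silhouette coefficients in order of increasing predetermined number of clusters, obtain the local maxima among the silhouette coefficients, determine the maximum among them, and take as the actual number of clusters the number of pre-clusters corresponding to the first local maximum that satisfies a preset condition, the preset condition being that the difference between the local maximum and the maximum is smaller than a preset value;
    an actual clustering unit, configured to cluster the service data in the second data set using the actual number of clusters.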
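  15. A service data classification apparatus, characterized by comprising:
    a memory; and
    a processor coupled to the memory, the processor being configured to perform, based on instructions stored in the memory, the service data classification method according to any one of claims 1-7.
  16. A non-transitory computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the service data classification method according to any one of claims 1-7.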
PCT/CN2017/081387 2016-06-15 2017-04-21 业务数据分类方法和装置 WO2017215346A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/310,301 US11023534B2 (en) 2016-06-15 2017-04-21 Classification method and a classification device for service data

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610423480.1A CN106156791B (zh) 2016-06-15 2016-06-15 业务数据分类方法和装置
CN201610423480.1 2016-06-15

Publications (1)

Publication Number Publication Date
WO2017215346A1 true WO2017215346A1 (zh) 2017-12-21

Family

ID=57353212

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/081387 WO2017215346A1 (zh) 2016-06-15 2017-04-21 业务数据分类方法和装置

Country Status (3)

Country Link
US (1) US11023534B2 (zh)
CN (1) CN106156791B (zh)
WO (1) WO2017215346A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109598278A (zh) * 2018-09-20 2019-04-09 阿里巴巴集团控股有限公司 聚类处理方法、装置、电子设备及计算机可读存储介质
CN110348133A (zh) * 2019-07-15 2019-10-18 西南交通大学 一种高速列车三维产品结构技术功效图构建系统及方法

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
BR112014026771B1 (pt) 2012-04-27 2022-03-15 Fisher & Paykel Healthcare Limited Aparelho de umidificação para sistema de umidificação respiratória
CN115671460A (zh) 2013-09-13 2023-02-03 费雪派克医疗保健有限公司 用于加湿系统的连接
CN106156791B (zh) * 2016-06-15 2021-03-30 北京京东尚科信息技术有限公司 业务数据分类方法和装置
WO2018106126A1 (en) 2016-12-07 2018-06-14 Fisher And Paykel Healthcare Limited Sensing arrangements for medical devices
WO2018119882A1 (zh) * 2016-12-29 2018-07-05 中国科学院深圳先进技术研究院 一种宏基因组数据分类方法和装置
CN108255888B (zh) * 2016-12-29 2021-08-17 北京国双科技有限公司 一种数据处理方法及系统
CN106940803B (zh) * 2017-02-17 2018-04-17 平安科技(深圳)有限公司 相关变量识别方法和装置
CN107911232B (zh) * 2017-10-27 2021-04-30 绿盟科技集团股份有限公司 一种确定业务操作规则的方法及装置
CN110007914B (zh) * 2017-12-29 2022-08-19 珠海市君天电子科技有限公司 一种大数据计算方法及装置
CN110162564A (zh) * 2019-05-30 2019-08-23 北京中电普华信息技术有限公司 业务数据处理方法及系统
CN110766591A (zh) * 2019-09-06 2020-02-07 中移(杭州)信息技术有限公司 一种智能业务管理方法、装置、终端及存储介质
CN111401674B (zh) * 2019-12-10 2023-06-23 李福瑞 一种基于大数据的高企信息化管理系统
US11301351B2 (en) * 2020-03-27 2022-04-12 International Business Machines Corporation Machine learning based data monitoring
US11727030B2 (en) * 2020-05-05 2023-08-15 Business Objects Software Ltd. Automatic hot area detection in heat map visualizations
CN112579581B (zh) * 2020-11-30 2023-04-14 贵州力创科技发展有限公司 一种数据分析引擎的数据接入方法及系统
CN113448954B (zh) * 2021-06-29 2024-02-06 平安证券股份有限公司 业务数据执行方法、装置、电子设备及计算机存储介质
CN116304056B (zh) * 2023-04-11 2024-01-30 山西玖邦科技有限公司 一种用于计算机软件开发数据的管理方法

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6735550B1 (en) * 2001-01-16 2004-05-11 University Corporation For Atmospheric Research Feature classification for time series data
EP1455300A2 (en) * 2003-03-05 2004-09-08 Nec Corporation Clustering apparatus, clustering method, and clustering program
CN103810261A (zh) * 2014-01-26 2014-05-21 西安理工大学 一种基于商空间理论的K-means聚类方法
CN104181597A (zh) * 2014-08-31 2014-12-03 电子科技大学 一种基于叠前地震数据的地震相分析方法
CN104699702A (zh) * 2013-12-09 2015-06-10 中国银联股份有限公司 数据挖掘及分类方法
CN106156791A (zh) * 2016-06-15 2016-11-23 北京京东尚科信息技术有限公司 业务数据分类方法和装置

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2003250875A1 (en) * 2002-07-02 2004-01-23 Fraunhofer-Gesellschaft Zur Forderung Der Angewandten Forschung E.V. Method and apparatus for analysing arbitrary objects
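CN102129470A (zh) * 2011-03-28 2011-07-20 中国科学技术大学 Tag clustering method and system
CN102253996B (zh) * 2011-07-08 2013-08-21 北京航空航天大学 Multi-view staged image clustering method
CN103268495B (zh) * 2013-05-31 2016-08-17 公安部第三研究所 Human behavior modeling and recognition method based on prior-knowledge clustering in a computer system
CN104216985B (zh) * 2014-09-04 2017-09-01 深圳供电局有限公司 Method and system for identifying abnormal data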
US10147107B2 (en) * 2015-06-26 2018-12-04 Microsoft Technology Licensing, Llc Social sketches

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6735550B1 (en) * 2001-01-16 2004-05-11 University Corporation For Atmospheric Research Feature classification for time series data
EP1455300A2 (en) * 2003-03-05 2004-09-08 Nec Corporation Clustering apparatus, clustering method, and clustering program
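CN104699702A (zh) * 2013-12-09 2015-06-10 中国银联股份有限公司 Data mining and classification method
CN103810261A (zh) * 2014-01-26 2014-05-21 西安理工大学 K-means clustering method based on quotient space theory
CN104181597A (zh) * 2014-08-31 2014-12-03 电子科技大学 Seismic facies analysis method based on pre-stack seismic data
CN106156791A (zh) * 2016-06-15 2016-11-23 北京京东尚科信息技术有限公司 Service data classification method and apparatus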

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
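CN109598278A (zh) * 2018-09-20 2019-04-09 阿里巴巴集团控股有限公司 Clustering processing method and apparatus, electronic device, and computer-readable storage medium
CN110348133A (zh) * 2019-07-15 2019-10-18 西南交通大学 System and method for constructing a technology-efficacy map of the three-dimensional product structure of high-speed trains
CN110348133B (zh) * 2019-07-15 2022-08-19 西南交通大学 System and method for constructing a technology-efficacy map of the three-dimensional product structure of high-speed trains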

Also Published As

Publication number Publication date
CN106156791A (zh) 2016-11-23
US20190197057A1 (en) 2019-06-27
US11023534B2 (en) 2021-06-01
CN106156791B (zh) 2021-03-30

Similar Documents

Publication Publication Date Title
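WO2017215346A1 (zh) Service data classification method and apparatus
CN108170692B (zh) Hot event information processing method and apparatus
CN105956628B (zh) Data classification method and apparatus for data classification
CN110457577B (zh) Data processing method, apparatus, device, and computer storage medium
TW201537366A (zh) Big data processing method and platform
TWI464604B (zh) Data clustering method and apparatus, data processing apparatus, and image processing apparatus
CN110569922B (zh) Interactive hierarchical clustering implementation method, apparatus, device, and readable storage medium
US20180329963A1 (en) Embedded Analytics and Transactional Data Processing
WO2018228049A1 (zh) Method, apparatus, device, and storage medium for monitoring database performance indicators
CN110245687B (zh) User classification method and apparatus
WO2017148327A1 (zh) Service parameter selection method and related device
WO2021238664A1 (zh) Information collection method and apparatus; attention detection method, apparatus, and system
WO2015180340A1 (zh) Data mining method and apparatus
WO2018090643A1 (zh) Customer classification method, electronic apparatus, and storage medium
CN104391879A (zh) Hierarchical clustering method and apparatus
CN106610977B (zh) Data clustering method and apparatus
WO2014177050A1 (zh) Method and apparatus for clustering documents
CN109214772A (zh) Item recommendation method, apparatus, computer device, and storage medium
US20120328167A1 (en) Merging face clusters
CN112446660A (zh) Service outlet clustering method, apparatus, server, and storage medium
CN107480426A (zh) Self-iterating medical record clustering analysis system
CN108428138A (zh) Customer survival rate analysis apparatus and method based on customer segmentation
CN113705625A (zh) Method, apparatus, and electronic device for identifying households with abnormal subsistence allowance applications
CN108537654B (zh) Method, apparatus, terminal device, and medium for rendering a customer relationship network graph
CN110717787A (zh) User classification method and apparatus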

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application
Ref document number: 17812458; Country of ref document: EP; Kind code of ref document: A1

NENP Non-entry into the national phase
Ref country code: DE

32PN EP: public notification in the EP bulletin as address of the addressee cannot be established
Free format text: NOTING OF LOSS OF RIGHTS (EPO FORM 1205A DATED 03.04.2019)

122 EP: PCT application non-entry in European phase
Ref document number: 17812458; Country of ref document: EP; Kind code of ref document: A1