WO2017133568A1 - Method and apparatus for mining target feature data - Google Patents

Method and apparatus for mining target feature data

Info

Publication number
WO2017133568A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
feature data
frequency
data
filtering
Prior art date
Application number
PCT/CN2017/072404
Other languages
English (en)
French (fr)
Inventor
周俊
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司
Priority to US16/063,755 (US20200272933A1)
Publication of WO2017133568A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks

Definitions

  • The present application relates to the technical field of computer processing, and in particular to a method and a device for mining target feature data.
  • Machine learning is a multidisciplinary field that draws on probability theory, statistics, approximation theory, convex analysis, and algorithm complexity theory. It is mainly used to let artificial intelligence acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance.
  • CTR (click-through rate) estimation requires at least two kinds of data: data about the information itself and data about the user. Assuming all of this data has been collected, it can be used to estimate the likelihood (i.e., the probability) that the user clicks on the information.
  • The information has many features, such as information size, information text, information industry, and information pictures.
  • The user also has many features, such as age, gender, region, occupation, school, and mobile platform.
  • There are also feedback features, such as the real-time CTR of each piece of information.
  • When ID-class features are multiplied (crossed) with other features, the number of features may reach 10 billion or even 100 billion.
  • A frequency threshold is generally set in advance, and all features whose frequency is less than the frequency threshold are filtered out.
  • Filtering features in this one-size-fits-all way may discard a large number of effective features, resulting in a significant drop in the effect of machine learning.
  • In view of the above problems, embodiments of the present application are proposed to provide a method for mining target feature data, and a corresponding device for mining target feature data, that overcome or at least partially solve the above problems.
  • the embodiment of the present application discloses a method for mining target feature data, including:
  • the method further comprises:
  • the specified model is trained using the target feature data.
  • the step of counting feature frequencies for the first feature data comprises:
  • the second working node merges the counted first feature data and feature frequency.
  • the step of filtering the low frequency feature data from the first feature data according to the feature frequency to obtain the second feature data comprises:
  • the step of filtering the low frequency feature data from the first feature data according to the feature frequency to obtain the second feature data comprises:
  • the second working node merges the second feature data and the feature frequency obtained by the filtering.
  • the step of filtering the at least part of the intermediate frequency feature data from the second feature data according to the feature frequency to obtain the target feature data comprises:
  • the step of filtering the at least part of the intermediate frequency feature data from the second feature data according to the feature frequency to obtain the target feature data comprises:
  • the second working node merges the target feature data and the feature frequency obtained by the filtering.
  • the method further comprises:
  • the method further comprises:
  • the first feature probability is a probability that a score of a positive sample in the third test model is greater than a score of a negative sample in the third test model
  • the second feature probability is a probability that a score of the positive sample in the fourth test model is greater than a score of the negative sample in the fourth test model.
  • the embodiment of the present application further discloses an apparatus for mining target feature data, including:
  • a feature frequency statistics module configured to count a feature frequency for the first feature data;
  • a low frequency feature filtering module configured to filter low frequency feature data from the first feature data according to the feature frequency to obtain second feature data;
  • an intermediate frequency feature filtering module configured to filter at least part of the intermediate frequency feature data from the second feature data according to the feature frequency to obtain target feature data.
  • the device further comprises:
  • a model training module configured to train the specified model using the target feature data.
  • the feature frequency statistics module comprises:
  • a first allocation submodule configured to allocate the first feature data to one or more first working nodes;
  • a frequency statistics submodule configured to count, by the first working node, the feature frequency of the first feature data allocated to that first working node;
  • a first transmission submodule configured to transmit, by the first working node, the counted first feature data and feature frequency to the second working node;
  • a first merging submodule configured to merge, by the second working node, the counted first feature data and feature frequency.
  • the low frequency feature filtering module comprises:
  • a low frequency feature determining submodule configured to determine that the first feature data is low frequency feature data when the feature frequency of the first feature data is less than a preset low frequency threshold;
  • a second feature data obtaining submodule configured to filter out that first feature data to obtain the second feature data.
  • the low frequency feature filtering module comprises:
  • a second allocation submodule configured to allocate the first feature data and the feature frequency to one or more first working nodes;
  • a first filtering submodule configured to filter, by the first working node, low frequency feature data from the allocated first feature data according to the allocated feature frequency, to obtain second feature data;
  • a second transmission submodule configured to transmit, by the first working node, the second feature data and the feature frequency obtained by the filtering to the second working node;
  • a second merging submodule configured to merge, by the second working node, the second feature data and the feature frequency obtained by the filtering.
  • the intermediate frequency feature filtering module comprises:
  • a random value configuration submodule configured to configure a random value for the second feature data;
  • an intermediate frequency feature determining submodule configured to determine that the second feature data is intermediate frequency feature data when the product of the feature frequency of the second feature data and the random value is less than a preset intermediate frequency threshold;
  • a target feature data obtaining submodule configured to filter out that second feature data to obtain the target feature data.
  • the intermediate frequency feature filtering module comprises:
  • a third allocation submodule configured to allocate the second feature data and the feature frequency to one or more first working nodes;
  • a second filtering submodule configured to filter, by the first working node, at least part of the intermediate frequency feature data from the allocated second feature data according to the allocated feature frequency, to obtain target feature data;
  • a third transmission submodule configured to transmit, by the first working node, the target feature data and the feature frequency obtained by the filtering to the second working node;
  • a third merging submodule configured to merge, by the second working node, the target feature data and the feature frequency obtained by the filtering.
  • the device further comprises:
  • a first test model training module configured to train the first test model with the first original feature data;
  • a second test model training module configured to train the second test model with the first original feature data from which the features whose feature frequency is less than the first candidate threshold have been filtered out;
  • a test module configured to perform A/B testing on the first test model and the second test model to obtain a first score and a second score;
  • a low frequency threshold determining module configured to confirm that the first candidate threshold is the low frequency threshold when the difference between the first score and the second score is less than a preset first gap threshold.
  • the device further comprises:
  • a third test model training module configured to train the third test model with the second original feature data;
  • a fourth test model training module configured to train the fourth test model with the second original feature data from which the features whose product of feature frequency and random value is less than the second candidate threshold have been filtered out;
  • a probability calculation submodule configured to calculate a first feature probability and a second feature probability;
  • an intermediate frequency threshold determining module configured to confirm that the second candidate threshold is the intermediate frequency threshold when the difference between the first feature probability and the second feature probability is less than a preset second gap threshold;
  • the first feature probability is the probability that the score of a positive sample in the third test model is greater than the score of a negative sample in the third test model;
  • the second feature probability is the probability that the score of the positive sample in the fourth test model is greater than the score of the negative sample in the fourth test model.
  • The embodiments of the present application filter out the low frequency feature data and at least part of the intermediate frequency feature data. The resulting target feature data contains the high frequency feature data and may contain part of the intermediate frequency feature data.
  • Training the model on the target feature data does not substantially affect the performance of the model, so the effect of machine learning is preserved.
  • At the same time, the number of features is greatly reduced, which greatly reduces the number of machines and resources required, shortens training time and speeds up training, and thereby greatly reduces the training cost.
  • FIG. 1 is a flow chart showing the steps of an embodiment of a method for mining target feature data according to the present application
  • FIG. 2 is a structural block diagram of an embodiment of a target feature data mining device of the present application.
  • Referring to FIG. 1, a flow chart of the steps of an embodiment of a method for mining target feature data of the present application is shown, which may specifically include the following steps:
  • Step 101: Count the feature frequency of the first feature data.
  • The source data may be collected from a source such as a website log. The source data is parsed and meaningless information, such as the field “-”, is removed to obtain structured first feature data, such as the user ID, the ID of the product accessed by the user, the access time, and the user behavior (such as clicks, purchases, and reviews).
  • the website log is:
  • the structured first feature data obtained after filtering is:
  • the first feature data may be filtered to obtain target feature data to train the specified model.
  • If the amount of first feature data is small, filtering can be performed on a single computer. If the amount of first feature data is large, filtering can be performed on multiple computers, such as a distributed system, for example Hadoop or ODPS (Open Data Processing Service).
  • A distributed system can refer to a computer system consisting of multiple interconnected processing resources that cooperatively perform the same task under the control of the entire system. These resources can be geographically adjacent or geographically dispersed.
  • Hadoop is described as an embodiment of a distributed system.
  • Hadoop mainly consists of two parts: the Hadoop Distributed File System (HDFS) and the distributed computing framework MapReduce.
  • HDFS is a highly fault-tolerant system that provides high-throughput data access for applications with large data sets.
  • MapReduce is a programming model that extracts the analysis elements from the massive source data and returns the result set.
  • the basic principle can be to divide the large data analysis into small pieces and analyze them one by one, and then summarize the extracted data.
  • In Hadoop, there are two machine roles for executing MapReduce: the JobTracker and the TaskTracker.
  • The JobTracker can be used for scheduling work.
  • The TaskTracker can be used to perform work.
  • A TaskTracker may refer to a processing node of the distributed system, and a processing node may include one or more Map nodes and one or more Reduce nodes.
  • In distributed computing, MapReduce handles the complex problems of parallel programming, such as distributed storage, work scheduling, load balancing, fault-tolerant processing, and network communication.
  • The processing is abstracted into two functions: the map function and the reduce function.
  • The map function can decompose the task into multiple tasks.
  • The reduce function can summarize the results of the decomposed tasks.
  • Each MapReduce task can be initialized as a Job, and each Job can be divided into two phases: the map phase and the reduce phase. These two phases are represented by two functions, the map function and the reduce function.
  • The map function accepts an input of the form <key, value> and generates an intermediate output that is also of the form <key, value>.
  • The reduce function accepts an input of the form <key, (list of values)> and processes the set of values. Each reduce function produces 0 or 1 outputs, and the output of the reduce function is also of the form <key, value>.
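  • The <key, value> contract described above can be sketched in a single process as follows. This is only an illustrative sketch, not Hadoop's API: the record format ("user,product"), the names `map_fn`, `reduce_fn`, and `run_job`, and the in-memory grouping step are all assumptions made for the example.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: take one <key, value> pair and emit intermediate <key, value> pairs.

    Assumed record format (hypothetical): value is "user_id,product_id".
    """
    user_id, product_id = value.split(",")
    yield user_id, product_id


def reduce_fn(key, values):
    """Reduce: take <key, (list of values)> and emit 0 or 1 <key, value> pairs."""
    distinct = set(values)
    if distinct:  # emit nothing when there is nothing to report
        yield key, len(distinct)


def run_job(records):
    """Single-process stand-in for the map phase, grouping, and reduce phase."""
    grouped = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            grouped[k].append(v)  # group intermediate values by key
    output = {}
    for k, vs in grouped.items():
        for out_k, out_v in reduce_fn(k, vs):
            output[out_k] = out_v
    return output


# Hypothetical input: (line number, "user,product")
records = [(0, "userid1,item9"), (1, "userid2,item3"), (2, "userid1,item4")]
result = run_job(records)  # distinct products accessed per user
```

  • In a real cluster the grouping between the two phases is done by the framework; here a dictionary plays that role so that the shape of the two functions stays visible.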
  • For the first feature data collected in advance, the feature frequency, that is, the number of occurrences of the first feature data, may be counted so that filtering can then be performed according to the feature frequency.
  • step 101 may include the following sub-steps:
  • Sub-step S11 the first feature data is allocated to one or more first working nodes
  • In a distributed system, there are first working nodes and second working nodes used for filtering.
  • the first working node is a Map node
  • the second working node is a Reduce node.
  • first feature data allocated on each first working node such as a Map node
  • the first feature data may be represented in the form of a data ID.
  • For example, first working node A is assigned the first feature data userid1, while first working node B is assigned the first feature data userid2 and userid3 and is not assigned userid1.
  • each first working node such as a Map node
  • In the hash remainder (hash(x)%N) allocation method, a hash value is calculated for each first feature data item, the hash value is divided by a specified value N, and the first feature data item is assigned to the first working node (such as a Map node) whose sequence number equals the remainder.
  • The above allocation method is only an example. When implementing the embodiments of the present application, other allocation methods, such as random allocation (random(x)%N), may be set according to actual conditions; the embodiments of the present application do not limit this.
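  • A minimal sketch of the hash remainder allocation, assuming N first working nodes numbered 0 to N-1. The use of `zlib.crc32` as the hash function and the names `assign_node` and `allocate` are assumptions for illustration; the source does not specify a hash function.

```python
import zlib


def assign_node(feature_id: str, num_nodes: int) -> int:
    """Return the sequence number of the first working node for one feature ID.

    crc32 is used because it is stable across runs, unlike Python's
    built-in hash() for strings (assumption: any stable hash would do).
    """
    return zlib.crc32(feature_id.encode("utf-8")) % num_nodes


def allocate(feature_ids, num_nodes):
    """Split a stream of feature IDs into one shard per first working node."""
    shards = [[] for _ in range(num_nodes)]
    for feature_id in feature_ids:
        shards[assign_node(feature_id, num_nodes)].append(feature_id)
    return shards


shards = allocate(["userid1", "userid2", "userid3", "userid1"], num_nodes=2)
```

  • Because the node number depends only on the ID, every occurrence of the same ID lands on the same node, so no ID is split across two first working nodes.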
  • Sub-step S12: the first working node counts the feature frequency of the first feature data allocated to it;
  • Sub-step S13: the first working node transmits the counted first feature data and feature frequency to the second working node;
  • The first working node (such as a Map node) may count the allocated first feature data to obtain the feature frequency and pass the result through to the second working node (such as a Reduce node).
  • A map function may be defined to count the feature frequency of the first feature data.
  • the data format of the statistical result may be (first feature data, feature frequency).
  • Sub-step S14 the first feature data and the feature frequency that have been counted are merged by the second working node.
  • the statistical results of the first working node may be combined to obtain a final result.
  • the data format of the combined result may be (first feature data, feature frequency).
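  • Sub-steps S12 to S14 can be sketched as follows, with each shard standing in for the data allocated to one first working node. The function names and the sample shards are hypothetical, and the sketch runs in one process rather than on a cluster.

```python
from collections import Counter


def map_count(shard):
    """First working node: count the feature frequency within its own shard."""
    return Counter(shard)


def reduce_merge(partial_counts):
    """Second working node: merge per-node results into (feature, frequency)."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # adds counts for features seen on several nodes
    return dict(total)


# Hypothetical shards, one per first working node
shards = [["userid1", "userid2"], ["userid2", "userid3", "userid2"]]
feature_freq = reduce_merge(map_count(shard) for shard in shards)
```

  • The merge adds counts rather than overwriting them, so the result stays correct even if an allocation method places the same feature on more than one node.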
  • Step 102 Filter the low frequency feature data from the first feature data according to the feature frequency to obtain second feature data.
  • The first feature data may be divided into low frequency feature data, intermediate frequency feature data, and high frequency feature data according to the feature frequency.
  • The low frequency feature data may refer to the feature data with the lowest feature frequencies, accounting for a first proportion of the total amount of the first feature data;
  • The intermediate frequency feature data may refer to the feature data whose feature frequency is higher than that of the low frequency feature data and lower than that of the high frequency feature data, accounting for a second proportion of the total amount of the first feature data;
  • The high frequency feature data may refer to the feature data with the highest feature frequencies, accounting for a third proportion of the total amount of the first feature data;
  • If the first feature data includes only low frequency feature data, intermediate frequency feature data, and high frequency feature data, the intermediate frequency feature data can be regarded as the feature data other than the low frequency feature data and the high frequency feature data in the first feature data.
  • The manner of dividing the above feature data is only an example.
  • Other manners of dividing the feature data may be set according to actual conditions, such as ultra-low frequency, low frequency, intermediate frequency, high frequency, and ultra-high frequency feature data; the embodiments of the present application do not limit this.
  • Those skilled in the art may also divide the feature data in other ways according to actual needs, which is likewise not limited by the embodiments of the present application.
  • The low frequency threshold may be pre-trained for filtering the low frequency feature data.
  • When the feature frequency of the first feature data is less than the preset low frequency threshold, the first feature data is determined to be low frequency feature data and is filtered out to obtain the second feature data.
  • The second feature data includes intermediate frequency feature data and high frequency feature data.
  • the low frequency threshold may be set to 3, so that the first feature data f1 is filtered out.
  • The low frequency threshold is not fixed: when the first proportion differs, the low frequency threshold also differs. Those skilled in the art can therefore set the low frequency threshold according to the actual situation, which is not limited in this embodiment.
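  • The low frequency filter amounts to a comparison against the threshold. The sketch below uses the threshold of 3 mentioned above; the feature names and their frequencies are hypothetical values chosen for illustration, not figures from the application.

```python
def filter_low_frequency(feature_freq, low_threshold):
    """Drop every feature whose frequency is less than the low frequency threshold.

    The surviving (feature, frequency) pairs are the second feature data.
    """
    return {
        feature: freq
        for feature, freq in feature_freq.items()
        if freq >= low_threshold
    }


# Hypothetical frequencies: f1 falls below the threshold of 3
feature_freq = {"f1": 2, "f2": 7, "f3": 150}
second_feature_data = filter_low_frequency(feature_freq, low_threshold=3)
```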
  • the low frequency threshold can be trained as follows:
  • Sub-step S21 training the first test model with the first original feature data
  • the so-called first original feature data which is also feature data, has a feature frequency.
  • it may refer to source data that does not filter low-frequency feature data, and includes low-frequency feature data, intermediate frequency feature data, and high-frequency feature data.
  • Sub-step S22: training the second test model with the first original feature data from which the features whose feature frequency is less than the first candidate threshold have been filtered out;
  • The first candidate threshold may be preset as the original low frequency threshold.
  • Filtering out, from the first original feature data, the features whose feature frequency is less than the first candidate threshold is regarded as filtering the low frequency features from the original feature data.
  • The first original feature data with the low frequency features filtered out is used for machine learning to train the second test model.
  • Sub-step S23: performing A/B testing on the first test model and the second test model to obtain a first score and a second score;
  • Sub-step S24: when the difference between the first score and the second score is less than a preset first gap threshold, confirming that the first candidate threshold is the low frequency threshold.
  • A/B testing can be described as follows: for the same target (such as the low frequency threshold), two schemes A and B are prepared (such as the first test model and the second test model). Some users use scheme A and the other users use scheme B, and the users' usage is recorded (such as the first score obtained with the first test model and the second score obtained with the second test model) to determine which scheme better meets the target.
  • The first webpage information (such as advertisement data, news data, etc.) is extracted by using the first test model.
  • The second webpage information (such as advertisement data, news data, etc.) is extracted by using the second test model.
  • The first test model or the second test model is selected for service with a probability of 50% each, that is, either the first webpage information or the second webpage information is displayed.
  • The first click rate of the first webpage information is recorded as the first score.
  • The second click rate of the second webpage information is recorded as the second score.
  • If the difference between the first score and the second score is less than the preset first gap threshold, the first candidate threshold may be taken as the low frequency threshold; otherwise, a new first candidate threshold is selected and training is performed again.
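  • The selection loop in sub-steps S21 to S24 can be sketched as below. The function `choose_threshold`, the order in which candidates are tried, and the stubbed click rates are all assumptions for illustration; in practice `score_with_threshold` would train the second test model and run the A/B test, which is not shown here.

```python
from typing import Callable, Iterable, Optional


def choose_threshold(
    candidates: Iterable[int],
    baseline_score: float,
    score_with_threshold: Callable[[int], float],
    gap_threshold: float,
) -> Optional[int]:
    """Return the first candidate whose score stays close to the baseline.

    baseline_score: first score (model trained on unfiltered features).
    score_with_threshold(c): second score (model trained after filtering with c).
    """
    for candidate in candidates:
        if abs(baseline_score - score_with_threshold(candidate)) < gap_threshold:
            return candidate
    return None  # no candidate was acceptable; new candidates are needed


# Hypothetical click rates standing in for real A/B test results
stub_scores = {10: 0.040, 5: 0.047, 3: 0.0495}
low_threshold = choose_threshold(
    candidates=[10, 5, 3],
    baseline_score=0.050,
    score_with_threshold=stub_scores.__getitem__,
    gap_threshold=0.001,
)
```

  • Trying larger candidates first is a design choice of this sketch: it returns the most aggressive filter that still keeps the score within the gap threshold.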
  • step 102 may include the following sub-steps:
  • Sub-step S31 the first feature data and the feature frequency are allocated to one or more first working nodes
  • In a distributed system, there are first working nodes and second working nodes used for filtering.
  • the first working node is a Map node
  • the second working node is a Reduce node.
  • The first feature data and the feature frequency may be allocated to one or more first working nodes by a hash remainder (hash(x)%N) allocation method, a random allocation method (random(x)%N), or the like.
  • the first feature data may be represented in the form of a data ID.
  • Sub-step S32: the first working node filters the low frequency feature data from the allocated first feature data according to the allocated feature frequency, to obtain second feature data;
  • Sub-step S33: the first working node transmits the second feature data and the feature frequency obtained by the filtering to the second working node;
  • The first working node (such as a Map node) may filter the low frequency features from the allocated first feature data to obtain the second feature data and pass the result through to the second working node (such as a Reduce node).
  • A map function may be defined that determines the first feature data to be low frequency feature data when its feature frequency is less than the preset low frequency threshold, and filters it out.
  • the data format of the filtering result may be (second feature data, feature frequency).
  • When low frequency feature data is filtered out, its feature frequency is filtered out with it; when second feature data is retained, its feature frequency is retained with it.
  • Sub-step S34: the second working node merges the second feature data and the feature frequency obtained by the filtering.
  • the filtering results of the first working node may be combined to obtain a final result.
  • the data format of the combined result may be (second feature data, feature frequency).
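  • Sub-steps S31 to S34 can be sketched as a map-side filter followed by a merge. The sketch assumes that each feature appears on exactly one first working node (as with hash remainder allocation); the function names and sample shards are hypothetical.

```python
def map_filter(shard, low_threshold):
    """First working node: keep (feature, frequency) pairs at or above the threshold."""
    return [(feature, freq) for feature, freq in shard if freq >= low_threshold]


def reduce_merge(filtered_shards):
    """Second working node: merge the per-node filtering results.

    Assumes no feature appears in more than one shard.
    """
    merged = {}
    for shard in filtered_shards:
        merged.update(shard)
    return merged


# Hypothetical shards of (first feature data, feature frequency)
shards = [[("f1", 2), ("f2", 7)], [("f3", 150), ("f4", 1)]]
second_feature_data = reduce_merge(map_filter(shard, 3) for shard in shards)
```

  • Since the filtering decision needs only a feature's own frequency, all the work happens on the first working nodes and the second working node merely collects the survivors.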
  • Step 103 Filter at least part of the intermediate frequency feature data from the second feature data according to the feature frequency to obtain target feature data.
  • Since the intermediate frequency feature data is of some use for model training, in the embodiments of the present application the intermediate frequency feature data can be filtered from the second feature data in a random manner.
  • In addition to the high frequency feature data, the target feature data remaining after filtering may or may not include intermediate frequency feature data.
  • The intermediate frequency threshold is pre-trained for filtering the intermediate frequency feature data.
  • The second feature data may be configured with a random value (i.e., a randomly generated value) by a Poisson distribution or the like.
  • When the product of the feature frequency of the second feature data and the random value is less than the preset intermediate frequency threshold, the second feature data may be determined to be intermediate frequency feature data and is filtered out to obtain the target feature data.
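  • A minimal sketch of the randomized intermediate frequency filter. Note the substitution: the source mentions a Poisson distribution "or the like", whereas this sketch draws a uniform value in [0, 1) from Python's standard library to stay dependency-free. The function name, threshold, and sample frequencies are hypothetical.

```python
import random


def filter_intermediate_frequency(feature_freq, if_threshold, rng):
    """Drop a feature when (feature frequency * random value) < threshold.

    rng is any object with a random() method returning a float; a uniform
    draw is an assumption of this sketch, not the distribution in the source.
    """
    target = {}
    for feature, freq in feature_freq.items():
        random_value = rng.random()  # one random value per feature
        if freq * random_value < if_threshold:
            continue  # treated as intermediate frequency feature data
        target[feature] = freq
    return target


# Hypothetical second feature data and threshold; seeded for repeatability
second_feature_data = {"f2": 7, "f3": 150, "f5": 40}
target_feature_data = filter_intermediate_frequency(
    second_feature_data, if_threshold=10, rng=random.Random(0)
)
```

  • With a uniform draw, a feature with frequency f above the threshold T survives with probability 1 - T/f, so high frequency features are almost always kept while features near the threshold are mostly removed.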
  • The intermediate frequency threshold is not fixed: when the second proportion differs, the intermediate frequency threshold also differs. Those skilled in the art can therefore set the intermediate frequency threshold according to the actual situation, which is not limited in this embodiment.
  • The intermediate frequency threshold can be trained as follows:
  • Sub-step S41: training the third test model with the second original feature data;
  • The so-called second original feature data is also feature data and has a feature frequency.
  • It may refer to source data from which the intermediate frequency feature data has not been filtered, including low frequency feature data, intermediate frequency feature data, and high frequency feature data.
  • Sub-step S42: training the fourth test model with the second original feature data from which the features whose product of feature frequency and random value is less than the second candidate threshold have been filtered out;
  • The second candidate threshold may be preset as the original intermediate frequency threshold.
  • Filtering out, from the second original feature data, the features for which the product of the feature frequency and the random value is less than the second candidate threshold is regarded as filtering the intermediate frequency features from the original feature data.
  • The second original feature data with the intermediate frequency features filtered out is used for machine learning to train the fourth test model.
  • Sub-step S43 calculating a first feature probability and a second feature probability
  • Sub-step S44 when the difference between the first feature probability and the second feature probability is less than a preset second gap threshold, confirm that the second candidate threshold is an intermediate frequency threshold.
  • Test data including positive samples and negative samples may be used to calculate the AUC (Area Under Curve).
  • The AUC value is the area under the ROC (Receiver Operating Characteristic) curve and lies between 0 and 1. It can directly evaluate the quality of a classifier: the larger the AUC value, the better the performance of the classifier.
  • The AUC value is a probability value.
  • The probability that the current classifier ranks a positive sample ahead of a negative sample according to the calculated score is the AUC value.
  • the first feature probability is a probability that the score of the positive sample in the third test model is greater than the score of the negative sample in the third test model;
  • the second feature probability is a probability that the score of the positive sample in the fourth test model is greater than the score of the negative sample in the fourth test model.
  • The Wilcoxon-Mann-Whitney test measures, for any given pair of one positive sample and one negative sample, the probability that the positive sample's score is greater than the negative sample's score.
  • Method 1: among all M×N positive-negative sample pairs (M is the number of positive samples, N is the number of negative samples), count the pairs in which the positive sample's score is greater than the negative sample's score, and divide that count by M×N.
  • Method 2: sort the scores from largest to smallest, let the rank of the sample with the largest score be n (n = M + N), the rank of the sample with the second largest score be n-1, and so on, then add up the ranks of all positive samples.
  • $\mathrm{AUC} = \dfrac{\sum_{i \in \text{positive}} \mathrm{rank}_i - M(M+1)/2}{M \times N}$, that is, the sum of the ranks of all positive samples minus M(M+1)/2, divided by M×N.
  • If the difference between the first feature probability and the second feature probability is less than the preset second gap threshold, the second candidate threshold may be taken as the intermediate frequency threshold; otherwise, a new second candidate threshold is selected and training is performed again.
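  • The two ways of computing the AUC can be sketched as follows, assuming no two samples share the same score (ties would need average ranks, which this sketch does not handle). The function names and the sample scores are hypothetical.

```python
def auc_pairwise(pos_scores, neg_scores):
    """Method 1: fraction of the M*N positive-negative pairs where positive > negative."""
    wins = sum(1 for p in pos_scores for n in neg_scores if p > n)
    return wins / (len(pos_scores) * len(neg_scores))


def auc_rank(pos_scores, neg_scores):
    """Method 2: rank-sum formula; rank 1 = smallest score, rank M+N = largest."""
    labelled = sorted(
        [(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores]
    )
    rank_sum = sum(
        rank for rank, (_, label) in enumerate(labelled, start=1) if label == 1
    )
    m, n = len(pos_scores), len(neg_scores)
    return (rank_sum - m * (m + 1) / 2) / (m * n)


# Hypothetical model scores for 3 positive and 2 negative samples
pos = [0.9, 0.8, 0.4]
neg = [0.7, 0.3]
auc_1 = auc_pairwise(pos, neg)
auc_2 = auc_rank(pos, neg)
```

  • Method 1 costs on the order of M×N comparisons, while Method 2 needs only one sort, which is why the rank form is usually preferred for large test sets.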
  • step 103 may include the following sub-steps:
  • Sub-step S51 the second feature data and the feature frequency are allocated to one or more first working nodes
  • a distributed system there is a first working node and a second working node for filtering.
  • the first working node is a Map node
  • the second working node is a Reduce node.
  • the second feature data and the feature frequency may be allocated to one or more first working nodes by a hash remainder (hash(x)%N) allocation method, a random allocation method (random(x)%N), or the like.
  • the second feature data may be represented in the form of a data ID.
  • Sub-step S52: the first working node filters at least part of the intermediate frequency feature data out of the allocated second feature data according to the allocated feature frequency, to obtain target feature data;
  • Sub-step S53 the target feature data and the feature frequency obtained by the filtering are transmitted by the first working node to the second working node;
  • the first working node may filter the IF feature from the allocated second feature data to obtain the target feature data, and transparently transmit the result to the second working node (such as a Reduce node).
  • for example, the map function is defined so that, when the product of the feature frequency of the second feature data and the random value is less than the preset intermediate frequency threshold, the second feature data is determined to be intermediate frequency feature data and is filtered out.
  • the data format of the filtering result may be (target feature data, feature frequency).
  • because the second feature data and its feature frequency are paired, when intermediate frequency feature data is filtered out its feature frequency is filtered out with it, and the feature frequency of the retained target feature data is retained with it.
  • Sub-step S54: the second working node merges the target feature data and the feature frequency obtained by the filtering.
  • the filtering results of the first working node may be combined to obtain a final result.
  • the data format of the combined result may be (target feature data, feature frequency).
  • the target feature data, from which the low frequency feature data and at least part of the intermediate frequency feature data have been filtered out, may be used to train the specified model, for example an SVM (Support Vector Machine), a logistic regression model, a deep learning DP model, and so on; the embodiment of the present application does not limit this.
  • the number of low frequency feature data and intermediate frequency feature data occupies about 80%-90% of the total number of feature data, and the high frequency feature data accounts for about 10%-20% of the total number of feature data.
  • as for low frequency feature data, its frequency of occurrence is very low, and when the total amount of feature data is large, filtering it out has essentially no effect on the performance of the model.
  • filtering out the low frequency feature data (the weather) or the intermediate frequency feature data (the cover of the book), while retaining the high frequency feature data (the quality of the book) or the intermediate frequency feature data (the cover of the book), has essentially no effect on the performance of the trained book-purchase model.
  • what is captured is the features of the whole population: considering the main features of the population (such as the quality of the book) and filtering out the secondary features (such as the weather) has essentially no effect on the performance of the model.
  • filtering features indiscriminately with a single frequency threshold may remove a large amount of useful feature data (such as intermediate frequency features and even high frequency features), so that the effect of machine learning is significantly reduced.
  • the embodiment of the present application filters out the low frequency feature data and at least part of the intermediate frequency feature data; the obtained target feature data contains high frequency feature data and may contain part of the intermediate frequency feature data.
  • training a model on such target feature data leaves the performance of the model essentially unaffected. While preserving the effect of machine learning, it greatly reduces the number of features, and therefore the number of machines and resources required, shortens training time and increases training speed, and thereby greatly reduces the training cost.
  • Referring to FIG. 2, a structural block diagram of an embodiment of a target feature data mining device of the present application is shown, which may specifically include the following modules:
  • the feature frequency statistics module 201 is configured to collect a feature frequency for the first feature data
  • the low frequency feature filtering module 202 is configured to filter the low frequency feature data from the first feature data according to the feature frequency to obtain second feature data;
  • the IF feature filtering module 203 is configured to filter at least part of the IF feature data from the second feature data according to the feature frequency to obtain target feature data.
  • the apparatus may further include the following modules:
  • a model training module is configured to train the specified model using the target feature data.
  • the feature frequency statistics module 201 may include the following sub-modules:
  • a first allocation submodule configured to allocate first feature data to one or more first working nodes
  • a frequency statistics submodule configured to collect, by the first working node, a feature frequency of the first feature data that is allocated by the first working node;
  • a first transmission submodule configured to transmit, by the first working node, the first feature data and the feature frequency that have been counted to the second working node;
  • the first merging submodule is configured to merge the first feature data and the feature frequency that have been counted by the second working node.
  • the low frequency feature filtering module 202 may include the following submodules:
  • a low frequency feature determining submodule configured to determine that the first feature data is low frequency feature data when a feature frequency of the first feature data is less than a preset low frequency threshold
  • a second feature data obtaining submodule configured to filter the first feature data to obtain second feature data.
  • the low frequency feature filtering module 202 may include the following submodules:
  • a second allocation submodule configured to allocate the first feature data and the feature frequency to one or more first working nodes
  • a first filtering submodule configured to filter, by the first working node, low frequency feature data from the allocated first feature data according to the allocated feature frequency, to obtain second feature data
  • a second transmission submodule configured to transmit, by the first working node, the second feature data and the feature frequency obtained by the filtering to the second working node;
  • a second merging submodule configured to combine the second feature data and the feature frequency obtained by the second working node.
  • the intermediate frequency feature filtering module 203 may include the following submodules:
  • a random number configuration submodule configured to configure a random value for the second feature data
  • the intermediate frequency feature determining submodule is configured to determine, when the product of the feature frequency of the second feature data and the random value is less than a preset intermediate frequency threshold, the second feature data is the intermediate frequency feature data;
  • the target feature data obtaining submodule is configured to filter the second feature data to obtain target feature data.
  • the intermediate frequency feature filtering module 203 may include the following submodules:
  • a third allocation submodule configured to allocate the second feature data and the feature frequency to one or more first working nodes
  • a second filtering submodule configured to filter, by the first working node, at least part of the intermediate frequency feature data out of the allocated second feature data according to the allocated feature frequency, to obtain target feature data;
  • a third transmission submodule configured to transmit the target feature data and the feature frequency obtained by the filtering by the first working node to the second working node;
  • a third merging sub-module configured to combine the target feature data and the feature frequency obtained by the second working node.
  • the apparatus may further include the following modules:
  • a first test model training module configured to train the first test model by using the first original feature data
  • a second test model training module configured to train the second test model with the first original feature data from which feature data whose feature frequency is less than the first candidate threshold has been filtered out
  • a test module configured to perform A/B testing on the first test model and the second test model to obtain a first score and a second score
  • the low frequency threshold determining module is configured to confirm that the first candidate threshold is the low frequency threshold when a difference between the first score and the second score is less than a preset first gap threshold.
  • the apparatus may further include the following modules:
  • a third test model training module configured to train the third test model by using the second original feature data
  • a fourth test model training module configured to train the fourth test model with the second original feature data from which feature data for which the product of the feature frequency and the random value is smaller than the second candidate threshold has been filtered out
  • a probability calculation submodule configured to calculate a first feature probability and a second feature probability
  • An IF threshold determining module configured to confirm that the second candidate threshold is an intermediate frequency threshold when a difference between the first feature probability and the second feature probability is less than a preset second gap threshold;
  • the first feature probability is a probability that a score of a positive sample in the third test model is greater than a score of a negative sample in the third test model
  • the second feature probability is a probability that a score of the positive sample in the fourth test model is greater than a score of the negative sample in the fourth test model.
  • because the device embodiment is substantially similar to the method embodiment, it is described relatively briefly; for the relevant parts, refer to the description of the method embodiment.
  • the embodiments of the present application can be provided as a method, apparatus, or computer program product. Therefore, the embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, embodiments of the present application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) including computer usable program code.
  • the computer device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • the memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory.
  • Memory is an example of a computer readable medium.
  • Computer readable media includes both permanent and non-persistent, removable and non-removable media.
  • Information storage can be implemented by any method or technology. The information can be computer readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape and disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
  • as defined herein, computer readable media does not include transitory computer readable media, such as modulated data signals and carrier waves.
  • Embodiments of the present application are described with reference to flowcharts and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the present application. It will be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions.
  • These computer program instructions can be provided to a processor of a general purpose computer, special purpose computer, embedded processor or other programmable data processing terminal device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • the computer program instructions can also be stored in a computer readable memory that can direct a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture comprising the instruction device.
  • the instruction device implements the functions specified in one or more blocks of the flowchart or in a flow or block of the flowchart.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Fuzzy Systems (AREA)
  • Computational Linguistics (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Algebra (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

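Embodiments of the present application provide a method and an apparatus for mining target feature data. The method includes: counting a feature frequency for first feature data; filtering low-frequency feature data out of the first feature data according to the feature frequency to obtain second feature data; and filtering at least part of the intermediate-frequency feature data out of the second feature data according to the feature frequency to obtain target feature data. The embodiments leave model performance essentially unaffected: while preserving the effect of machine learning, they greatly reduce the number of features, and therefore the number of machines and resources required, shorten training time and increase training speed, and thereby greatly reduce training cost.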

Description

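Method and apparatus for mining target feature data
This application claims priority to Chinese patent application No. 201610082536.1, filed on February 5, 2016 and entitled "Method and apparatus for mining target feature data", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of computer processing, and in particular to a method for mining target feature data and an apparatus for mining target feature data.
Background
Machine learning (ML) is an interdisciplinary field drawing on probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It is mainly used in artificial intelligence to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve performance.
Data and features are two particularly important aspects of machine learning, and they largely determine its effect.
Take estimating the click-through rate (CTR) of a piece of information as an example. CTR estimation requires at least two kinds of data: data about the information itself and data about the user. Assuming all of this data has been collected, it can be used to estimate the likelihood (that is, the probability) that the user clicks on the information.
Information has many features, such as its size, text, industry and images. User data also has many features, such as the user's age, gender, region, occupation, school and mobile platform. In addition, there are feedback features, such as the real-time CTR of each piece of information.
However, improving CTR is a long-term process: users change and the creative content of the information changes, so new features are continually added.
Moreover, a large number of ID-type features are crossed with other features, that is, multiplied with them, which can produce tens of billions or even hundreds of billions of features.
Suppose there are 100,000 ID-type features and 100,000 pieces of information. Crossing the two, that is, multiplying them directly, yields a feature scale of 10 billion.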
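The scale quoted in this example follows from a single multiplication. A minimal check in Python, where the variable names are chosen here purely for illustration:

```python
# Hypothetical variable names; the two counts come from the example in the text.
num_id_features = 100_000   # 100,000 ID-type features
num_items = 100_000         # 100,000 pieces of information

# Crossing every ID-type feature with every piece of information
# produces one crossed feature per (feature, item) pair.
num_crossed_features = num_id_features * num_items
print(f"{num_crossed_features:,}")
```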
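Training on such a massive number of features with machine learning often requires tens of thousands of machines and a large amount of resources, running without interruption for a day or longer. Training is slow and resource consumption is high, so the training cost is extremely high.
At present, to reduce the number of features, a frequency threshold is generally set in advance, and every feature whose frequency is below that threshold is filtered out.
This approach filters features indiscriminately and may remove a large number of useful features, causing the effect of machine learning to drop significantly.
Summary
In view of the above problems, embodiments of the present application are proposed to provide a method for mining target feature data, and a corresponding apparatus for mining target feature data, that overcome or at least partially solve the above problems.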
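To solve the above problems, an embodiment of the present application discloses a method for mining target feature data, including:
counting a feature frequency for first feature data;
filtering low-frequency feature data out of the first feature data according to the feature frequency to obtain second feature data;
filtering at least part of the intermediate-frequency feature data out of the second feature data according to the feature frequency to obtain target feature data.
Preferably, the method further includes:
training a specified model with the target feature data.
Preferably, the step of counting a feature frequency for the first feature data includes:
allocating the first feature data to one or more first working nodes;
counting, by the first working node, the feature frequency of the allocated first feature data;
transmitting, by the first working node, the counted first feature data and feature frequency to a second working node;
merging, by the second working node, the counted first feature data and feature frequency.
Preferably, the step of filtering low-frequency feature data out of the first feature data according to the feature frequency to obtain second feature data includes:
when the feature frequency of the first feature data is less than a preset low-frequency threshold, determining that the first feature data is low-frequency feature data;
filtering out that first feature data to obtain the second feature data.
Preferably, the step of filtering low-frequency feature data out of the first feature data according to the feature frequency to obtain second feature data includes:
allocating the first feature data and the feature frequency to one or more first working nodes;
filtering, by the first working node, low-frequency feature data out of the allocated first feature data according to the allocated feature frequency to obtain second feature data;
transmitting, by the first working node, the second feature data and feature frequency obtained by the filtering to a second working node;
merging, by the second working node, the second feature data and feature frequency obtained by the filtering.
Preferably, the step of filtering at least part of the intermediate-frequency feature data out of the second feature data according to the feature frequency to obtain target feature data includes:
configuring a random value for the second feature data;
when the product of the feature frequency of the second feature data and the random value is less than a preset intermediate-frequency threshold, determining that the second feature data is intermediate-frequency feature data;
filtering out that second feature data to obtain the target feature data.
Preferably, the step of filtering at least part of the intermediate-frequency feature data out of the second feature data according to the feature frequency to obtain target feature data includes:
allocating the second feature data and the feature frequency to one or more first working nodes;
filtering, by the first working node, at least part of the intermediate-frequency feature data out of the allocated second feature data according to the allocated feature frequency to obtain target feature data;
transmitting, by the first working node, the target feature data and feature frequency obtained by the filtering to a second working node;
merging, by the second working node, the target feature data and feature frequency obtained by the filtering.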
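Preferably, the method further includes:
training a first test model with first original feature data;
training a second test model with the first original feature data from which feature data whose feature frequency is less than a first candidate threshold has been filtered out;
performing an A/B test on the first test model and the second test model to obtain a first score and a second score;
when the difference between the first score and the second score is less than a preset first gap threshold, confirming the first candidate threshold as the low-frequency threshold.
Preferably, the method further includes:
training a third test model with second original feature data;
training a fourth test model with the second original feature data from which feature data for which the product of the feature frequency and a random value is less than a second candidate threshold has been filtered out;
calculating a first feature probability and a second feature probability;
when the difference between the first feature probability and the second feature probability is less than a preset second gap threshold, confirming the second candidate threshold as the intermediate-frequency threshold;
wherein the first feature probability is the probability that the score of a positive sample in the third test model is greater than the score of a negative sample in the third test model;
the second feature probability is the probability that the score of a positive sample in the fourth test model is greater than the score of a negative sample in the fourth test model.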
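An embodiment of the present application further discloses an apparatus for mining target feature data, including:
a feature frequency statistics module, configured to count a feature frequency for first feature data;
a low-frequency feature filtering module, configured to filter low-frequency feature data out of the first feature data according to the feature frequency to obtain second feature data;
an intermediate-frequency feature filtering module, configured to filter at least part of the intermediate-frequency feature data out of the second feature data according to the feature frequency to obtain target feature data.
Preferably, the apparatus further includes:
a model training module, configured to train a specified model with the target feature data.
Preferably, the feature frequency statistics module includes:
a first allocation submodule, configured to allocate the first feature data to one or more first working nodes;
a frequency statistics submodule, configured to count, by the first working node, the feature frequency of the allocated first feature data;
a first transmission submodule, configured to transmit, by the first working node, the counted first feature data and feature frequency to a second working node;
a first merging submodule, configured to merge, by the second working node, the counted first feature data and feature frequency.
Preferably, the low-frequency feature filtering module includes:
a low-frequency feature determining submodule, configured to determine that the first feature data is low-frequency feature data when the feature frequency of the first feature data is less than a preset low-frequency threshold;
a second feature data obtaining submodule, configured to filter out that first feature data to obtain the second feature data.
Preferably, the low-frequency feature filtering module includes:
a second allocation submodule, configured to allocate the first feature data and the feature frequency to one or more first working nodes;
a first filtering submodule, configured to filter, by the first working node, low-frequency feature data out of the allocated first feature data according to the allocated feature frequency to obtain second feature data;
a second transmission submodule, configured to transmit, by the first working node, the second feature data and feature frequency obtained by the filtering to a second working node;
a second merging submodule, configured to merge, by the second working node, the second feature data and feature frequency obtained by the filtering.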
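Preferably, the intermediate-frequency feature filtering module includes:
a random value configuration submodule, configured to configure a random value for the second feature data;
an intermediate-frequency feature determining submodule, configured to determine that the second feature data is intermediate-frequency feature data when the product of the feature frequency of the second feature data and the random value is less than a preset intermediate-frequency threshold;
a target feature data obtaining submodule, configured to filter out that second feature data to obtain the target feature data.
Preferably, the intermediate-frequency feature filtering module includes:
a third allocation submodule, configured to allocate the second feature data and the feature frequency to one or more first working nodes;
a second filtering submodule, configured to filter, by the first working node, at least part of the intermediate-frequency feature data out of the allocated second feature data according to the allocated feature frequency to obtain target feature data;
a third transmission submodule, configured to transmit, by the first working node, the target feature data and feature frequency obtained by the filtering to a second working node;
a third merging submodule, configured to merge, by the second working node, the target feature data and feature frequency obtained by the filtering.
Preferably, the apparatus further includes:
a first test model training module, configured to train a first test model with first original feature data;
a second test model training module, configured to train a second test model with the first original feature data from which feature data whose feature frequency is less than a first candidate threshold has been filtered out;
a test module, configured to perform an A/B test on the first test model and the second test model to obtain a first score and a second score;
a low-frequency threshold determining module, configured to confirm the first candidate threshold as the low-frequency threshold when the difference between the first score and the second score is less than a preset first gap threshold.
Preferably, the apparatus further includes:
a third test model training module, configured to train a third test model with second original feature data;
a fourth test model training module, configured to train a fourth test model with the second original feature data from which feature data for which the product of the feature frequency and a random value is less than a second candidate threshold has been filtered out;
a probability calculation submodule, configured to calculate a first feature probability and a second feature probability;
an intermediate-frequency threshold determining module, configured to confirm the second candidate threshold as the intermediate-frequency threshold when the difference between the first feature probability and the second feature probability is less than a preset second gap threshold;
wherein the first feature probability is the probability that the score of a positive sample in the third test model is greater than the score of a negative sample in the third test model;
the second feature probability is the probability that the score of a positive sample in the fourth test model is greater than the score of a negative sample in the fourth test model.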
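Embodiments of the present application have the following advantages:
The embodiments filter out the low-frequency feature data and at least part of the intermediate-frequency feature data. The resulting target feature data contains high-frequency feature data and may contain part of the intermediate-frequency feature data. Training a model on such target feature data leaves model performance essentially unaffected: while preserving the effect of machine learning, it greatly reduces the number of features, and therefore the number of machines and resources required, shortens training time and increases training speed, and thereby greatly reduces training cost.
Brief Description of the Drawings
FIG. 1 is a flowchart of the steps of an embodiment of a method for mining target feature data of the present application;
FIG. 2 is a structural block diagram of an embodiment of an apparatus for mining target feature data of the present application.
Detailed Description
To make the above objects, features and advantages of the present application clearer and easier to understand, the present application is described in further detail below with reference to the accompanying drawings and specific embodiments.
Referring to FIG. 1, a flowchart of the steps of an embodiment of a method for mining target feature data of the present application is shown, which may specifically include the following steps:
Step 101: count a feature frequency for first feature data;
In a specific implementation, source data may be collected from web logs. The source data is parsed and meaningless information, such as the field "-", is removed to obtain structured first feature data, such as a user ID, the ID of the commodity the user visited, the access time and the user behavior (such as click, purchase or review).
For example, a website log entry is: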
118.112.27.164---[24/Oct/2012:11:00:00+0800]"GET/b.jpg?cD17Mn0mdT17L2NoaW5hLmFsaWJhYmEuY29tL30mbT17R0VUfSZzPXsyMDB9J  nI9e2h0dHA6Ly9mdy50bWFsbC5jb20vP3NwbT0zLjE2OTQwNi4xOTg0MDEufSZhPXtza WQ9MTdjMDM2MjEtZTk2MC00NDg0LWIwNTYtZDJkMDcwM2NkYmE4fHN0aW1lPTE zNTEwNDc3MDU3OTZ8c2RhdGU9MjR8YWxpX2FwYWNoZV9pZD0xMTguMTEyLjI3Lj E2NC43MjU3MzI0NzU5ODMzMS43fGNuYT0tfSZiPXstfSZjPXtjX3NpZ25lZD0wfQ==&p ageid=7f0000017f00000113511803054674156071647816&sys=ie6.0|windowsXP|1366*768|z h-cn&ver=43&t=1351047705828HTTP/1.0"200-"Mozilla/4.0(compatible;MSIE 6.0;Windows NT 5.1;SV1;.NET CLR 2.0.50727)"118.112.27.164.135104760038.61^sid%3D17c03621-e960-4484-b056-d2d0703cdba8%7Cstime%3D1351047705796%7Csdate%3D24|cna=-^-^aid=118.112.27.164.72573247598331.7
The structured first feature data obtained after filtering is:
1,b2b-1633112210,1215596848,1,07/Aug/2013:08:27:22
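In this embodiment of the present application, the first feature data may be filtered to obtain target feature data for training a specified model.
If the amount of first feature data is small, filtering may be performed on a single computer; if it is large, filtering may be performed on multiple computers, for example in a distributed system such as Hadoop or ODPS (Open Data Processing Service).
A distributed system may refer to a computer system composed of multiple interconnected processing resources that cooperate on the same task under the control of the overall system; these resources may be geographically adjacent or geographically dispersed.
To help those skilled in the art better understand the embodiments of the present application, Hadoop is described here as one example of a distributed system.
Hadoop mainly consists of two parts: a distributed file system (Hadoop Distributed File System, HDFS) and a distributed computing framework, MapReduce.
HDFS is a highly fault-tolerant system that provides high-throughput data access and is suitable for applications with very large data sets.
MapReduce is a programming model that extracts and analyzes elements from massive source data and finally returns a result set. Its basic principle is to divide a large data analysis into small pieces, analyze them one by one, and then aggregate the extracted data.
In Hadoop, there are two machine roles for executing MapReduce: JobTracker and TaskTracker.
JobTracker may be used to schedule work, and TaskTracker may be used to execute work.
Further, in Hadoop a TaskTracker may refer to a processing node of the distributed system, and the processing node may include one or more Map nodes and one or more Reduce nodes.
In distributed computing, MapReduce handles complex issues of parallel programming such as distributed storage, job scheduling, load balancing, fault tolerance and network communication, and abstracts the processing into two functions: a map function and a reduce function. The map function splits a task into multiple tasks, and the reduce function aggregates the results of those tasks.
In Hadoop, each MapReduce task may be initialized as a Job, and each Job is divided into two phases: the map phase and the reduce phase, represented by the map function and the reduce function respectively.
The map function receives an input in the form <key, value> and produces an intermediate output that is also in the form <key, value>. The reduce function receives an input in the form <key, (list of values)> and processes this collection of values; each reduce function produces 0 or 1 outputs, which are also in the form <key, value>.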
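The <key, value> contract of the map and reduce functions can be illustrated without a cluster. The following is a minimal single-process sketch, not Hadoop's API: `run_mapreduce`, `identity_map` and `sum_reduce` are hypothetical names introduced here only to show how intermediate pairs are grouped into <key, (list of values)> before reduction.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator, Tuple

KV = Tuple[str, int]

def run_mapreduce(
    records: Iterable[KV],
    map_fn: Callable[[str, int], Iterator[KV]],
    reduce_fn: Callable[[str, list], Iterator[KV]],
) -> list:
    """Simulate map -> shuffle -> reduce in one process (illustration only)."""
    grouped = defaultdict(list)
    # Map phase: each <key, value> input may emit intermediate <key, value> pairs.
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            # Shuffle: collect intermediate values under their key.
            grouped[out_key].append(out_value)
    # Reduce phase: each call receives <key, (list of values)>.
    output = []
    for key in sorted(grouped):
        output.extend(reduce_fn(key, grouped[key]))
    return output

def identity_map(key, value):
    yield key, value

def sum_reduce(key, values):
    yield key, sum(values)

result = run_mapreduce([("a", 1), ("b", 2), ("a", 3)], identity_map, sum_reduce)
```

In a real deployment the grouping step is performed by the framework between the Map and Reduce nodes; here it is a dictionary.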
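In a specific implementation, the feature frequency, that is, the number of occurrences of the first feature data, may be counted over first feature data collected in advance, and filtering is then performed based on this feature frequency.
In an embodiment of the present application, step 101 may include the following sub-steps:
Sub-step S11: allocate the first feature data to one or more first working nodes;
In a distributed system, there are first working nodes and second working nodes that perform the filtering.
For example, in distributed systems such as Hadoop and ODPS, the first working node is a Map node and the second working node is a Reduce node.
To ensure that the count is complete, the first feature data is generally allocated so that the first feature data on the individual first working nodes (such as Map nodes) does not overlap, that is, differs from node to node.
It should be noted that the first feature data may be represented in the form of a data ID.
Suppose there are three items of first feature data: userid1, userid2 and userid3. First working node A is allocated userid1, and first working node B is allocated userid2 and userid3 but not userid1.
In practice, taking the hash remainder (hash(x)%N) allocation method as an example, each first working node (such as a Map node) is assigned a sequence number. A hash value is computed for each item of first feature data, the hash value is divided by a specified value, and the remainder is taken; the item is then allocated to the first working node (such as a Map node) whose sequence number equals the remainder.
Of course, the above allocation method is only an example. When implementing the embodiments of the present application, other allocation methods, such as random allocation (random(x)%N), may be set according to actual conditions, and the embodiments of the present application do not limit this.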
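The hash remainder allocation can be sketched as follows. This is an illustration under assumptions: the function names `assign_node` and `partition` are hypothetical, and MD5 is used only as a convenient stable hash, since the text does not name a hash function.

```python
import hashlib

def assign_node(feature_id: str, num_nodes: int) -> int:
    """Return the sequence number of the first working node for one feature ID."""
    # A stable digest is used instead of Python's built-in hash(), whose value
    # for strings changes between interpreter runs.
    digest = hashlib.md5(feature_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes  # hash(x) % N

def partition(feature_ids, num_nodes):
    """Allocate feature IDs to nodes; equal IDs always land on the same node."""
    nodes = [[] for _ in range(num_nodes)]
    for feature_id in feature_ids:
        nodes[assign_node(feature_id, num_nodes)].append(feature_id)
    return nodes

shards = partition(["userid1", "userid2", "userid3", "userid1"], num_nodes=2)
```

Because the node index depends only on the ID, every occurrence of a given ID is counted on one node, which is what keeps the per-node data non-overlapping.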
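Sub-step S12: the first working node counts the feature frequency of the allocated first feature data;
Sub-step S13: the first working node transmits the counted first feature data and feature frequency to the second working node;
In this embodiment, the first working node (such as a Map node) may count the allocated first feature data to obtain its feature frequency and pass the result through to the second working node (such as a Reduce node).
For example, the map function is defined to count the feature frequency of the first feature data.
The data format of the counting result may be (first feature data, feature frequency).
Sub-step S14: the second working node merges the counted first feature data and feature frequency.
In the second working node (such as a Reduce node), the counting results of the first working nodes (such as Map nodes) may be merged to obtain the final result.
For example, the reduce function is defined to merge the counting results of the Map nodes.
The data format of the merged result may be (first feature data, feature frequency).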
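Sub-steps S11 to S14 can be simulated in one process. The sketch below is an assumption-laden illustration: `count_feature_frequencies` is a hypothetical name, CRC32 stands in for the unspecified hash function, and plain lists stand in for Map and Reduce nodes.

```python
import zlib
from collections import Counter

def count_feature_frequencies(first_feature_data, num_map_nodes=2):
    """Count how often each feature ID occurs, map/reduce style."""
    # S11: allocate each item to a first working node by hash remainder.
    shards = [[] for _ in range(num_map_nodes)]
    for feature_id in first_feature_data:
        index = zlib.crc32(feature_id.encode("utf-8")) % num_map_nodes
        shards[index].append(feature_id)
    # S12: every first working node counts its own shard.
    partial_counts = [Counter(shard) for shard in shards]
    # S13/S14: the partial results are sent to the second working node and merged.
    merged = Counter()
    for partial in partial_counts:
        merged.update(partial)
    # Result format: (first feature data, feature frequency).
    return dict(merged)

frequencies = count_feature_frequencies(["userid1", "userid2", "userid1", "userid3"])
```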
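Step 102: filter low-frequency feature data out of the first feature data according to the feature frequency to obtain second feature data;
In this embodiment, the first feature data may be divided by feature frequency into low-frequency feature data, intermediate-frequency feature data and high-frequency feature data.
Low-frequency feature data may refer to the feature data with the lowest feature frequencies, accounting for a first proportion of the total first feature data;
intermediate-frequency feature data may refer to the feature data with higher feature frequencies (higher than those of low-frequency feature data and lower than those of high-frequency feature data), accounting for a second proportion of the total first feature data;
high-frequency feature data may refer to the feature data with the highest feature frequencies, accounting for a third proportion of the total first feature data.
Because low-frequency, intermediate-frequency and high-frequency feature data are mutually distinct, if the first feature data includes only these three kinds, the intermediate-frequency feature data can be regarded as the first feature data other than the low-frequency and high-frequency feature data.
Of course, the above division is only an example. When implementing the embodiments of the present application, those skilled in the art may divide the feature data in other ways according to actual needs, for example into ultra-low-frequency, low-frequency, intermediate-frequency, high-frequency and ultra-high-frequency feature data, and the embodiments of the present application do not limit this.
In applying this embodiment, a low-frequency threshold may be trained in advance for filtering low-frequency feature data.
Specifically, when the feature frequency of an item of first feature data is less than the preset low-frequency threshold, the item is determined to be low-frequency feature data and may be filtered out, yielding the second feature data.
Because the low-frequency feature data has been filtered out, the second feature data includes intermediate-frequency and high-frequency feature data.
Suppose there are five items of first feature data with their feature frequencies:
(f1, 2), (f2, 4), (f3, 7), (f4, 8), (f5, 9)
To filter out the low-frequency feature data accounting for 20%-25% of the total number of first feature data, the low-frequency threshold may be set to 3, so that the first feature data f1 is filtered out.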
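The five-feature example can be reproduced directly. In this minimal sketch the function name `filter_low_frequency` is hypothetical; the data and the threshold of 3 are the ones given in the example.

```python
def filter_low_frequency(feature_freqs, low_threshold):
    """Keep only features whose frequency is not below the low-frequency threshold."""
    return {
        feature: freq
        for feature, freq in feature_freqs.items()
        if freq >= low_threshold  # freq < low_threshold means low-frequency
    }

first_feature_data = {"f1": 2, "f2": 4, "f3": 7, "f4": 8, "f5": 9}
second_feature_data = filter_low_frequency(first_feature_data, low_threshold=3)
removed_share = 1 - len(second_feature_data) / len(first_feature_data)
```

With this data, one feature out of five is removed, which falls inside the 20%-25% range mentioned in the example.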
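It should be noted that the low-frequency threshold differs between domains, and a different first proportion also leads to a different low-frequency threshold. Those skilled in the art may therefore set the low-frequency threshold according to actual conditions, and the embodiments of the present application do not limit this.
In an embodiment of the present application, the low-frequency threshold may be trained as follows:
Sub-step S21: train a first test model with first original feature data;
The first original feature data is in essence also feature data and has feature frequencies. In this embodiment it may refer to source data from which low-frequency feature data has not been filtered, and it includes low-frequency, intermediate-frequency and high-frequency feature data.
Machine learning may be performed on the first original feature data, from which low-frequency feature data has not been filtered, to train the first test model.
Sub-step S22: train a second test model with the first original feature data from which feature data whose feature frequency is less than a first candidate threshold has been filtered out;
In a specific implementation, a first candidate threshold may be set in advance as the initial low-frequency threshold.
Filtering out of the first original feature data the feature data whose feature frequency is less than the first candidate threshold is regarded as filtering the low-frequency features out of the original feature data.
Machine learning is performed on the first original feature data with the low-frequency features filtered out, to train the second test model.
Sub-step S23: perform an A/B test on the first test model and the second test model to obtain a first score and a second score;
Sub-step S24: when the difference between the first score and the second score is less than a preset first gap threshold, confirm the first candidate threshold as the low-frequency threshold.
A/B testing may refer to preparing two schemes, A and B (for example, the first test model and the second test model), for the same goal (for example, the low-frequency threshold), letting some users use scheme A and other users use scheme B, recording how the users use them (for example, testing with the first test model to obtain the first score and with the second test model to obtain the second score), and judging which scheme better meets the goal.
Taking web page information as an example, the first test model is used to select first web page information (such as advertisement data or news data), and the second test model is used to select second web page information (such as advertisement data or news data).
For each visiting client, the first test model or the second test model is chosen with 50% probability to serve it, that is, the first web page information or the second web page information is displayed.
The click-through rate of the first web page information is recorded as the first score, and the click-through rate of the second web page information is recorded as the second score.
If the first score and the second score are approximately equal (that is, their difference is less than the preset first gap threshold), the first candidate threshold can be considered suitable as the low-frequency threshold; otherwise, a new first candidate threshold is selected and training is repeated.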
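The search over candidate thresholds in sub-steps S21 to S24 can be written as a small loop. The sketch below is illustrative only: `choose_low_frequency_threshold` is a hypothetical name, the A/B test is abstracted into a callback, and the scores in the usage example are invented numbers, not measurements.

```python
from typing import Callable, Iterable, Optional

def choose_low_frequency_threshold(
    candidates: Iterable[int],
    first_score: float,
    second_score_for: Callable[[int], float],
    gap_threshold: float,
) -> Optional[int]:
    """Return the first candidate whose A/B score stays close to the baseline."""
    for candidate in candidates:
        # second_score_for(candidate) stands for: filter with this candidate,
        # train the second test model, run the A/B test, return its score.
        second_score = second_score_for(candidate)
        if abs(first_score - second_score) < gap_threshold:
            return candidate  # scores approximately equal: accept
        # Otherwise select a new candidate and repeat.
    return None

# Invented click-through rates, for illustration only.
made_up_scores = {5: 0.030, 3: 0.0495, 2: 0.0499}
chosen = choose_low_frequency_threshold(
    candidates=[5, 3, 2],
    first_score=0.050,
    second_score_for=made_up_scores.__getitem__,
    gap_threshold=0.001,
)
```

Trying larger candidates first is a design choice of this sketch: it accepts the most aggressive filter that still keeps the score gap within the limit.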
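In an embodiment of the present application, step 102 may include the following sub-steps:
Sub-step S31: allocate the first feature data and the feature frequency to one or more first working nodes;
In a distributed system, there are first working nodes and second working nodes that perform the filtering.
For example, in distributed systems such as Hadoop and ODPS, the first working node is a Map node and the second working node is a Reduce node.
In this embodiment, the first feature data and the feature frequency may be allocated to one or more first working nodes by methods such as hash remainder allocation (hash(x)%N) or random allocation (random(x)%N).
It should be noted that the first feature data may be represented in the form of a data ID.
Sub-step S32: the first working node filters low-frequency feature data out of the allocated first feature data according to the allocated feature frequency to obtain second feature data;
Sub-step S33: the first working node transmits the second feature data and feature frequency obtained by the filtering to the second working node;
In this embodiment, the first working node (such as a Map node) may filter the low-frequency features out of the allocated first feature data to obtain the second feature data and pass it through to the second working node (such as a Reduce node).
For example, the map function is defined so that, when the feature frequency of an item of first feature data is less than the preset low-frequency threshold, the item is determined to be low-frequency feature data and is filtered out.
The data format of the filtering result may be (second feature data, feature frequency).
It should be noted that, because the first feature data and its feature frequency are paired, when low-frequency feature data is filtered out its feature frequency is filtered out with it, and the feature frequency of the retained second feature data is retained with it.
Sub-step S34: the second working node merges the second feature data and feature frequency obtained by the filtering.
In the second working node (such as a Reduce node), the filtering results of the first working nodes (such as Map nodes) may be merged to obtain the final result.
For example, the reduce function is defined to merge the filtering results of the Map nodes.
The data format of the merged result may be (second feature data, feature frequency).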
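The map and reduce functions of sub-steps S32 to S34 can be sketched as two plain functions. The names `map_filter_low` and `reduce_merge` are hypothetical, and the two lists standing in for Map nodes reuse the five-feature example from step 102.

```python
def map_filter_low(pairs, low_threshold):
    """Map function on one first working node: drop low-frequency pairs."""
    # Each pair is (feature data, feature frequency); the frequency travels
    # with its feature, so both are dropped or kept together.
    return [(feature, freq) for feature, freq in pairs if freq >= low_threshold]

def reduce_merge(filtered_parts):
    """Reduce function on the second working node: merge per-node results."""
    merged = []
    for part in filtered_parts:
        merged.extend(part)
    return sorted(merged)

# Two simulated first working nodes with non-overlapping data.
node_a = [("f1", 2), ("f3", 7)]
node_b = [("f2", 4), ("f4", 8), ("f5", 9)]
second_feature_pairs = reduce_merge(
    map_filter_low(pairs, low_threshold=3) for pairs in (node_a, node_b)
)
```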
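Step 103: filter at least part of the intermediate-frequency feature data out of the second feature data according to the feature frequency to obtain target feature data.
Because intermediate-frequency feature data is useful for model training, in this embodiment intermediate-frequency feature data may be filtered out of the second feature data in a random manner.
Which part is filtered out is random, that is, all intermediate-frequency feature data is treated equally.
Besides high-frequency feature data, the target feature data remaining after filtering may or may not include intermediate-frequency feature data.
In applying this embodiment, an intermediate-frequency threshold is trained in advance for filtering intermediate-frequency feature data.
Specifically, a random value (that is, a randomly generated value) may be configured for the second feature data, for example by means of a Poisson distribution.
When the product of the feature frequency of an item of second feature data and the random value is less than the preset intermediate-frequency threshold, the item may be determined to be intermediate-frequency feature data and is filtered out, yielding the target feature data.
Taking the Poisson distribution as an example, because the Poisson distribution can generate floating-point numbers in (0, 1) as random values, 0.1 may be used as the intermediate-frequency threshold, and second feature data satisfying the following formula may be regarded as intermediate-frequency features:
feature frequency * p < 0.1
where p is the random value generated by the Poisson distribution.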
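The random filtering rule can be sketched in a few lines. Note the assumption: the text attributes the random value p to a Poisson distribution, whereas this sketch draws p uniformly from [0, 1) with Python's `random` module, which is a substitution made here only to obtain values in the stated range. The function name `filter_intermediate_frequency` is hypothetical.

```python
import random

def filter_intermediate_frequency(feature_freqs, mid_threshold, rng):
    """Randomly drop features for which frequency * p < mid_threshold."""
    kept = {}
    for feature, freq in feature_freqs.items():
        p = rng.random()  # assumption: uniform in [0, 1), not Poisson
        if freq * p < mid_threshold:
            continue  # treated as intermediate-frequency data and filtered out
        kept[feature] = freq
    return kept

second_feature_data = {"f2": 4, "f3": 7, "f4": 8, "f5": 9}
target_feature_data = filter_intermediate_frequency(
    second_feature_data, mid_threshold=0.1, rng=random.Random(42)
)
```

Under the uniform assumption, a feature with frequency n is dropped with probability min(1, threshold/n), so features with higher frequency are more likely to survive while no particular intermediate-frequency feature is singled out.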
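It should be noted that the intermediate-frequency threshold differs between domains, and a different second proportion also leads to a different intermediate-frequency threshold. Those skilled in the art may therefore set the intermediate-frequency threshold according to actual conditions, and the embodiments of the present application do not limit this.
In an embodiment of the present application, the intermediate-frequency threshold may be trained as follows:
Sub-step S41: train a third test model with second original feature data;
The second original feature data is in essence also feature data and has feature frequencies. In this embodiment it may refer to source data from which intermediate-frequency feature data has not been filtered, and it includes low-frequency, intermediate-frequency and high-frequency feature data.
Machine learning may be performed on the second original feature data, from which intermediate-frequency feature data has not been filtered, to train the third test model.
Sub-step S42: train a fourth test model with the second original feature data from which feature data for which the product of the feature frequency and a random value is less than a second candidate threshold has been filtered out;
In a specific implementation, a second candidate threshold may be set in advance as the initial intermediate-frequency threshold.
Filtering out of the second original feature data the feature data for which the product of the feature frequency and the random value is less than the second candidate threshold is regarded as filtering the intermediate-frequency features out of the original feature data.
Machine learning is performed on the second original feature data with the intermediate-frequency features filtered out, to train the fourth test model.
Sub-step S43: calculate a first feature probability and a second feature probability;
Sub-step S44: when the difference between the first feature probability and the second feature probability is less than a preset second gap threshold, confirm the second candidate threshold as the intermediate-frequency threshold.
In a specific implementation, test data (including positive samples and negative samples) may be extracted, and the AUC (Area Under Curve) value is computed for the third test model and the fourth test model.
The AUC value is the area under the ROC (Receiver Operating Characteristic) curve and lies between 0 and 1. It gives an intuitive measure of classifier quality: in general, the larger the AUC value, the better the classifier performs.
Specifically, the AUC value is a probability: when a positive sample and a negative sample are picked at random, the probability that the current classifier ranks the positive sample ahead of the negative sample according to the computed score is the AUC value.
In general, the larger the AUC value, the more likely the current classification algorithm is to rank positive samples ahead of negative samples, and hence the better it classifies.
Accordingly, in this embodiment, the first feature probability is the probability that the score of a positive sample in the third test model is greater than the score of a negative sample in the third test model;
the second feature probability is the probability that the score of a positive sample in the fourth test model is greater than the score of a negative sample in the fourth test model.
Therefore, the AUC value is computed using a property of AUC: it is equivalent to the Wilcoxon-Mann-Whitney test.
The Wilcoxon-Mann-Whitney test measures, for an arbitrary positive sample and an arbitrary negative sample, the probability that the score of the positive sample is greater than the score of the negative sample.
Method 1: among all M×N positive-negative sample pairs (M is the number of positive samples and N is the number of negative samples), count the pairs in which the score of the positive sample is greater than the score of the negative sample.
When the scores of the positive and negative samples in a pair are equal, the pair is counted as 0.5. The total is then divided by M×N: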
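AUC=(sum over all M×N positive-negative pairs of I(s_pos, s_neg))/(M*N), where I(s_pos, s_neg) is 1 if s_pos>s_neg, 0.5 if s_pos=s_neg, and 0 otherwise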
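Method 2: sort the scores from largest to smallest, assign rank n (n = M+N) to the sample with the largest score, rank n-1 to the sample with the second-largest score, and so on.
Add up the ranks of all positive samples, then subtract the value this sum takes when the positive samples have the M smallest scores. The result is the number of pairs, among all samples, in which the score of the positive sample is greater than the score of the negative sample. Dividing by M×N gives:
AUC=((sum of the ranks of all positive samples)-M*(M+1)/2)/(M*N)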
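Both ways of computing AUC can be checked against each other. In the sketch below the function names `auc_by_pairs` and `auc_by_ranks` are hypothetical and the scores are toy values. The rank-based version subtracts M(M+1)/2, which is the value the rank sum takes when the M positive samples occupy the M lowest ranks, and gives tied scores their average rank so that ties count as 0.5, matching Method 1.

```python
def auc_by_pairs(pos_scores, neg_scores):
    """Method 1: count positive-negative pairs; ties count as 0.5."""
    total = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos_scores) * len(neg_scores))

def auc_by_ranks(pos_scores, neg_scores):
    """Method 2: rank-sum formula; the largest score has rank M + N."""
    m, n = len(pos_scores), len(neg_scores)
    # Ascending sort, so position i (0-based) carries rank i + 1.
    labelled = sorted([(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores])
    rank_sum = 0.0
    i = 0
    while i < len(labelled):
        j = i
        while j < len(labelled) and labelled[j][0] == labelled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j for a tie group
        rank_sum += average_rank * sum(label for _, label in labelled[i:j])
        i = j
    return (rank_sum - m * (m + 1) / 2.0) / (m * n)

# Toy scores chosen for illustration; one tie at 0.4.
positives = [0.9, 0.8, 0.4]
negatives = [0.7, 0.4, 0.1]
auc_pairs = auc_by_pairs(positives, negatives)
auc_ranks = auc_by_ranks(positives, negatives)
```

Method 1 costs M×N comparisons, while Method 2 is dominated by the sort, which matters when the test set is large.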
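If the first feature probability and the second feature probability are approximately equal (that is, their difference is less than the preset second gap threshold), the second candidate threshold can be considered suitable as the intermediate-frequency threshold; otherwise, a new second candidate threshold is selected and training is repeated.
In an embodiment of the present application, step 103 may include the following sub-steps:
Sub-step S51: allocate the second feature data and the feature frequency to one or more first working nodes;
In a distributed system, there are first working nodes and second working nodes that perform the filtering.
For example, in distributed systems such as Hadoop and ODPS, the first working node is a Map node and the second working node is a Reduce node.
In this embodiment, the second feature data and the feature frequency may be allocated to one or more first working nodes by methods such as hash remainder allocation (hash(x)%N) or random allocation (random(x)%N).
It should be noted that the second feature data may be represented in the form of a data ID.
Sub-step S52: the first working node filters at least part of the intermediate-frequency feature data out of the allocated second feature data according to the allocated feature frequency to obtain target feature data;
Sub-step S53: the first working node transmits the target feature data and feature frequency obtained by the filtering to the second working node;
In this embodiment, the first working node (such as a Map node) may filter the intermediate-frequency features out of the allocated second feature data to obtain the target feature data and pass it through to the second working node (such as a Reduce node).
For example, the map function is defined so that, when the product of the feature frequency of an item of second feature data and the random value is less than the preset intermediate-frequency threshold, the item is determined to be intermediate-frequency feature data and is filtered out.
The data format of the filtering result may be (target feature data, feature frequency).
It should be noted that, because the second feature data and its feature frequency are paired, when intermediate-frequency feature data is filtered out its feature frequency is filtered out with it, and the feature frequency of the retained target feature data is retained with it.
Sub-step S54: the second working node merges the target feature data and feature frequency obtained by the filtering.
In the second working node (such as a Reduce node), the filtering results of the first working nodes (such as Map nodes) may be merged to obtain the final result.
For example, the reduce function is defined to merge the filtering results of the Map nodes.
The data format of the merged result may be (target feature data, feature frequency).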
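Steps 101 to 103 can be chained into one single-machine pipeline. The sketch below rests on assumptions: `mine_target_features` is a hypothetical name, the random value is drawn uniformly from [0, 1) rather than from the Poisson distribution named in the text, and the distribution across working nodes is omitted.

```python
import random
from collections import Counter

def mine_target_features(raw_features, low_threshold, mid_threshold, seed=0):
    """Count, drop low-frequency features, then randomly thin the rest."""
    rng = random.Random(seed)
    # Step 101: count the feature frequency of the first feature data.
    frequencies = Counter(raw_features)
    # Step 102: filter low-frequency feature data to get the second feature data.
    second = {f: n for f, n in frequencies.items() if n >= low_threshold}
    # Step 103: filter at least part of the intermediate-frequency feature data;
    # a feature is dropped when frequency * random value < mid_threshold.
    target = {f: n for f, n in second.items() if n * rng.random() >= mid_threshold}
    return target

raw = ["a", "a", "a", "b", "c", "c"]
# With mid_threshold = 0 nothing is dropped in step 103, which isolates step 102.
only_low_filtered = mine_target_features(raw, low_threshold=2, mid_threshold=0.0)
```

The result keeps the (feature data, feature frequency) pairing throughout, so it can be passed to model training without recounting.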
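The target feature data, from which the low-frequency feature data and at least part of the intermediate-frequency feature data have been filtered out, may be used to train a specified model, for example an SVM (Support Vector Machine), a logistic regression model or a deep learning DP model; the embodiments of the present application do not limit this.
In many cases, low-frequency and intermediate-frequency feature data together account for roughly 80%-90% of the total feature data, and high-frequency feature data for roughly 10%-20%.
Ideally, therefore, a model could be trained while retaining only the 10%-20% of high-frequency feature data.
However, much intermediate-frequency feature data captures users' long-tail needs well and often cannot simply be discarded.
As for low-frequency feature data, it occurs very rarely, and when the total amount of feature data is large, filtering it out has essentially no effect on model performance.
For example, in deciding whether a user will buy a book, a great deal of feature data can be considered, including:
low-frequency feature data: the weather;
intermediate-frequency feature data: the book's cover;
high-frequency feature data: the book's quality.
In practice, most users buying a book hardly consider the weather, seldom consider the cover, and focus on the quality of the book.
Therefore, filtering out the low-frequency feature data (the weather) or the intermediate-frequency feature data (the cover), while retaining the high-frequency feature data (the quality) or the intermediate-frequency feature data (the cover), has essentially no effect on the performance of the trained book-purchase model.
It can thus be seen that what is captured is the features of the whole population: considering the main features of the population (such as the book's quality) and filtering out secondary features (such as the weather) has essentially no effect on model performance.
At present, features are filtered by a single frequency threshold without distinguishing low-frequency, intermediate-frequency and high-frequency feature data. Filtering features indiscriminately in this way may remove a large amount of useful feature data (such as intermediate-frequency or even high-frequency features), causing the effect of machine learning to drop significantly.
The embodiments of the present application filter out the low-frequency feature data and at least part of the intermediate-frequency feature data. The resulting target feature data contains high-frequency feature data and may contain part of the intermediate-frequency feature data. Training a model on such target feature data leaves model performance essentially unaffected: while preserving the effect of machine learning, it greatly reduces the number of features, and therefore the number of machines and resources required, shortens training time and increases training speed, and thereby greatly reduces training cost.
It should be noted that, for simplicity of description, the method embodiments are expressed as a series of combined actions. Those skilled in the art should understand, however, that the embodiments of the present application are not limited by the described order of actions, because according to the embodiments some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present application.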
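Referring to FIG. 2, a structural block diagram of an embodiment of an apparatus for mining target feature data of the present application is shown, which may specifically include the following modules:
a feature frequency statistics module 201, configured to count a feature frequency for first feature data;
a low-frequency feature filtering module 202, configured to filter low-frequency feature data out of the first feature data according to the feature frequency to obtain second feature data;
an intermediate-frequency feature filtering module 203, configured to filter at least part of the intermediate-frequency feature data out of the second feature data according to the feature frequency to obtain target feature data.
In an embodiment of the present application, the apparatus may further include the following module:
a model training module, configured to train a specified model with the target feature data.
In an embodiment of the present application, the feature frequency statistics module 201 may include the following submodules:
a first allocation submodule, configured to allocate the first feature data to one or more first working nodes;
a frequency statistics submodule, configured to count, by the first working node, the feature frequency of the allocated first feature data;
a first transmission submodule, configured to transmit, by the first working node, the counted first feature data and feature frequency to a second working node;
a first merging submodule, configured to merge, by the second working node, the counted first feature data and feature frequency.
In an embodiment of the present application, the low-frequency feature filtering module 202 may include the following submodules:
a low-frequency feature determining submodule, configured to determine that the first feature data is low-frequency feature data when the feature frequency of the first feature data is less than a preset low-frequency threshold;
a second feature data obtaining submodule, configured to filter out that first feature data to obtain the second feature data.
In another embodiment of the present application, the low-frequency feature filtering module 202 may include the following submodules:
a second allocation submodule, configured to allocate the first feature data and the feature frequency to one or more first working nodes;
a first filtering submodule, configured to filter, by the first working node, low-frequency feature data out of the allocated first feature data according to the allocated feature frequency to obtain second feature data;
a second transmission submodule, configured to transmit, by the first working node, the second feature data and feature frequency obtained by the filtering to a second working node;
a second merging submodule, configured to merge, by the second working node, the second feature data and feature frequency obtained by the filtering.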
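In an embodiment of the present application, the intermediate-frequency feature filtering module 203 may include the following submodules:
a random value configuration submodule, configured to configure a random value for the second feature data;
an intermediate-frequency feature determining submodule, configured to determine that the second feature data is intermediate-frequency feature data when the product of the feature frequency of the second feature data and the random value is less than a preset intermediate-frequency threshold;
a target feature data obtaining submodule, configured to filter out that second feature data to obtain the target feature data.
In another embodiment of the present application, the intermediate-frequency feature filtering module 203 may include the following submodules:
a third allocation submodule, configured to allocate the second feature data and the feature frequency to one or more first working nodes;
a second filtering submodule, configured to filter, by the first working node, at least part of the intermediate-frequency feature data out of the allocated second feature data according to the allocated feature frequency to obtain target feature data;
a third transmission submodule, configured to transmit, by the first working node, the target feature data and feature frequency obtained by the filtering to a second working node;
a third merging submodule, configured to merge, by the second working node, the target feature data and feature frequency obtained by the filtering.
In an embodiment of the present application, the apparatus may further include the following modules:
a first test model training module, configured to train a first test model with first original feature data;
a second test model training module, configured to train a second test model with the first original feature data from which feature data whose feature frequency is less than a first candidate threshold has been filtered out;
a test module, configured to perform an A/B test on the first test model and the second test model to obtain a first score and a second score;
a low-frequency threshold determining module, configured to confirm the first candidate threshold as the low-frequency threshold when the difference between the first score and the second score is less than a preset first gap threshold.
In an embodiment of the present application, the apparatus may further include the following modules:
a third test model training module, configured to train a third test model with second original feature data;
a fourth test model training module, configured to train a fourth test model with the second original feature data from which feature data for which the product of the feature frequency and a random value is less than a second candidate threshold has been filtered out;
a probability calculation submodule, configured to calculate a first feature probability and a second feature probability;
an intermediate-frequency threshold determining module, configured to confirm the second candidate threshold as the intermediate-frequency threshold when the difference between the first feature probability and the second feature probability is less than a preset second gap threshold;
wherein the first feature probability is the probability that the score of a positive sample in the third test model is greater than the score of a negative sample in the third test model;
the second feature probability is the probability that the score of a positive sample in the fourth test model is greater than the score of a negative sample in the fourth test model.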
对于装置实施例而言,由于其与方法实施例基本相似,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。
本说明书中的各个实施例均采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似的部分互相参见即可。
本领域内的技术人员应明白,本申请实施例的实施例可提供为方法、装置、或计算机程序产品。因此,本申请实施例可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本申请实施例可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。
在一个典型的配置中,所述计算机设备包括一个或多个处理器(CPU)、输入/输出接口、网络接口和内存。内存可能包括计算机可读介质中的非永久性存储器,随机存取存储器(RAM)和/或非易失性内存等形式,如只读存储器(ROM)或闪存(flash RAM)。内存是计算机可读介质的示例。计算机可读介质包括永久性和非永久性、可移动和非可移动媒体可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、数据结构、程序的模块或其他数据。计算机的存储介质的例子包括,但不限于相变内存(PRAM)、 静态随机存取存储器(SRAM)、动态随机存取存储器(DRAM)、其他类型的随机存取存储器(RAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、快闪记忆体或其他内存技术、只读光盘只读存储器(CD-ROM)、数字多功能光盘(DVD)或其他光学存储、磁盒式磁带,磁带磁磁盘存储或其他磁性存储设备或任何其他非传输介质,可用于存储可以被计算设备访问的信息。按照本文中的界定,计算机可读介质不包括非持续性的电脑可读媒体(transitory media),如调制的数据信号和载波。
本申请实施例是参照根据本申请实施例的方法、终端设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理终端设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理终端设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理终端设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。
这些计算机程序指令也可装载到计算机或其他可编程数据处理终端设备上,使得在计算机或其他可编程终端设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程终端设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。
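Although preferred embodiments of the present application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present application.
Finally, it should also be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or terminal device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or terminal device that includes the element.
The method for mining target feature data and the apparatus for mining target feature data provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is intended only to help in understanding the method of the present application and its core idea. Meanwhile, those of ordinary skill in the art may, in accordance with the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.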
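As a compact recap of the method summarized above, the three stages (count feature frequencies, drop low-frequency features, randomly drop part of the mid-frequency features) can be sketched in a few lines. This is a hedged illustration rather than the applicant's implementation: the function name, the representation of samples as lists of feature identifiers, and the assumption that the random value is drawn uniformly from [0, 1) are all hypothetical.

```python
import random
from collections import Counter


def mine_target_features(samples, low_threshold, mid_threshold, rng=None):
    """Return {feature: frequency} for the features that survive both filters.

    samples: iterable of samples, each an iterable of feature identifiers.
    rng: optional random.Random instance, so runs can be made reproducible.
    """
    if rng is None:
        rng = random.Random()

    # Stage 1: count how often each feature occurs in the first feature data.
    freq = Counter(f for sample in samples for f in sample)

    # Stage 2: features below the low-frequency threshold are filtered out.
    second = {f: c for f, c in freq.items() if c >= low_threshold}

    # Stage 3: a feature is treated as mid-frequency and filtered out when
    # frequency * random value falls below the mid-frequency threshold.
    target = {}
    for f, c in second.items():
        if c * rng.random() < mid_threshold:
            continue
        target[f] = c
    return target
```

Under the uniform [0, 1) assumption, a feature with frequency c at or above the mid-frequency threshold T survives stage 3 with probability 1 - T/c, so the more frequent a feature is, the more likely it is to be kept.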

Claims (18)

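  1. A method for mining target feature data, characterized by comprising:
    counting feature frequencies for first feature data;
    filtering low-frequency feature data out of the first feature data according to the feature frequencies, to obtain second feature data;
    filtering at least part of the mid-frequency feature data out of the second feature data according to the feature frequencies, to obtain target feature data.
  2. The method according to claim 1, characterized by further comprising:
    training a specified model using the target feature data.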
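  3. The method according to claim 1, characterized in that the step of counting feature frequencies for the first feature data comprises:
    distributing the first feature data to one or more first worker nodes;
    counting, by the first worker nodes, the feature frequencies of the distributed first feature data;
    transmitting, by the first worker nodes, the counted first feature data and feature frequencies to a second worker node;
    merging, by the second worker node, the counted first feature data and feature frequencies.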
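  4. The method according to claim 1, characterized in that the step of filtering low-frequency feature data out of the first feature data according to the feature frequencies, to obtain second feature data, comprises:
    when the feature frequency of the first feature data is less than a preset low-frequency threshold, determining that the first feature data is low-frequency feature data;
    filtering out the first feature data, to obtain the second feature data.
  5. The method according to any one of claims 1 to 4, characterized in that the step of filtering low-frequency feature data out of the first feature data according to the feature frequencies, to obtain second feature data, comprises:
    distributing the first feature data and the feature frequencies to one or more first worker nodes;
    filtering, by the first worker nodes, low-frequency feature data out of the distributed first feature data according to the distributed feature frequencies, to obtain second feature data;
    transmitting, by the first worker nodes, the second feature data obtained by filtering and the feature frequencies to a second worker node;
    merging, by the second worker node, the second feature data obtained by filtering and the feature frequencies.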
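  6. The method according to claim 1, characterized in that the step of filtering at least part of the mid-frequency feature data out of the second feature data according to the feature frequencies, to obtain target feature data, comprises:
    assigning a random value to the second feature data;
    when the product of the feature frequency of the second feature data and the random value is less than a preset mid-frequency threshold, determining that the second feature data is mid-frequency feature data;
    filtering out the second feature data, to obtain the target feature data.
  7. The method according to any one of claims 1, 2, 3 and 6, characterized in that the step of filtering at least part of the mid-frequency feature data out of the second feature data according to the feature frequencies, to obtain target feature data, comprises:
    distributing the second feature data and the feature frequencies to one or more first worker nodes;
    filtering, by the second worker node, at least part of the mid-frequency feature data out of the distributed second feature data according to the distributed feature frequencies, to obtain target feature data;
    transmitting, by the first worker nodes, the target feature data obtained by filtering and the feature frequencies to a second worker node;
    merging, by the second worker node, the target feature data obtained by filtering and the feature frequencies.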
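  8. The method according to claim 1 or 4, characterized by further comprising:
    training a first test model using first original feature data;
    training a second test model using the first original feature data from which feature data whose feature frequency is less than a first candidate threshold has been filtered out;
    performing an A/B test on the first test model and the second test model, to obtain a first score and a second score;
    when the difference between the first click-through rate and the second click-through rate is less than a preset first gap threshold, confirming the first candidate threshold as the low-frequency threshold.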
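  9. The method according to claim 1 or 6, characterized by further comprising:
    training a third test model using second original feature data;
    training a fourth test model using the second original feature data from which feature data whose product of feature frequency and random value is less than a second candidate threshold has been filtered out;
    calculating a first feature probability and a second feature probability;
    when the difference between the first feature probability and the second feature probability is less than a preset second gap threshold, confirming the second candidate threshold as the mid-frequency threshold;
    wherein the first feature probability is the probability that the score of a positive sample in the third test model is greater than the score of a negative sample in the third test model;
    and the second feature probability is the probability that the score of a positive sample in the fourth test model is greater than the score of a negative sample in the fourth test model.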
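  10. An apparatus for mining target feature data, characterized by comprising:
    a feature frequency counting module, configured to count feature frequencies for first feature data;
    a low-frequency feature filtering module, configured to filter low-frequency feature data out of the first feature data according to the feature frequencies, to obtain second feature data;
    a mid-frequency feature filtering module, configured to filter at least part of the mid-frequency feature data out of the second feature data according to the feature frequencies, to obtain target feature data.
  11. The apparatus according to claim 10, characterized by further comprising:
    a model training module, configured to train a specified model using the target feature data.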
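  12. The apparatus according to claim 10, characterized in that the feature frequency counting module comprises:
    a first distribution submodule, configured to distribute the first feature data to one or more first worker nodes;
    a frequency counting submodule, configured to count, by the first worker nodes, the feature frequencies of the distributed first feature data;
    a first transmission submodule, configured to transmit, by the first worker nodes, the counted first feature data and feature frequencies to a second worker node;
    a first merging submodule, configured to merge, by the second worker node, the counted first feature data and feature frequencies.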
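  13. The apparatus according to claim 10, characterized in that the low-frequency feature filtering module comprises:
    a low-frequency feature determination submodule, configured to determine that the first feature data is low-frequency feature data when the feature frequency of the first feature data is less than a preset low-frequency threshold;
    a second feature data obtaining submodule, configured to filter out the first feature data, to obtain the second feature data.
  14. The apparatus according to any one of claims 10 to 13, characterized in that the low-frequency feature filtering module comprises:
    a second distribution submodule, configured to distribute the first feature data and the feature frequencies to one or more first worker nodes;
    a first filtering submodule, configured to filter, by the first worker nodes, low-frequency feature data out of the distributed first feature data according to the distributed feature frequencies, to obtain second feature data;
    a second transmission submodule, configured to transmit, by the first worker nodes, the second feature data obtained by filtering and the feature frequencies to a second worker node;
    a second merging submodule, configured to merge, by the second worker node, the second feature data obtained by filtering and the feature frequencies.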
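  15. The apparatus according to claim 10, characterized in that the mid-frequency feature filtering module comprises:
    a random value assignment submodule, configured to assign a random value to the second feature data;
    a mid-frequency feature determination submodule, configured to determine that the second feature data is mid-frequency feature data when the product of the feature frequency of the second feature data and the random value is less than a preset mid-frequency threshold;
    a target feature data obtaining submodule, configured to filter out the second feature data, to obtain the target feature data.
  16. The apparatus according to any one of claims 10, 11, 12 and 15, characterized in that the mid-frequency feature filtering module comprises:
    a third distribution submodule, configured to distribute the second feature data and the feature frequencies to one or more first worker nodes;
    a second filtering submodule, configured to filter, by the second worker node, at least part of the mid-frequency feature data out of the distributed second feature data according to the distributed feature frequencies, to obtain target feature data;
    a third transmission submodule, configured to transmit, by the first worker nodes, the target feature data obtained by filtering and the feature frequencies to a second worker node;
    a third merging submodule, configured to merge, by the second worker node, the target feature data obtained by filtering and the feature frequencies.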
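  17. The apparatus according to claim 10 or 13, characterized by further comprising:
    a first test model training module, configured to train a first test model using first original feature data;
    a second test model training module, configured to train a second test model using the first original feature data from which feature data whose feature frequency is less than a first candidate threshold has been filtered out;
    a test module, configured to perform an A/B test on the first test model and the second test model, to obtain a first score and a second score;
    a low-frequency threshold determination module, configured to confirm the first candidate threshold as the low-frequency threshold when the difference between the first click-through rate and the second click-through rate is less than a preset first gap threshold.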
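  18. The apparatus according to claim 10 or 15, characterized by further comprising:
    a third test model training module, configured to train a third test model using second original feature data;
    a fourth test model training module, configured to train a fourth test model using the second original feature data from which feature data whose product of feature frequency and random value is less than a second candidate threshold has been filtered out;
    a probability calculation submodule, configured to calculate a first feature probability and a second feature probability;
    a mid-frequency threshold determination module, configured to confirm the second candidate threshold as the mid-frequency threshold when the difference between the first feature probability and the second feature probability is less than a preset second gap threshold;
    wherein the first feature probability is the probability that the score of a positive sample in the third test model is greater than the score of a negative sample in the third test model;
    and the second feature probability is the probability that the score of a positive sample in the fourth test model is greater than the score of a negative sample in the fourth test model.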
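
The distributed counting recited in claims 3 and 12 (first worker nodes count their share, a second worker node merges the partial results) can be illustrated with a single-process simulation. This sketch is not part of the claims: the function names and the round-robin sharding are assumptions, and plain function calls stand in for the transmission between nodes.

```python
from collections import Counter


def count_on_worker(shard):
    """Role of a first worker node: count feature frequencies in its shard."""
    return Counter(f for sample in shard for f in sample)


def merge_counts(partial_counts):
    """Role of the second worker node: sum the partial frequency tables."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


def distributed_feature_frequency(samples, n_workers):
    """Shard samples round-robin (an assumption), count per shard, then merge."""
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    shards = [samples[i::n_workers] for i in range(n_workers)]
    return merge_counts(count_on_worker(shard) for shard in shards)
```

Because addition of counts is associative and commutative, the merged result does not depend on how the samples are sharded, which is what allows the counting step to be spread across nodes.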
PCT/CN2017/072404 2016-02-05 2017-01-24 Method and apparatus for mining target feature data WO2017133568A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/063,755 US20200272933A1 (en) 2016-02-05 2017-01-24 Method and apparatus for mining target feature data

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610082536.1 2016-02-05
CN201610082536.1A CN107045511B (zh) 2016-02-05 2016-02-05 Method and apparatus for mining target feature data

Publications (1)

Publication Number Publication Date
WO2017133568A1 true WO2017133568A1 (zh) 2017-08-10

Family

ID=59499365

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/072404 WO2017133568A1 (zh) 2016-02-05 2017-01-24 Method and apparatus for mining target feature data

Country Status (4)

Country Link
US (1) US20200272933A1 (zh)
CN (1) CN107045511B (zh)
TW (1) TW201732655A (zh)
WO (1) WO2017133568A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111353626A (zh) * 2018-12-21 2020-06-30 阿里巴巴集团控股有限公司 Data auditing method, apparatus, and device

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
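CN108226395B (zh) * 2017-12-28 2020-09-04 广东中联兴环保科技有限公司 Method and apparatus for determining sudden early-warning thresholds for the atmospheric environment of an industrial park
CN112106067B (zh) * 2018-05-18 2024-07-02 北京嘀嘀无限科技发展有限公司 System and method for user analysis
CN110825966B (zh) * 2019-10-31 2022-03-04 广州市百果园信息技术有限公司 Information recommendation method and apparatus, recommendation server, and storage medium
CN112906309B (zh) * 2021-03-30 2024-04-30 第四范式(北京)技术有限公司 Distributed training method, apparatus, and system for machine learning models
TWI773483B (zh) * 2021-08-12 2022-08-01 國立臺東專科學校 Sensing data processing method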
US11892989B2 (en) * 2022-03-28 2024-02-06 Bank Of America Corporation System and method for predictive structuring of electronic data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
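CN101706807A (zh) * 2009-11-27 2010-05-12 清华大学 Method for automatically acquiring new words from Chinese web pages
CN102509174A (zh) * 2011-11-01 2012-06-20 冶金自动化研究设计院 Method for an alarm-limit self-learning system based on industrial process data
CN103020712A (zh) * 2012-12-28 2013-04-03 东北大学 Distributed classification apparatus and method for massive microblog data
CN104008143A (zh) * 2014-05-09 2014-08-27 启秀科技(北京)有限公司 Method for constructing a vocational ability index system based on data mining
CN104899190A (zh) * 2015-06-04 2015-09-09 百度在线网络技术(北京)有限公司 Method and apparatus for generating a word segmentation dictionary, and word segmentation processing method and apparatus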

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
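JP4811433B2 (ja) * 2007-09-05 2011-11-09 ソニー株式会社 Image selection device, image selection method, and program
CN104391835B (zh) * 2014-09-30 2017-09-29 中南大学 Method and apparatus for selecting feature words in text
CN104702492B (zh) * 2015-03-19 2019-10-18 百度在线网络技术(北京)有限公司 Spam message model training method, spam message identification method, and apparatus thereof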

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111353626A (zh) * 2018-12-21 2020-06-30 阿里巴巴集团控股有限公司 Data auditing method, apparatus, and device
CN111353626B (zh) * 2018-12-21 2023-05-26 阿里巴巴集团控股有限公司 Data auditing method, apparatus, and device

Also Published As

Publication number Publication date
TW201732655A (zh) 2017-09-16
CN107045511B (zh) 2021-03-02
CN107045511A (zh) 2017-08-15
US20200272933A1 (en) 2020-08-27

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17746888

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17746888

Country of ref document: EP

Kind code of ref document: A1