WO2017124930A1 - Feature data processing method and device - Google Patents

Feature data processing method and device

Info

Publication number
WO2017124930A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
features
auxiliary
important
hash
Prior art date
Application number
PCT/CN2017/070621
Other languages
English (en)
French (fr)
Inventor
代斌
李屾
姜晓燕
杨旭
漆远
褚崴
王少萌
付子豪
Original Assignee
阿里巴巴集团控股有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司, 代斌, 李屾, 姜晓燕, 杨旭, 漆远, 褚崴, 王少萌, 付子豪 filed Critical 阿里巴巴集团控股有限公司
Publication of WO2017124930A1 publication Critical patent/WO2017124930A1/zh
Priority to US16/038,780 priority Critical patent/US11188731B2/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition
    • G06N5/025Extracting rules from data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/18Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features
    • G06F18/2113Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/2163Partitioning the feature space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/765Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects using rules for classification or partitioning the feature space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/771Feature selection, e.g. selecting representative features from a multi-dimensional feature space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/12Fingerprints or palmprints
    • G06V40/1347Preprocessing; Feature extraction

Definitions

  • the present application relates to the field of Internet technologies, and in particular, to a feature data processing method.
  • the application also relates to a feature data processing device.
  • Data mining generally refers to the process of searching large amounts of data, by means of algorithms, for the information hidden within them. Data mining is usually related to computer science and accomplishes these goals through statistics, online analytical processing, information retrieval, machine learning, expert systems (relying on past rules of thumb), and pattern recognition.
  • FIG. 1 is a schematic diagram of pseudo code implementing feature data processing in the prior art: T data items are assigned to each worker (processing device), the data are used for optimization updates, and a designated processing end finally reduces the output results.
  • however, this scheme essentially only distributes the originally massive data across N different workers; the total amount of data processed and the features used to process it do not change, and each worker handles 1/N of the total data.
  • such a setup can cope with data volumes within a certain range, but with hundreds of billions of samples and tens of billions of features the total data volume may exceed the petabyte level, which is beyond the computing range of a typical cluster, so running time and efficiency are poor.
  • the present application provides a feature data processing method that, by introducing IV-value selection and hashing, generates a feature fingerprint for each sample to replace the original features in training. This retains the information value of the features to the greatest extent while greatly reducing the feature dimensionality, ultimately keeping the training dimensionality controllable, reducing the amount of training data, and increasing training speed.
  • the feature data processing method includes:
  • the currently existing features are respectively divided into an important feature set and an auxiliary feature set;
  • the hash feature is merged with features of the important feature set, and the merged feature is set as a fingerprint feature.
  • the information attribute value includes at least the information value IV of the feature and the information gain IG, and the currently existing features are respectively divided into an important feature set and an auxiliary feature set according to the information attribute value of each feature, specifically:
  • the feature in the auxiliary feature set is converted into a hash feature, specifically:
  • the auxiliary feature is converted into a vector containing parameters corresponding to the hash algorithm according to a preset hash algorithm.
  • the method further includes:
  • the data to be processed is trained and predicted according to the fingerprint feature.
  • the application provides a feature data processing device, including:
  • a dividing module configured to respectively divide the currently existing features into an important feature set and an auxiliary feature set according to the information attribute values of the respective features
  • a conversion module configured to convert features in the auxiliary feature set into hash features
  • a merging module configured to combine the hash feature with features in the important feature set, and set the merged feature as a fingerprint feature.
  • the information attribute value includes at least the information value IV of the feature and the information gain IG, and the dividing module further includes:
  • a setting submodule configured to set a feature that the information attribute value is greater than or equal to a preset threshold as an important feature, and set a feature that the information attribute value is smaller than the threshold as an auxiliary feature;
  • the conversion module converts the feature in the auxiliary feature set into a hash feature, specifically:
  • the auxiliary feature is converted into a vector containing parameters corresponding to the hash algorithm according to a preset hash algorithm.
  • the merging module further includes:
  • a replacement submodule configured to replace the original feature corresponding to the data to be processed with the fingerprint feature
  • a training submodule configured to perform training and prediction on the to-be-processed data according to the fingerprint feature.
  • the embodiment of the present application has at least the following advantages: the original features are divided into important features and auxiliary features; all important features are retained as they are, while the auxiliary features are processed by hashing to obtain hash values; finally, the important feature set and the hash values are merged to obtain the original features' fingerprint, on which cluster learning training and prediction are then performed. This approach reduces the feature dimensionality while retaining the feature information, thereby reducing the training data and improving computational efficiency, and thus the overall data processing efficiency.
  • FIG. 1 is a schematic diagram of a specific implementation pseudo code of feature training in the prior art
  • FIG. 2 is a schematic flowchart of a feature data processing method according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of an output form of information value calculation according to an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a hashing algorithm according to an embodiment of the present application.
  • FIG. 5 is a schematic diagram of an efficient training process according to an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a feature data processing device according to an embodiment of the present application.
  • Step 201 Divide the currently existing features into an important feature set and an auxiliary feature set according to the information attribute values of the respective features.
  • the present application divides the original features of the data into two feature sets: an important feature set and an auxiliary feature set. The features in the important feature set are all retained as they are, while the features in the auxiliary feature set, whose dimensionality is generally high, are subsequently processed to reduce their dimensionality.
  • IV (Information Value) expresses how much "information" a variable carries about the target variable, which makes feature selection simple and fast.
  • for information gain, the measure of importance is how much information a feature brings to the classification system: the more information it brings, the more important the feature. For a given feature, the information gain is therefore the difference between the amount of information when the system contains the feature and when it does not; that difference is the amount of information the feature brings to the system, namely the information gain IG (Information Gain).
  • the information attribute value should include at least the IV and IG of the feature, and the information attribute values need to be calculated for each feature before this step in order to obtain the information attribute value of each feature.
  • an IV threshold and/or an IG threshold is preset. The information attribute value of each feature, calculated by the above formulas, is compared with the preset IV threshold and/or IG threshold: if the information attribute value of a feature is greater than or equal to the preset threshold, the feature is determined to be an important feature; if it is less than the threshold, the feature is determined to be an auxiliary feature.
  • the original features are divided into two parts: an important feature set and an auxiliary feature set depending on the IV value and/or the IG value.
  • the information attribute value in the important feature set is greater than or equal to the preset threshold, and the information attribute value in the auxiliary feature set is less than the preset threshold.
  • although IV and IG are taken as examples to illustrate the division into the important feature set and the auxiliary feature set, those skilled in the art may adopt other attributes or means to achieve the same effect, all of which fall within the protection scope of this application.
  • Step 202 Convert the features in the auxiliary feature set into hash features.
  • the b-bit minwise hashing scheme may be used to convert the auxiliary feature set into a hash value identifier.
  • the preset hash algorithm is specifically shown in Figure 4.
  • the auxiliary features in the auxiliary feature set are converted into a vector of k·2^b dimensions, where k and b are parameters specified by the algorithm.
  • the embodiment of the present application proposes processing the auxiliary feature set with the b-bit minwise hashing algorithm. The b-bit minwise hashing algorithm was proposed mainly to reduce storage space and speed up computation; its accuracy decreases as b decreases.
  • the process only converts the auxiliary features in the auxiliary feature set into the hash value identifier, and does not process the features in the important feature set, and the features in the important feature set are all retained as they are.
  • Step 203 Combine the hash feature with the feature in the important feature set, and set the merged feature as a fingerprint feature.
  • the features in the important feature set and the hash values converted from the auxiliary features are merged, and the merged feature is set as the fingerprint feature.
  • the original features corresponding to the data to be processed are then replaced with fingerprint features, and the data to be processed is trained and predicted according to the fingerprint features.
  • FIG. 5 it is a schematic diagram of an efficient training process proposed in the embodiment of the present application.
  • the embodiment of the present application calculates the IV value corresponding to each feature based on the original feature data and determines whether the IV value is greater than a preset IV threshold; if so, the feature is extracted and added to the important feature set.
  • if not, the feature is extracted into the auxiliary feature set, and a hashing operation is performed on the data in the auxiliary feature set to obtain the hash value corresponding to each original auxiliary feature, that is, the hash feature corresponding to each original auxiliary feature.
  • the hash features are merged with the features in the important feature set, the merged features are used as the feature fingerprints corresponding to the original features, and finally LR training and prediction are performed on the feature fingerprints, ending the process.
  • the feature data processing device includes:
  • the dividing module 601 is configured to divide the currently existing features into an important feature set and an auxiliary feature set according to the information attribute values of the respective features;
  • a conversion module 602 configured to convert features in the auxiliary feature set into hash features
  • the merging module 603 is configured to combine the hash feature with features in the important feature set, and set the merged feature as a fingerprint feature.
  • the information attribute value includes at least the information value IV of the feature and the information gain IG, and the dividing module 601 further includes:
  • a setting submodule configured to set a feature that the information attribute value is greater than or equal to a preset threshold as an important feature, and set a feature that the information attribute value is smaller than the threshold as an auxiliary feature;
  • the conversion module 602 converts the features in the auxiliary feature set into a hash feature, specifically:
  • the auxiliary feature is converted into a vector containing parameters corresponding to the hash algorithm according to a preset hash algorithm.
  • the merging module 603 further includes:
  • a replacement submodule configured to replace the original feature corresponding to the data to be processed with the fingerprint feature
  • a training submodule configured to perform training and prediction on the to-be-processed data according to the fingerprint feature.
  • the modules of the device of the present application may be integrated into one or may be deployed separately.
  • the above modules can be combined into one module, or can be further split into multiple sub-modules.
  • the present application can be implemented by hardware, or by software plus a necessary general hardware platform.
  • the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a mobile hard disk, etc.), including several The instructions are for causing a computer device (which may be a personal computer, server, or network device, etc.) to perform the methods described in various implementation scenarios of the present application.
  • modules in the apparatus of an implementation scenario may be distributed in the apparatus of the implementation scenario as described, or may be located, with corresponding changes, in one or more apparatuses different from that of the implementation scenario.
  • the modules of the above implementation scenarios may be combined into one module, or may be further split into multiple sub-modules.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Operations Research (AREA)
  • Probability & Statistics with Applications (AREA)
  • Algebra (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Collating Specific Patterns (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A feature data processing method and device: dividing the currently existing features into an important feature set and an auxiliary feature set according to the information attribute value of each feature (S201); converting the features in the auxiliary feature set into hash features (S202); and merging the hash features with the features in the important feature set and setting the merged features as fingerprint features (S203). The method reduces the amount of training data while keeping the training dimensionality controllable, thereby improving data processing efficiency.

Description

Feature data processing method and device

TECHNICAL FIELD

The present application relates to the field of Internet technologies, and in particular to a feature data processing method. The present application also relates to a feature data processing device.

BACKGROUND

With the continuous development of the Internet, the data produced by large numbers of users of the Internet can be widely used and converted into useful information and knowledge. The information and knowledge so acquired can be applied broadly, including in business management, production control, market analysis, engineering design, and scientific exploration, so data mining technology has attracted growing attention in the information industry. Data mining generally refers to the process of searching large amounts of data, by means of algorithms, for the information hidden within them. Data mining is usually related to computer science and achieves the above goals through methods such as statistics, online analytical processing, information retrieval, machine learning, expert systems (relying on past rules of thumb), and pattern recognition.

In data mining business scenarios, it is often necessary to perform classification or regression on ultra-large-scale data using machine learning algorithms. In the current Internet environment, billions or even hundreds of billions of data records frequently need to be trained on, and as the business expands the training features also reach a staggering scale. Taking the CTR (Click-Through Rate, ad click-through rate) business as an example, the features involved in the computation may reach the scale of tens of billions. The conventional solution to such problems is parallel computation, but at the scale of tens of billions of features times hundreds of billions of records this frequently requires ultra-large computing clusters, the time needed to reach the final optimal result is very long, and the business's update requirements cannot be met.

As shown in FIG. 1, which is a schematic diagram of pseudo code implementing feature data processing in the prior art, the method assigns T data items to each worker (processing device), uses these data for optimization updates, and finally has a designated processing end reduce the output results. However, this scheme essentially only distributes the originally massive data across N different workers for computation; the total amount of data processed and the features used to process it do not change, and each worker processes 1/N of the total data. This is workable within a certain range of data volumes, but with hundreds of billions of samples and tens of billions of features the total data volume may exceed the petabyte level, which is beyond the computing range of a typical cluster, and both running time and efficiency suffer.

It can thus be seen that, when facing massive data to be processed, how to reduce the feature dimensionality while retaining the feature information, thereby reducing the training data and improving computational efficiency, has become a technical problem urgently awaiting a solution from those skilled in the art.
SUMMARY

The present application provides a feature data processing method that, by introducing IV-value selection and hashing, generates a feature fingerprint for each sample to replace the original features in training. This retains the information value of the features to the greatest extent while greatly reducing the feature dimensionality, ultimately keeping the training dimensionality controllable, reducing the amount of training data, and increasing training speed. The feature data processing method includes:

dividing the currently existing features into an important feature set and an auxiliary feature set according to the information attribute value of each feature;

converting the features in the auxiliary feature set into hash features; and

merging the hash features with the features in the important feature set, and setting the merged features as fingerprint features.

Preferably, the information attribute value includes at least the information value IV and the information gain IG of the feature, and dividing the currently existing features into an important feature set and an auxiliary feature set according to the information attribute value of each feature specifically includes:

acquiring the information attribute value of each feature;

setting features whose information attribute value is greater than or equal to a preset threshold as important features, and setting features whose information attribute value is less than the threshold as auxiliary features; and

generating the important feature set from the important features and generating the auxiliary feature set from the auxiliary features.

Preferably, converting the features in the auxiliary feature set into hash features specifically means converting the auxiliary features, according to a preset hash algorithm, into vectors containing the parameters corresponding to the hash algorithm.

Preferably, after the merged features are set as fingerprint features, the method further includes:

replacing the original features corresponding to the data to be processed with the fingerprint features; and

training and predicting on the data to be processed according to the fingerprint features.
The present application further provides a feature data processing device, including:

a dividing module configured to divide the currently existing features into an important feature set and an auxiliary feature set according to the information attribute value of each feature;

a conversion module configured to convert the features in the auxiliary feature set into hash features; and

a merging module configured to merge the hash features with the features in the important feature set and set the merged features as fingerprint features.

Preferably, the information attribute value includes at least the information value IV and the information gain IG of the feature, and the dividing module further includes:

an acquisition submodule configured to acquire the information attribute value of each feature;

a setting submodule configured to set features whose information attribute value is greater than or equal to a preset threshold as important features, and to set features whose information attribute value is less than the threshold as auxiliary features; and

a generation submodule configured to generate the important feature set from the important features and to generate the auxiliary feature set from the auxiliary features.

Preferably, the conversion module converts the features in the auxiliary feature set into hash features specifically by converting the auxiliary features, according to a preset hash algorithm, into vectors containing the parameters corresponding to the hash algorithm.

Preferably, the merging module further includes:

a replacement submodule configured to replace the original features corresponding to the data to be processed with the fingerprint features; and

a training submodule configured to train and predict on the data to be processed according to the fingerprint features.

Compared with the prior art, the embodiments of the present application have at least the following advantages: the original features are divided into important features and auxiliary features; all important features are retained as they are, while the auxiliary features are processed by hashing to obtain hash values; finally, the important feature set and the hash values are merged to obtain the original features' fingerprint, on which cluster learning training and prediction are then performed. This approach reduces the feature dimensionality while retaining the feature information, thereby reducing the training data, improving computational efficiency, and thus improving data processing efficiency.
BRIEF DESCRIPTION OF THE DRAWINGS

To describe the technical solutions of the present application more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Evidently, the drawings described below are only some embodiments of the present application, and those of ordinary skill in the art may derive other drawings from them without creative effort.

FIG. 1 is a schematic diagram of pseudo code for a prior-art implementation of feature training;

FIG. 2 is a schematic flowchart of a feature data processing method according to an embodiment of the present application;

FIG. 3 is a schematic diagram of the output form of information value calculation according to an embodiment of the present application;

FIG. 4 is a schematic diagram of the hashing algorithm according to an embodiment of the present application;

FIG. 5 is a schematic diagram of an efficient training process according to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a feature data processing device according to an embodiment of the present application.
DETAILED DESCRIPTION

To further elaborate the technical idea of the present application, the technical solutions of the present application are described below with reference to specific application scenarios. Evidently, the described embodiments are only a part, not all, of the embodiments of the present application; all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.

As shown in FIG. 2, a feature data processing method provided by an embodiment of the present application specifically includes the following steps:

Step 201: Divide the currently existing features into an important feature set and an auxiliary feature set according to the information attribute value of each feature.

To simplify the data features, the present application divides the original features of the data into two feature sets: an important feature set and an auxiliary feature set. The features in the important feature set are all retained as they are, while the features in the auxiliary feature set, whose dimensionality is generally high, are subsequently processed to reduce their dimensionality.

With data interfaces multiplying, datasets contain more and more original and derived variables, so the information value IV (Information Value) is very important in practical data applications. IV expresses how much "information" each variable carries about the target variable, which makes feature selection simple and fast.

Any feature selection quantifies the importance of features before selecting them, and how features are quantified is the biggest difference between the various methods. In information gain, the measure of importance is how much information a feature brings to the classification system: the more information it brings, the more important the feature. For a given feature, the information gain is therefore the difference between the amount of information when the system contains the feature and when it does not; that difference is the amount of information the feature brings to the system, namely the information gain IG (Information Gain).

Since IV and IG are two attributes closely related to features, in a preferred embodiment of the present application the information attribute value should include at least the IV and IG of a feature, and before this step the feature attribute values need to be calculated for each feature in order to obtain the information attribute value of each feature.
Taking binary classification as an example, in a specific embodiment of the present application the information attribute values are calculated by the following formulas:

Figure PCTCN2017070621-appb-000001

Figure PCTCN2017070621-appb-000002

IV = ∑MIV

The corresponding output form in this embodiment is shown in FIG. 3, where the col column holds the feature name, IV is the information value, and IG is the information gain.
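The two formula images above are not reproduced in this text. In the standard binary (good/bad) formulation of information value, which is assumed here since the images are unavailable, the per-bin quantities consistent with IV = ∑MIV are usually defined as:

```latex
% Standard per-bin IV definitions for binary classification (assumed
% rendering of the patent's formula images, not a verbatim copy).
\begin{align}
\mathrm{WOE}_i &= \ln\frac{p_i}{q_i},
  \qquad p_i = \frac{\#\{\text{positives in bin } i\}}{\#\{\text{positives}\}},
  \quad q_i = \frac{\#\{\text{negatives in bin } i\}}{\#\{\text{negatives}\}} \\
\mathrm{MIV}_i &= (p_i - q_i)\,\mathrm{WOE}_i \\
\mathrm{IV} &= \sum_i \mathrm{MIV}_i
\end{align}
```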
On this basis, in order to divide each feature into important or auxiliary, the embodiment of the present application presets an IV threshold and/or an IG threshold. The information attribute value of each feature, calculated by the above formulas, is compared with the preset IV threshold and/or IG threshold: if the information attribute value of a feature is greater than or equal to the preset threshold, the feature is determined to be an important feature; if it is less than the threshold, the feature is determined to be an auxiliary feature.

According to the important and auxiliary features so determined, the original features are divided, based on the IV value and/or IG value, into two parts: the important feature set and the auxiliary feature set. The information attribute values in the important feature set are all greater than or equal to the preset threshold, and those in the auxiliary feature set are less than the preset threshold.
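As a concrete illustration of this split, the following sketch computes an IV per feature from binned positive/negative label counts and divides the features by a threshold. The binning, counts, and threshold value are illustrative assumptions, not values taken from the patent.

```python
import math

def information_value(pos_counts, neg_counts):
    """IV of one feature from per-bin positive/negative counts:
    IV = sum_i (p_i - q_i) * ln(p_i / q_i)."""
    total_pos, total_neg = sum(pos_counts), sum(neg_counts)
    iv = 0.0
    for pos, neg in zip(pos_counts, neg_counts):
        p, q = pos / total_pos, neg / total_neg
        if p > 0 and q > 0:          # skip empty bins to avoid ln(0)
            iv += (p - q) * math.log(p / q)
    return iv

def split_features(iv_by_feature, threshold):
    """Features with IV >= threshold are important; the rest are auxiliary."""
    important = {f for f, iv in iv_by_feature.items() if iv >= threshold}
    auxiliary = set(iv_by_feature) - important
    return important, auxiliary

ivs = {
    "age":   information_value([30, 70], [70, 30]),   # informative feature
    "noise": information_value([50, 50], [50, 50]),   # uninformative feature
}
important, auxiliary = split_features(ivs, threshold=0.1)
```

Here `age` (IV ≈ 0.68) lands in the important set, while `noise` (IV = 0, identical class distributions in every bin) falls into the auxiliary set.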
It should be noted that, although the above preferred embodiment takes IV and IG as examples to illustrate the division into the important feature set and the auxiliary feature set, those skilled in the art may, on this basis, adopt other attributes or means to achieve the same effect, all of which fall within the protection scope of the present application.
Step 202: Convert the features in the auxiliary feature set into hash features.

As stated for S201, since the auxiliary feature set may be large, in a preferred embodiment of the present application the b-bit minwise hashing algorithm may be used to convert the auxiliary feature set into hash value identifiers. The preset hash algorithm is shown in FIG. 4.

Through the above algorithm, the auxiliary features in the auxiliary feature set are converted into a vector of k·2^b dimensions, where k and b are parameters specified by the algorithm. The embodiment of the present application proposes processing the auxiliary feature set with b-bit minwise hashing; the b-bit minwise hashing algorithm is widely used in information retrieval over massive data and was proposed mainly to reduce storage space and speed up computation, with accuracy decreasing as b decreases. Processing the auxiliary feature set with b-bit minwise hashing shrinks b = 64 down to b bits, reducing storage space and computation time.

It should be noted that this process only converts the auxiliary features in the auxiliary feature set into hash value identifiers; it does not process the features in the important feature set, which are all retained as they are.
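A minimal sketch of the b-bit minwise hashing idea follows, with small k and b for readability. The patent's FIG. 4 algorithm itself is not reproduced here, and the linear hash functions below are illustrative assumptions: each sample's auxiliary features, viewed as a non-empty set of integer feature ids, are mapped to a k·2^b-dimensional binary vector.

```python
import random

def b_bit_minwise_vector(feature_ids, k=4, b=2, seed=7):
    """Map a non-empty set of auxiliary feature ids to a k * 2**b binary vector.

    For each of k random hash functions, take the minimum hash value over
    the set, keep only its lowest b bits, and one-hot encode that value.
    """
    prime = 2_147_483_647                      # large prime modulus
    rng = random.Random(seed)
    hashes = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(k)]
    vec = [0] * (k * 2 ** b)
    for i, (a, c) in enumerate(hashes):
        min_hash = min((a * x + c) % prime for x in feature_ids)
        low_bits = min_hash & ((1 << b) - 1)   # keep the lowest b bits
        vec[i * 2 ** b + low_bits] = 1
    return vec

vec = b_bit_minwise_vector({101, 205, 999}, k=4, b=2)
```

Each of the k blocks of length 2^b contains exactly one 1, so similar auxiliary feature sets tend to collide in the same slots; reducing b shrinks the vector at some cost in accuracy, as the text notes.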
Step 203: Merge the hash features with the features in the important feature set, and set the merged features as fingerprint features.

In a preferred embodiment of the present application, the features in the important feature set are merged with the hash values converted from the auxiliary features, and the merged features are set as fingerprint features. The original features corresponding to the data to be processed are then replaced with the fingerprint features, and the data to be processed is trained on and predicted according to the fingerprint features.
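The merge step can be sketched as concatenating the retained important-feature values with the hash vector to form the fingerprint that replaces the sample's original features. The dict-based sample representation and fixed feature ordering below are assumptions for illustration only.

```python
def make_fingerprint(sample, important_features, hash_vector):
    """Fingerprint = important-feature values (in a fixed order) + hash vector."""
    retained = [sample.get(f, 0) for f in sorted(important_features)]
    return retained + hash_vector

sample = {"age": 35, "clicks": 2, "rare_tag_901": 1}   # original features
fingerprint = make_fingerprint(sample, {"age", "clicks"}, [0, 1, 0, 0])
```

The fingerprint keeps the two important values as-is and appends the 4-dimensional hash vector, so downstream training sees a short fixed-length vector instead of the full original feature space.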
FIG. 5 is a schematic diagram of an efficient training process proposed in an embodiment of the present application.

Taking the IV value as an example, the embodiment of the present application calculates, on the basis of the original feature data, the IV value corresponding to each feature and determines whether it is greater than a preset IV threshold. If so, the feature is extracted and added to the important feature set; if not, the feature is extracted into the auxiliary feature set, and a hashing operation is performed on the data in the auxiliary feature set to obtain the hash value corresponding to each original auxiliary feature, that is, its hash feature. The computed hash features are merged with the features in the important feature set, and the merged features serve as the feature fingerprints corresponding to the original features. Finally, LR training and prediction are performed on the feature fingerprints, and the process ends.

Training feature data with the above efficient training approach, taking 100 million features as an example with K = 200, B = 12, and the top 10,000 features kept as important, yields a dimensionality reduction ratio of about 0.008292 once the fingerprint features are generated; the features and data amount to less than 1% of the original data volume. With these changes, on 100 million features the AUC improves by 2% relative to full-scale training, while the data size is 1% of the original volume.
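The quoted reduction ratio can be checked arithmetically: keeping the top 10,000 features and hashing the rest with K = 200 and B = 12 gives a fingerprint of 10,000 + 200·2^12 = 829,200 dimensions, and 829,200 / 100,000,000 = 0.008292.

```python
k, b = 200, 12
n_important = 10_000
n_original = 100_000_000                    # 100 million original features
fingerprint_dim = n_important + k * 2 ** b  # 10,000 + 819,200
ratio = fingerprint_dim / n_original
print(fingerprint_dim, ratio)               # 829200 0.008292
```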
Based on the same inventive concept as the above method, an embodiment of the present application further provides a feature data processing device. As shown in FIG. 6, the feature data processing device includes:

a dividing module 601 configured to divide the currently existing features into an important feature set and an auxiliary feature set according to the information attribute value of each feature;

a conversion module 602 configured to convert the features in the auxiliary feature set into hash features; and

a merging module 603 configured to merge the hash features with the features in the important feature set and set the merged features as fingerprint features.

Preferably, the information attribute value includes at least the information value IV and the information gain IG of the feature, and the dividing module 601 further includes:

an acquisition submodule configured to acquire the information attribute value of each feature;

a setting submodule configured to set features whose information attribute value is greater than or equal to a preset threshold as important features, and to set features whose information attribute value is less than the threshold as auxiliary features; and

a generation submodule configured to generate the important feature set from the important features and to generate the auxiliary feature set from the auxiliary features.

Preferably, the conversion module 602 converts the features in the auxiliary feature set into hash features specifically by converting the auxiliary features, according to a preset hash algorithm, into vectors containing the parameters corresponding to the hash algorithm.

Preferably, the merging module 603 further includes:

a replacement submodule configured to replace the original features corresponding to the data to be processed with the fingerprint features; and

a training submodule configured to train and predict on the data to be processed according to the fingerprint features.

The modules of the device of the present application may be integrated into one unit or deployed separately; the above modules may be combined into one module or further split into multiple submodules.
From the description of the above embodiments, those skilled in the art can clearly understand that the present application can be implemented by hardware, or by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, or removable hard disk) and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the various implementation scenarios of the present application.

Those skilled in the art can understand that the drawings are merely schematic diagrams of a preferred implementation scenario, and the modules or processes in the drawings are not necessarily required for implementing the present application.

Those skilled in the art can understand that the modules in the apparatus of an implementation scenario may be distributed in the apparatus of the implementation scenario as described, or may be located, with corresponding changes, in one or more apparatuses different from that of the implementation scenario. The modules of the above implementation scenarios may be combined into one module or further split into multiple submodules.

The above serial numbers of the present application are for description only and do not represent the superiority or inferiority of the implementation scenarios.

The above disclosure is only several specific implementation scenarios of the present application; however, the present application is not limited thereto, and any variation conceivable to those skilled in the art shall fall within the protection scope of the present application.

Claims (8)

  1. A feature data processing method, comprising:
    dividing the currently existing features into an important feature set and an auxiliary feature set according to the information attribute value of each feature;
    converting the features in the auxiliary feature set into hash features; and
    merging the hash features with the features in the important feature set, and setting the merged features as fingerprint features.
  2. The method according to claim 1, wherein the information attribute value comprises at least the information value IV and the information gain IG of the feature, and dividing the currently existing features into an important feature set and an auxiliary feature set according to the information attribute value of each feature specifically comprises:
    acquiring the information attribute value of each of the features;
    setting features whose information attribute value is greater than or equal to a preset threshold as important features, and setting features whose information attribute value is less than the threshold as auxiliary features; and
    generating the important feature set from the important features among the features, and generating the auxiliary feature set from the auxiliary features among the features.
  3. The method according to claim 2, wherein converting the features in the auxiliary feature set into hash features specifically comprises:
    converting the auxiliary features, according to a preset hash algorithm, into vectors containing the parameters corresponding to the hash algorithm.
  4. The method according to claim 1, further comprising, after the merged features are set as fingerprint features:
    replacing the original features corresponding to the data to be processed with the fingerprint features; and
    training and predicting on the data to be processed according to the fingerprint features.
  5. A feature data processing device, comprising:
    a dividing module that divides the currently existing features into an important feature set and an auxiliary feature set according to the information attribute value of each feature;
    a conversion module that converts the features in the auxiliary feature set into hash features; and
    a merging module that merges the hash features with the features in the important feature set and sets the merged features as fingerprint features.
  6. The device according to claim 5, wherein the information attribute value comprises at least the information value IV and the information gain IG of the feature, and the dividing module further comprises:
    an acquisition submodule that acquires the information attribute value of each of the features;
    a setting submodule that sets features whose information attribute value is greater than or equal to a preset threshold as important features, and sets features whose information attribute value is less than the threshold as auxiliary features; and
    a generation submodule that generates the important feature set from the important features among the features, and generates the auxiliary feature set from the auxiliary features among the features.
  7. The device according to claim 5, wherein the conversion module converts the features in the auxiliary feature set into hash features specifically by:
    converting the auxiliary features, according to a preset hash algorithm, into vectors containing the parameters corresponding to the hash algorithm.
  8. The device according to claim 5, wherein the merging module further comprises:
    a replacement submodule that replaces the original features corresponding to the data to be processed with the fingerprint features; and
    a training submodule that trains and predicts on the data to be processed according to the fingerprint features.
PCT/CN2017/070621 2016-01-18 2017-01-09 Feature data processing method and device WO2017124930A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/038,780 US11188731B2 (en) 2016-01-18 2018-07-18 Feature data processing method and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610031819.3 2016-01-18
CN201610031819.3A CN106980900A (zh) Feature data processing method and device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/038,780 Continuation US11188731B2 (en) 2016-01-18 2018-07-18 Feature data processing method and device

Publications (1)

Publication Number Publication Date
WO2017124930A1 true WO2017124930A1 (zh) 2017-07-27

Family

ID=59340545

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/070621 WO2017124930A1 (zh) 2016-01-18 2017-01-09 一种特征数据处理方法及设备

Country Status (4)

Country Link
US (1) US11188731B2 (zh)
CN (1) CN106980900A (zh)
TW (1) TW201730788A (zh)
WO (1) WO2017124930A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111401464A (zh) * 2020-03-25 2020-07-10 北京字节跳动网络技术有限公司 Classification method and apparatus, electronic device, and computer-readable storage medium
US11188731B2 (en) 2016-01-18 2021-11-30 Alibaba Group Holding Limited Feature data processing method and device

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE202017007517U1 (de) * 2016-08-11 2022-05-03 Twitter, Inc. Aggregatmerkmale für maschinelles Lernen
CN110210506B (zh) * 2018-04-04 2023-10-20 腾讯科技(深圳)有限公司 Feature processing method and apparatus based on big data, and computer device
CN109840274B (zh) * 2018-12-28 2021-11-30 北京百度网讯科技有限公司 Data processing method and apparatus, and storage medium
CN111738297A (zh) * 2020-05-26 2020-10-02 平安科技(深圳)有限公司 Feature selection method, apparatus, device, and storage medium
CN112399448B (zh) * 2020-11-18 2024-01-09 中国联合网络通信集团有限公司 Wireless communication optimization method and apparatus, electronic device, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060069678A1 (en) * 2004-09-30 2006-03-30 Wu Chou Method and apparatus for text classification using minimum classification error to train generalized linear classifier
CN103324610A (zh) * 2013-06-09 2013-09-25 苏州大学 Sample training method and apparatus applied to mobile devices
CN103679160A (zh) * 2014-01-03 2014-03-26 苏州大学 Face recognition method and apparatus
CN104298791A (zh) * 2014-11-19 2015-01-21 中国石油大学(华东) Fast image retrieval method based on integrated hash coding
CN104636493A (zh) * 2015-03-04 2015-05-20 浪潮电子信息产业股份有限公司 Dynamic data grading method based on multi-classifier fusion

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060104484A1 (en) * 2004-11-16 2006-05-18 Bolle Rudolf M Fingerprint biometric machine representations based on triangles
US8488901B2 (en) * 2007-09-28 2013-07-16 Sony Corporation Content based adjustment of an image
US8656182B2 (en) * 2011-09-12 2014-02-18 Microsoft Corporation Security mechanism for developmental operating systems
US9245191B2 (en) * 2013-09-05 2016-01-26 Ebay, Inc. System and method for scene text recognition
US10254383B2 (en) * 2013-12-06 2019-04-09 Digimarc Corporation Mobile device indoor navigation
KR102475820B1 (ko) * 2015-07-07 2022-12-08 삼성메디슨 주식회사 의료 영상 처리 장치 및 그 동작방법
CN106980900A (zh) 2016-01-18 2017-07-25 阿里巴巴集团控股有限公司 Feature data processing method and device



Also Published As

Publication number Publication date
US20180341801A1 (en) 2018-11-29
TW201730788A (zh) 2017-09-01
US11188731B2 (en) 2021-11-30
CN106980900A (zh) 2017-07-25

Similar Documents

Publication Publication Date Title
WO2017124930A1 (zh) Feature data processing method and device
Cai et al. Yolobile: Real-time object detection on mobile devices via compression-compilation co-design
JP6726246B2 (ja) Method and apparatus for performing operations in a convolutional neural network, and non-transitory storage medium
US11741361B2 (en) Machine learning-based network model building method and apparatus
US20180046896A1 (en) Method and device for quantizing complex artificial neural network
WO2020168796A1 (zh) Data augmentation method based on high-dimensional space sampling
CN111984414B (zh) Data processing method, system, device, and readable storage medium
JPWO2017090475A1 (ja) Information processing system, function creation method, and function creation program
JP2022050622A (ja) Domain phrase mining method, apparatus, and electronic device
WO2019198618A1 (ja) Word vector modification device, method, and program
CN106326005B (zh) Automatic parameter tuning method for iterative MapReduce jobs
CN103824075A (zh) Image recognition system and method
CN104731891A (zh) Method for extracting massive data in ETL
TW202001701A (zh) Image quantization method and neural network training method
US9443168B1 (en) Object detection approach using an ensemble strong classifier
CN107193979B (zh) Method for homologous image retrieval
CN105573726B (zh) Rule processing method and device
CN106611012A (zh) Real-time retrieval method for heterogeneous data in a big data environment
CN113032367A (zh) Cross-layer configuration parameter collaborative tuning method and system for big data systems under dynamic load scenarios
CN116976461A (zh) Federated learning method, apparatus, device, and medium
CN108021935B (zh) Dimension reduction method and apparatus based on big data technology
WO2016086731A1 (zh) Multi-level parallel key frame cloud extraction method and system
CN103678545A (zh) Method and apparatus for clustering network resources
CN108717444A (zh) Big data clustering method and apparatus based on distributed architecture
Wang et al. An adaptively disperse centroids k-means algorithm based on mapreduce model

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17740961

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17740961

Country of ref document: EP

Kind code of ref document: A1