CN105183914A

CN105183914A - Data characteristic formatting method and device

Info

Publication number: CN105183914A
Application number: CN201510660660.7A
Authority: CN
Inventors: 章岑; 杨田; 雷龙艳; 周盛; 潘柏宇; 王冀
Original assignee: 1Verge Internet Technology Beijing Co Ltd
Current assignee: 1Verge Internet Technology Beijing Co Ltd
Priority date: 2015-10-14
Filing date: 2015-10-14
Publication date: 2015-12-23

Abstract

The invention relates to the technical field of data mining and discloses a data feature formatting method and device. The method comprises the following steps: acquiring a first configuration file, and determining, according to the switch settings in the first configuration file, the attributes to be processed in the current formatting and the formatting order of each attribute; acquiring a second configuration file, and determining, according to the feature configuration of each attribute in the second configuration file, the feature sequence and feature value meanings of the features to be formatted within the attribute; determining the feature sequence number of each feature according to the formatting order of the attributes and the feature sequence of the features to be formatted within each attribute; determining the feature value of the corresponding feature according to the attribute values of an actual sample and the feature value meanings; and formatting the actual samples into feature vectors according to the feature sequence numbers and feature values. With this technical scheme, no fixed order has to be established in advance to assign a feature sequence number to every feature, the attributes and features to be processed can be added or deleted at any time, and the efficiency of feature formatting can be greatly improved.

Description

Data characteristics formatting method and device

Technical field

The present invention relates to the technical field of data mining, and in particular to a data feature formatting method and device.

Background technology

In a networked big-data environment, the main task of data mining is to discover the common features of data within massive amounts of information so that the data can be counted and analyzed. Mining big data by hand is clearly impractical, while data mining performed by machines has inherent weaknesses in recognition rate; the prior art therefore improves the recognition rate of automatic mining mainly through machine learning based on model training. In machine learning, it is often necessary to extract features from raw data to represent a sample, and then to express the feature set of each sample in a form that the algorithm can recognize, so that the algorithm can read the sample features for model training.

At present, existing machine learning libraries such as libsvm, xgboost, and Spark MLlib all format training data in a consensus format. In the consensus format, sequence numbers are first assigned to all features, and each feature of a sample is then recorded numerically as a "feature sequence number: feature value" pair. To save space, usually only the features whose feature value is not 0 are stored; correspondingly, the sequence number and meaning of each feature must be fixed, so that the real meaning of a feature can be determined from its sequence number.
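The sparse "feature sequence number: feature value" record described above can be illustrated with a minimal Python sketch. The function names `to_sparse_record` and `from_sparse_record` are hypothetical helpers written for this illustration and are not part of libsvm, xgboost, or Spark MLlib; sequence numbers are assumed to start at 1.

```python
def to_sparse_record(dense):
    """Encode a dense list of feature values as 'number:value' pairs,
    keeping only non-zero entries (sequence numbers start at 1)."""
    return " ".join(f"{i}:{v}" for i, v in enumerate(dense, start=1) if v != 0)


def from_sparse_record(record, dimension):
    """Decode a sparse record back into a dense list of the given dimension.
    This only works if both sides agree on what each sequence number means."""
    dense = [0.0] * dimension
    for pair in record.split():
        number, value = pair.split(":")
        dense[int(number) - 1] = float(value)
    return dense


record = to_sparse_record([0.0, 1.0, 0.0, 0.0, 2.5])
print(record)  # 2:1.0 5:2.5
```

The decoder needs the total dimension and the fixed meaning of every sequence number, which is exactly the prior agreement that the consensus format depends on.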

In practical engineering, however, the feature space has a very large number of dimensions (hundreds or thousands are common, and even trillion-dimensional feature spaces occur), so establishing a fixed order for the features of every sample before formatting is very difficult. Features may also be added or deleted at any time during actual data processing. Determining features with the consensus format of the prior art therefore consumes a great deal of time and effort, and formatting features efficiently remains a difficult problem.

Summary of the invention

In view of the defects of the prior art, the object of the present invention is to provide a data feature formatting method and device that format data features efficiently.

According to one aspect of the present invention, a data feature formatting method is provided, comprising the steps of:

acquiring a first configuration file, and determining, according to the switch settings in the first configuration file, the attributes to be processed in the current formatting and the formatting order of each attribute;

acquiring a second configuration file, and determining, according to the feature configuration of each attribute in the second configuration file, the feature sequence and feature value meanings of the features to be formatted within the attribute;

determining the feature sequence number of each feature according to the formatting order of the attributes and the feature sequence of the features to be formatted within each attribute, and determining the feature value of the corresponding feature according to the attribute values of an actual sample and the feature value meanings;

formatting each actual sample into a feature vector according to the feature sequence numbers and the feature values.

Preferably, the switch settings comprise attribute switch flags or attribute record status; the formatting order follows the natural attribute order of the raw data of the actual samples, or is freely specified according to the needs of model training.

Preferably, the feature configuration comprises a discretization switch and the formatting mode of the attribute.

Preferably, the discretization switch and the formatting mode of the attribute are set freely according to the requirements of the algorithm model used for model training.

Preferably, only the features whose feature value is not 0 are selected for storage in the feature vector.

According to another aspect of the present invention, a data feature formatting device is also provided, comprising:

a first configuration module, configured to acquire a first configuration file and to determine, according to the switch settings in the first configuration file, the attributes to be processed in the current formatting and the formatting order of each attribute;

a second configuration module, configured to acquire a second configuration file and to determine, according to the feature configuration of each attribute in the second configuration file, the feature sequence and feature value meanings of the features to be formatted within the attribute;

a feature processing module, configured to determine the feature sequence number of each feature according to the formatting order of the attributes and the feature sequence of the features to be formatted within each attribute, and to determine the feature value of the corresponding feature according to the attribute values of an actual sample and the feature value meanings;

a formatting module, configured to format each actual sample into a feature vector according to the feature sequence numbers and the feature values.

Preferably, the first configuration module comprises:

an attribute switch module, configured to determine the attributes to be processed in the current formatting according to attribute switch flags or attribute record status;

an attribute order module, configured to determine the formatting order of each attribute according to the natural attribute order of the raw data of the actual samples or according to an order freely specified for the needs of model training.

Preferably, the second configuration module comprises:

a discretization switch module, configured to determine, according to the discretization switch, whether discretization is needed;

a format configuration module, configured to configure the formatting mode of the attribute.

Preferably, the discretization switch module and the format configuration module are set freely according to the requirements of the algorithm model used for model training.

Preferably, the formatting module comprises a vector processing module, configured to select only the features whose feature value is not 0 to generate and store the feature vector.

Embodiments of the present invention provide a data feature formatting method and device. Through a two-stage configuration, the technical scheme allows the attributes to be processed and their feature representation to be set freely, so that feature formatting and model training can be performed on demand. Because the technical scheme does not require a fixed order to be established in advance to fix the feature sequence number of each feature, and the attributes and features to be processed can be added or deleted at any time, the efficiency of feature formatting can be improved significantly.

Brief description of the drawings

Fig. 1 is a schematic flowchart of the data feature formatting method in one embodiment of the present invention;

Fig. 2 is a schematic diagram of the module structure of the data feature formatting device in one embodiment of the present invention.

Detailed description of the embodiments

To make the objects, technical schemes, and advantages of the present invention clearer, the present invention is described in more detail below in conjunction with embodiments and with reference to the accompanying drawings. It should be understood that these descriptions are only exemplary and are not intended to limit the scope of the invention. In addition, descriptions of well-known structures and techniques are omitted below to avoid unnecessarily obscuring the concepts of the present invention.

Data feature formatting is often an indispensable step in model training: only data that has undergone feature formatting can be recognized, classified, and analyzed quickly and efficiently during model training. Feature formatting in the prior art is mainly based on the consensus format. As the name implies, using the consensus format presupposes that a consensus has been reached on all features; that is, before use, all features must be identified and a sequence number assigned to each. This places great pressure on feature formatting and seriously reduces its efficiency.

In embodiments of the present invention, feature configuration files are used to assist in identifying features and to determine how features and their attributes are applied, so that features can be selected flexibly and formatted relatively freely, which improves the efficiency of data feature formatting. As shown in Fig. 1, in an embodiment of the present invention, the data feature formatting method comprises the steps of:

S1, acquiring a first configuration file, and determining, according to the switch settings in the first configuration file, the attributes to be processed in the current formatting and the formatting order of each attribute;

S2, acquiring a second configuration file, and determining, according to the feature configuration of each attribute in the second configuration file, the feature sequence and feature value meanings of the features to be formatted within the attribute;

S3, determining the feature sequence number of each feature according to the formatting order of the attributes and the feature sequence of the features to be formatted within each attribute, and determining the feature value of the corresponding feature according to the attribute values of an actual sample and the feature value meanings;

S4, formatting each actual sample into a feature vector according to the feature sequence numbers and the feature values.

Specifically, in an embodiment of the present invention, multiple actual samples need to be formatted into multiple feature vectors respectively. The raw data of each actual sample is represented by multiple attributes with specific attribute values; for example, the raw data of the sample "user A" is "gender: male; age: 24; client type: PC". Each feature vector comprises multiple numeric entries of the form "feature sequence number: feature value"; for example, the feature vector of the sample "user A" after formatting may be "2:1.0 6:1.0 13:1.0". Formatting one sample requires converting its raw data into this numeric representation, and formatting all samples uniformly requires determining a unified conversion scheme.

First, in step S1, the first configuration file is preferably a feature switch configuration file, which provides switches for the attributes in a sample that need to be processed. The first configuration file may contain switches for all attributes: for example, during initialization, the switch flag of each attribute to be processed in the current formatting is set to the on state (for example, set to 1), and the switch flag of each attribute that does not need to be processed is set to the off state (for example, set to 0). Alternatively, the file may record only the attributes to be processed in the current formatting, with unrecorded attributes regarded as not needing processing. The first configuration file also specifies the formatting order of the attributes, and the features are arranged into the feature vector in this order during formatting. The formatting order may follow the natural attribute order of the sample raw data, or may be freely specified according to the needs of model training.
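Step S1 can be sketched as follows. The patent does not specify a file syntax, so the "attribute=switch" line format, the `FIRST_CONFIG` contents, and the function name `enabled_attributes` are all assumptions made for illustration; the order of the lines is taken as the formatting order.

```python
# Hypothetical first configuration file: one "attribute=switch" entry per line,
# listed in the desired formatting order. 1 = process, 0 = skip.
FIRST_CONFIG = """\
gender=1
age=1
video_duration=0
client_type=1
"""


def enabled_attributes(config_text):
    """Return the attributes that are switched on, in formatting order."""
    ordered = []
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, _, switch = line.partition("=")
        # A line without a switch value is treated as "on", which covers the
        # variant where the file records only the attributes to be processed.
        if switch.strip() in ("", "1"):
            ordered.append(name.strip())
    return ordered


print(enabled_attributes(FIRST_CONFIG))  # ['gender', 'age', 'client_type']
```

Switching an attribute off, or removing its line, drops it from the result without touching any other entry.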

In step S2, the second configuration file is preferably the feature configuration file of each attribute. The feature configuration file first specifies whether the feature of the attribute needs to be discretized (for example, by setting the discretization switch to 1). When discretization is needed, it further specifies the formatting mode of the attribute: the number of feature dimensions, the meaning of the feature corresponding to each attribute value, and the internal order of the corresponding features. For example, the feature configuration file of the "gender" attribute first specifies that the feature needs to be discretized, and then specifies that the discretized feature occupies 3 dimensions, where 0 represents female, 1 represents male, and 2 represents unknown; when the feature vector is generated, the dimension corresponding to the sample's actual attribute value is set to 1. If discretization is not needed (for example, the discretization switch is set to 0), the formatting mode of the attribute is: the feature has only 1 dimension (the feature sequence is 0 or the default), and the attribute value is the actual feature value. For example, when the "age" attribute is not discretized, the feature value of "age: 24" is "24". If discretization is needed, suppose further that the discretized attribute occupies 8 dimensions, where 0 means the age cannot be segmented, 1 is under 18, 2 is 18-24, 3 is 25-29, 4 is 30-34, 5 is 35-39, 6 is 40-49, and 7 is 50 and over; the feature value of "age: 24" is then represented by setting dimension 2 (that is, the 3rd dimension) to 1.
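The per-attribute feature configuration of step S2 can be sketched as a small lookup. The dictionary layout of `SECOND_CONFIG` and the function `local_feature` are hypothetical; the dimension meanings follow the "gender" and "age" examples in the text. Mapping an unrecognized gender value to the "unknown" dimension and assuming integer ages are choices made for this sketch, not something the source states.

```python
# Hypothetical second configuration: one entry per attribute.
SECOND_CONFIG = {
    "gender": {
        "discretize": 1,
        "values": {"female": 0, "male": 1, "unknown": 2},  # 3 dimensions
    },
    "age": {
        "discretize": 1,
        # (lower bound, inclusive upper bound or None, dimension); 8 dimensions,
        # with dimension 0 reserved for ages that cannot be segmented.
        "ranges": [(0, 17, 1), (18, 24, 2), (25, 29, 3), (30, 34, 4),
                   (35, 39, 5), (40, 49, 6), (50, None, 7)],
    },
}


def local_feature(cfg, raw):
    """Return (dimension within the attribute, feature value) for a raw value."""
    if not cfg["discretize"]:
        return 0, float(raw)  # single dimension, attribute value kept as is
    if "values" in cfg:  # categorical attribute
        return cfg["values"].get(raw, cfg["values"]["unknown"]), 1.0
    if isinstance(raw, (int, float)):  # numeric attribute split into ranges
        for low, high, dim in cfg["ranges"]:
            if low <= raw and (high is None or raw <= high):
                return dim, 1.0
    return 0, 1.0  # dimension 0: cannot be segmented


print(local_feature(SECOND_CONFIG["age"], 24))  # (2, 1.0)
print(local_feature({"discretize": 0}, 24))     # (0, 24.0)
```

The returned dimension is local to the attribute; its position in the full feature vector is only fixed once the attribute order is known.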

In step S3, feature sequence numbers are assigned successively according to the formatting order of the attributes and the feature sequence within each attribute; at the same time, each attribute value is associated with the feature value of a specific feature sequence number according to the correspondence between attribute values and feature value meanings. For example, suppose the attribute formatting order is "gender" → "age" → "client type" and all three attributes are discretized; "gender" and "age" are discretized as described above, and "client type" is discretized into 3 dimensions, where 0 is the mobile app client, 1 is the PC client, and 2 is unknown. Then, in the feature vector, dimensions 1-3 hold the features of the "gender" attribute and are assigned feature sequence numbers 1-3; dimensions 4-11 hold the features of the "age" attribute and are assigned feature sequence numbers 4-11; and dimensions 12-14 hold the features of the "client type" attribute and are assigned feature sequence numbers 12-14. A feature value of 1 at a given feature sequence number indicates that the actual attribute value matches that feature sequence number/dimension.
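The sequence number assignment of step S3 amounts to a running offset over the attributes in formatting order. The following sketch assumes the dimension counts from the example above; the function name `assign_offsets` is hypothetical.

```python
def assign_offsets(ordered_attributes, dims):
    """Map each attribute to the first feature sequence number it occupies.

    Sequence numbers start at 1. Returns the offsets and the total dimension.
    """
    offsets = {}
    next_number = 1
    for name in ordered_attributes:
        offsets[name] = next_number
        next_number += dims[name]
    return offsets, next_number - 1


dims = {"gender": 3, "age": 8, "client_type": 3}

offsets, total = assign_offsets(["gender", "age", "client_type"], dims)
print(offsets, total)  # {'gender': 1, 'age': 4, 'client_type': 12} 14

# Dropping "age" from the attribute list renumbers the rest automatically.
offsets, total = assign_offsets(["gender", "client_type"], dims)
print(offsets, total)  # {'gender': 1, 'client_type': 4} 6
```

Because the numbers are derived from the configuration on each run, no global feature numbering has to be maintained by hand when attributes are added or removed.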

In step S4, each sample is formatted into a feature vector in the manner described above. For the "user A" sample above, the attribute value "gender: male" sets the feature value of the 2nd dimension (feature sequence number 2) to 1, the attribute value "age: 24" sets the feature value of the 6th dimension (feature sequence number 6) to 1, and the attribute value "client type: PC" sets the feature value of the 13th dimension (feature sequence number 13) to 1. Storing only the features whose feature value is not 0, the feature vector of the formatted "user A" sample is expressed as "2:1.0 6:1.0 13:1.0".
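The "user A" example can be reproduced end to end with a short sketch. All names here (`ORDER`, `DIMS`, `age_dimension`, `format_sample`, and the sample keys) are hypothetical, and the configuration is hard-coded rather than read from files to keep the example self-contained. All three attributes are discretized, so every emitted feature value is 1.0.

```python
ORDER = ["gender", "age", "client_type"]            # formatting order (step S1)
DIMS = {"gender": 3, "age": 8, "client_type": 3}    # dimensions per attribute (step S2)
GENDER = {"female": 0, "male": 1, "unknown": 2}
CLIENT = {"mobile_app": 0, "pc": 1, "unknown": 2}
AGE_UPPER_BOUNDS = [(17, 1), (24, 2), (29, 3), (34, 4), (39, 5), (49, 6)]


def age_dimension(age):
    """Return the age dimension: 0 if it cannot be segmented, 7 for 50 and over."""
    if age is None or age < 0:
        return 0
    for upper, dim in AGE_UPPER_BOUNDS:
        if age <= upper:
            return dim
    return 7


def format_sample(sample):
    """Format one sample as sparse 'sequence number:value' pairs (steps S3 and S4)."""
    local = {
        "gender": GENDER.get(sample.get("gender"), GENDER["unknown"]),
        "age": age_dimension(sample.get("age")),
        "client_type": CLIENT.get(sample.get("client_type"), CLIENT["unknown"]),
    }
    pairs = []
    offset = 1  # first sequence number of the current attribute
    for name in ORDER:
        pairs.append(f"{offset + local[name]}:1.0")
        offset += DIMS[name]
    return " ".join(pairs)


user_a = {"gender": "male", "age": 24, "client_type": "pc"}
print(format_sample(user_a))  # 2:1.0 6:1.0 13:1.0
```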

In embodiments of the present invention, the attributes to be processed in formatting and the feature representation of each attribute can be set freely, so that specific features can be freely selected for statistical analysis according to the needs of model training. More importantly, no fixed order has to be established in advance to fix the feature sequence number of each feature, and the attributes/features to be processed can be added or deleted at any time, so the efficiency of feature formatting can be improved significantly.

Specifically, many machine learning problems, such as click-through rate prediction models, use a wide variety of attribute features. Some attributes are naturally discrete, such as the "gender" attribute; others are continuous, such as the "age" or "video duration" attributes. How a continuous feature is formatted depends on the choice of algorithm model. Taking the "video duration" attribute as an example, two formatting modes for a continuous feature are described here. In the first mode, discretization is needed. For example, the duration of advertising material generally ranges from 5 seconds to 1 minute, so the duration can be discretized into 5-second segments, and the second configuration file (that is, the feature configuration file of this attribute) specifies the dimensions, feature meanings, and internal order of the discretized feature: 0 is 0-4 seconds, 1 is 5-9 seconds, 2 is 10-14 seconds, 3 is 15-19 seconds, ..., 11 is 55-59 seconds, and 12 is 1 minute or more. The feature of this attribute ultimately occupies 13 dimensions in the feature vector space, and for each sample exactly one of these 13 dimensions has a feature value of 1. In the second mode, discretization is not needed. In this case the material duration is added to the feature vector directly as the feature value of a single feature dimension; a table mapping material IDs to durations can be written in the configuration file, and the specific duration feature value is obtained by looking up this table during feature extraction.
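Both formatting modes for "video duration" can be sketched in a few lines. The table `MATERIAL_DURATIONS`, its material IDs, and the function `duration_feature` are hypothetical; the 5-second segments and the 13 dimensions follow the description above, and durations are assumed to be whole, non-negative seconds.

```python
# Hypothetical lookup table from material ID to duration in seconds,
# standing in for the table the text says can be written in the configuration file.
MATERIAL_DURATIONS = {"m001": 15, "m002": 62, "m003": 4}


def duration_feature(material_id, discretize):
    """Return (dimension within the attribute, feature value) for one material."""
    seconds = MATERIAL_DURATIONS[material_id]
    if discretize:
        # 5-second segments: 0-4 s -> 0, 5-9 s -> 1, ..., 55-59 s -> 11,
        # and everything from 60 s upward -> 12 (13 dimensions in total).
        return min(seconds // 5, 12), 1.0
    # Not discretized: a single dimension carrying the duration itself.
    return 0, float(seconds)


print(duration_feature("m001", discretize=True))   # (3, 1.0)
print(duration_feature("m002", discretize=True))   # (12, 1.0)
print(duration_feature("m001", discretize=False))  # (0, 15.0)
```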

Discretization facilitates classification statistics during model training, whereas a continuous feature that is not discretized allows samples to be analyzed precisely and also reduces the dimension of the feature vector. Specifically, if a linear model such as a logistic regression model is used, continuous features need to be discretized; if a nonlinear model such as a tree model is used, discretization can be omitted. In embodiments of the present invention, whether to discretize and how to discretize are set freely in the configuration file, so formatting and model training can be performed for different algorithm requirements, which significantly improves the flexibility and applicability of feature formatting.

As shown in Fig. 2, an embodiment of the present invention also provides a data feature formatting device 1, comprising:

a first configuration module 101, configured to acquire a first configuration file and to determine, according to the switch settings in the first configuration file, the attributes to be processed in the current formatting and the formatting order of each attribute;

a second configuration module 102, configured to acquire a second configuration file and to determine, according to the feature configuration of each attribute in the second configuration file, the feature sequence and feature value meanings of the features to be formatted within the attribute;

a feature processing module 103, configured to determine the feature sequence number of each feature according to the formatting order of the attributes and the feature sequence of the features to be formatted within each attribute, and to determine the feature value of the corresponding feature according to the attribute values of an actual sample and the feature value meanings;

a formatting module 104, configured to format each actual sample into a feature vector according to the feature sequence numbers and the feature values.

Those skilled in the art will appreciate that, corresponding to the above method, the device of the embodiment of the present invention also has functional modules corresponding to each method step, which are not described again here. In practical applications, the data feature formatting device may be an independent computing device, an independent functional unit loaded by a computing device, or a virtual/physical unit implemented directly by a computing device. Likewise, each module in the device may be implemented by a central processing unit (CPU), microprocessor (MPU), digital signal processor (DSP), or field-programmable gate array (FPGA) located in the computing device; the means of implementing the device and its modules should not be regarded as limiting the specific embodiments of the present invention.

It should be understood that the above embodiments of the present invention are only intended to illustrate or explain the principles of the present invention and do not limit it. Therefore, any modification, equivalent replacement, improvement, and the like made without departing from the spirit and scope of the present invention shall fall within its scope of protection. Furthermore, the appended claims are intended to cover all changes and modifications that fall within the scope and boundaries of the claims, or equivalents of such scope and boundaries.

Claims

1. A data feature formatting method, characterized in that the method comprises the steps of:

2. The method according to claim 1, characterized in that the switch settings comprise attribute switch flags or attribute record status;

the formatting order follows the natural attribute order of the raw data of the actual samples, or is freely specified according to the needs of model training.

3. The method according to claim 1, characterized in that the feature configuration comprises a discretization switch and the formatting mode of the attribute.

4. The method according to claim 3, characterized in that the discretization switch and the formatting mode of the attribute are set freely according to the requirements of the algorithm model used for model training.

5. The method according to claim 1, characterized in that only the features whose feature value is not 0 are selected for storage in the feature vector.

6. A data feature formatting device, characterized in that the device comprises:

7. The device according to claim 6, characterized in that the first configuration module comprises:

8. The device according to claim 6, characterized in that the second configuration module comprises:

9. The device according to claim 8, characterized in that the discretization switch module and the format configuration module are set freely according to the requirements of the algorithm model used for model training.

10. The device according to claim 6, characterized in that the formatting module comprises:

a vector processing module, configured to select only the features whose feature value is not 0 to generate and store the feature vector.