CN106897776A - A continuous feature construction method based on nominal attributes - Google Patents
A continuous feature construction method based on nominal attributes
- Publication number: CN106897776A (application number CN201710034428.1A)
- Authority
- CN
- China
- Prior art keywords
- feature
- field
- feature construction
- indicator
- sample
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a continuous feature construction method based on nominal attributes, comprising the steps of: 1) data preprocessing; 2) setting up a feature construction framework according to business background knowledge; 3) generating specific feature construction paths; 4) constructing the corresponding features according to the feature construction paths and producing a training set; 5) performing feature selection on the training set and building a prediction model; 6) saving the relevant data sets and the prediction model and ending the offline training process; 7) preprocessing the sample data to be predicted online and extracting its features; 8) predicting the samples with the prediction model obtained from offline training. The invention applies not only to "user-item" scenarios, but also to more general classification and regression problems with nominal-attribute or categorical-variable features. Compared with traditional One-Hot and Dummy encoding, the features produced by the invention make the differences between samples more pronounced, and the generated features are more interpretable.
Description
Technical field
The present invention relates to the field of feature engineering in machine learning, and in particular to a continuous feature construction method based on nominal attributes.
Background art
With the arrival of the big-data era and the rise of the Internet, various machine learning algorithms are used to mine the commercially valuable information contained in data. Feature engineering is a key step in a machine learning system and determines the upper bound of the system's accuracy, and feature construction is in turn an important component of feature engineering. At present, feature construction is mostly rule-based manual extraction: it depends heavily on the engineer's understanding of the business background, and it is difficult to extract a reasonably complete set of features in one pass within a short time. In particular, for nominal-attribute or categorical-variable features such as a color feature taking values "yellow, red, blue", the nominal attribute is usually converted into sparse vectors with equal pairwise distances, constructed by One-Hot or Dummy encoding. Although each dimension of such an encoding indicates whether a certain nominal attribute or category occurs, and thus carries some physical meaning, the distance between different samples under this representation defaults to the same fixed value, which may contradict reality; moreover, when the nominal attribute takes too many values, this encoding leads to excessively high feature dimensionality.
The present invention provides a continuous feature construction method based on nominal attributes. The method achieves semi-automatic feature construction; compared with the currently common One-Hot encoding of nominal attributes, the differences between the generated samples are more pronounced. The method is also highly extensible and can be accelerated with parallel computing, allowing the machine learning engineer to focus on combining nominal-attribute features without worrying about the specific construction process. The features constructed by the method are often linear features with clear physical meaning and strong interpretability; a good prediction effect can already be reached with a specific feature selection process and a simple linear prediction model, making the method particularly suitable for building machine learning systems in commercial applications.
Content of the invention
The object of the invention is to overcome the shortcomings and deficiencies of the prior art by providing a continuous feature construction method based on nominal attributes. The method is divided into two parts, offline training and online prediction. It applies not only to "user-item" scenarios, but also to more general classification and regression problems with nominal-attribute or categorical-variable features. Compared with traditional One-Hot and Dummy encoding, the features produced by the construction method of the invention make the differences between samples more pronounced, the generated features are more interpretable, and problems such as overfitting caused by high-dimensional sparse features can be alleviated to a certain extent.
To achieve the above object, the technical scheme provided by the present invention is a continuous feature construction method based on nominal attributes, comprising the following steps:
1) data preprocessing, including data table integration, data representation format, and missing-value handling;
2) setting up a feature construction framework according to business background knowledge;
3) generating specific feature construction paths;
4) constructing the corresponding features according to the feature construction paths and producing a training set;
5) performing feature selection on the training set and building a prediction model;
6) saving the relevant data sets and the prediction model and ending the offline training process;
7) preprocessing the sample data to be predicted online and extracting its features;
8) predicting the samples with the prediction model obtained from offline training.
In step 1), data table integration means consolidating the existing data tables so that all fields in the data set are placed in the same table. Data representation format means converting, where necessary, the current nominal-attribute fields into new nominal-attribute fields; the specific representation is decided according to the application scenario and the prediction requirements. Missing-value handling covers two cases, field removal and missing-data filling: fields with severe data missingness are removed; for a field whose missingness is not severe, if it is a nominal-attribute field, the missing part is filled with a new attribute value or imputed with the KNN algorithm, and if it is a continuous field, it is filled with the mean or by other filling methods.
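As a minimal sketch of the missing-value handling described above (a hypothetical toy table; the column names and the 50% removal threshold are assumptions, not from the patent):

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", None, "red"],      # nominal-attribute field
    "price": [1.0, None, 3.0, 5.0],             # continuous field
    "mostly_missing": [None, None, None, 7.0],  # severely missing field
})

# Remove fields whose fraction of missing values is too high (threshold assumed).
df = df.loc[:, df.isna().mean() < 0.5]

# Nominal field: fill gaps with a brand-new attribute value
# (the KNN alternative mentioned in the text is omitted here).
df["color"] = df["color"].fillna("__missing__")

# Continuous field: mean filling.
df["price"] = df["price"].fillna(df["price"].mean())

print(df)
```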
In step 2), the feature construction framework of the current prediction or classification problem is determined in combination with business background knowledge, in the following steps:
2.1) Determine the trunks of the feature construction framework and all host nodes on them. For the "user-item" application scenario there are three kinds of trunk, "user-item-indicator-calculation mode", "user-user indicator-calculation mode", and "item-item indicator-calculation mode"; a host node is a node on a trunk, and there are six kinds: "user", "item", "indicator", "calculation mode", "item indicator", and "user indicator". For the general application scenario with nominal attributes or categorical variables there is only one kind of trunk, "window-indicator-calculation mode", and only three kinds of host node, "window", "indicator", and "calculation mode".
2.2) Determine the leaf nodes under each host node. Each host node contains at least one leaf node, and each leaf node stores the name of one nominal-attribute field of the data table. For the "user-item" scenario, the leaf nodes under host node "user" generally represent characteristics and attributes of users, each partitioning the users into several distinct categories, while the leaf nodes under host node "item" represent characteristics and attributes of items. The leaf nodes under "indicator" represent the degree of matching between user and item, such as the similarity between the user description and the item description, or whether a certain user and item occur in the same sample; a leaf node under "user indicator" represents an indicator of the user alone, such as the user's age or account balance, and a leaf node under "item indicator" represents an indicator of the item itself, such as its price. For the general application scenario with nominal attributes or categorical variables, the leaf nodes under "indicator" usually store the names of the continuous feature fields other than the nominal attributes, and the leaf nodes under "window" store all the nominal-attribute field names. The leaf nodes under "calculation mode" are statistics set according to the current prediction requirement or business background knowledge, such as sum, mean, standard deviation, median, and mode.
In step 3), specific feature construction paths are generated from the feature construction framework determined in step 2). For the "user-item" scenario, one feature construction path consists of one selected leaf node for each host node on a trunk, and all possible feature construction paths are traversed in "trunk-host node-leaf node" order. For the general application scenario with nominal attributes, the following steps are needed:
3.1) Determine the window size, i.e. how many leaf nodes a window contains;
3.2) set a leaf-node combination rule: combine the leaf nodes under the "window" host node and traverse all leaf-node combinations satisfying the window size and the combination rule;
3.3) combine each leaf-node combination under the window with each leaf node under the "indicator" host node and each leaf node under "calculation mode", finally forming all possible feature construction paths.
In step 4), the features corresponding to the feature construction paths are constructed in the following steps:
4.1) Determine all nominal-attribute fields contained in the current feature construction path. In the "user-item" case, the nominal-attribute fields of the current path are determined by the leaf nodes selected under host nodes "user" and "item"; in the general application scenario with nominal attributes, they are determined by the nominal-attribute fields contained in the leaf-node combination under the "window" host node.
4.2) Let the set of nominal-attribute fields determined in step 4.1) be C = {A, B, ...}, where A and B denote nominal-attribute field names. In the "user-item" case the size of the set is 1 or 2; in the general application scenario with nominal attributes the set size is at least 1. For each path, the feature finally generated is given by:
F_Cyf,i = f(Y_i)
where y denotes the field of the leaf node selected under the "user indicator", "item indicator", or "indicator" host node in the current path, f denotes the custom calculation mode, and Cyf determines the composition of the path. F_Cyf,i denotes the feature value of the i-th sample on path Cyf, and Y_i denotes the set of values of the indicator field over all samples in the sample index set S_Cyf,i of the i-th sample, i.e.:

Y_i = { y_j | j ∈ S_Cyf,i }
S_Cyf,i is defined by the following expression:

S_Cyf,i = { j ∈ S | C_j = C_i }

where S denotes the index set of all samples, C_i denotes the set of values taken in the i-th sample by every nominal-attribute field of C, and C_j = C_i means that the values of every nominal-attribute field of C in the j-th sample are identical to those in the i-th sample.
If the calculation mode f is defined as summation (sum), mean (average), or standard deviation (std), the features produced under these three calculation modes are given respectively by:

F_Cyf,i = Σ_{j∈S_Cyf,i} y_j (sum)
F_Cyf,i = (1/|S_Cyf,i|) Σ_{j∈S_Cyf,i} y_j (mean)
F_Cyf,i = sqrt( (1/|S_Cyf,i|) Σ_{j∈S_Cyf,i} (y_j − ȳ)² ), with ȳ the mean above (std)
4.3) Carry out feature construction for every path according to step 4.2). After the feature construction paths of all samples have produced features, they are placed in the same table as the training set, in which each sample is a row, each field is a feature, and the field name is the name of the path that constructed the feature.
In step 5), a feature selection algorithm picks out from all features the feature subset with the best prediction accuracy, and a prediction model is built on it.
In step 6), the relevant data sets are the training set after feature selection and all the nominal-attribute field data involved in building the training set; these data sets will be used for feature generation for online samples. The field names of the features in the training set are still named after their feature construction paths, and the fields of the two data sets are placed in the same table. The saved prediction model will be used for online sample prediction.
In step 7), the sample data to be predicted online is preprocessed and its features are extracted, in the following steps:
7.1) Preprocess the sample data to be predicted online, in correspondence with the preprocessing steps of offline training: the fields that were removed during offline training because of severe missingness are removed from the current sample data, and the fields that were not removed during offline training but have missing data in the current sample are filled with the KNN algorithm or mean filling.
7.2) Extract features from the sample data to be predicted online; this process again corresponds to the offline training process. First, each feature construction path, i.e. each feature field name, is read from the data table obtained in step 6); then, according to the path and its corresponding nominal-attribute fields, the feature value on the current path is copied into the sample to be predicted from the training-set samples whose nominal-attribute values are identical to those of the sample to be predicted.
In step 8), the prediction model obtained from offline training is used to predict the feature-extracted samples of step 7).
Compared with the prior art, the present invention has the following advantages and beneficial effects:
1. Many prediction or recommendation problems classify or predict well with popularity-based features; the invention provides a method that can generate all popularity-based features at various granularities.
2. The feature construction method of the invention has good extensibility: in the "user-item" method the user can create custom leaf nodes based on the business background, so that the method automatically constructs features that better match reality; the general feature construction method breaks away from the "user-item" restriction, and feature construction over nominal attributes only requires setting the window size.
3. In the implementation of the feature construction method, the feature construction paths are independent of one another, which suits parallelization.
4. The features constructed by the "user-item" method of the invention are highly interpretable and have clear real-world meaning; for example, "the current user's total number of clicks on all advertisements" often represents the likelihood of the user clicking on advertisements.
5. The features constructed by the feature construction method of the invention are often linear features: feature selection can be done simply with the Pearson correlation coefficient, and a relatively simple linear model already gives good classification or prediction results.
6. Compared with the currently common One-Hot encoding of nominal attributes, the differences between the generated features are more pronounced: with One-Hot encoding the distance between the feature vectors of different attribute values is a fixed constant, while for the features constructed by the method of the invention the distance between different attribute values is scaled up or down by the fields in the indicator nodes.
7. During online prediction no features need to be reconstructed; it suffices to extract the features directly from the offline training data, avoiding the excessive time overhead that a high-complexity algorithm would cause during online prediction.
Brief description of the drawings
Fig. 1 shows the feature construction method of the invention and its corresponding overall machine learning system.
Fig. 2 shows the general framework of the feature construction method for the "user-item" application scenario.
Fig. 3 shows the feature construction framework for the general application scenario containing nominal-attribute fields.
Specific embodiment
The invention is further described below with reference to a specific embodiment.
As shown in Fig. 1, the continuous feature construction method based on nominal attributes described in this embodiment is an important part of the whole machine learning system: it is responsible for producing all features needed to train the model and determines the upper bound of the prediction model's accuracy. The method is divided into an offline training part and an online prediction part; features are constructed offline, and the features of the samples to be predicted online are produced from the existing training set without recomputation. The method specifically comprises the following steps:
1) Data preprocessing, including data table integration, data representation format, and missing-value handling. Data table integration means consolidating the existing data tables so that all fields in the data set are placed in the same table; data representation format means converting the current nominal-attribute fields into new nominal-attribute fields, the specific representation being decided according to the application scenario and the prediction requirements; missing-value handling covers two cases, field removal and missing-data filling: fields with severe data missingness are removed, and for a field whose missingness is not severe, if it is a nominal-attribute field, the missing part is filled with a new attribute value or imputed with the KNN algorithm, and if it is a continuous field, it is filled with the mean or by other filling methods.
Table 1: data set representation

| ID | User ID | User attribute A | User attribute B | Item ID | Item attribute C | Occurred | Similarity |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 2 | 1 | 2 | 0 | 0.25 |
| 2 | 1 | 2 | 1 | 2 | 2 | 1 | 0.45 |
| 3 | 2 | 2 | 2 | 3 | 1 | 1 | 0.80 |
| … | … | … | … | … | … | … | … |

The actual effect is shown in Table 1: all nominal attributes and related indicator fields of the current data set are stored in the same table and preprocessed, where the field "ID" denotes the sample label.
2) Set up the feature construction framework according to business background knowledge, in the following steps:
2.1) Determine the trunks of the feature construction framework and all host nodes on them. For the "user-item" application scenario there are three kinds of trunk, "user-item-indicator-calculation mode", "user-user indicator-calculation mode", and "item-item indicator-calculation mode"; a host node is a node on a trunk, and there are six kinds: "user", "item", "indicator", "calculation mode", "item indicator", and "user indicator". For the general application scenario with nominal attributes or categorical variables there is only one kind of trunk, "window-indicator-calculation mode", and only three kinds of host node, "window", "indicator", and "calculation mode".
2.2) Determine the leaf nodes under each host node. Each host node contains at least one leaf node, and each leaf node stores the name of one nominal-attribute field of the data table. For the "user-item" scenario, the leaf nodes under host node "user" generally represent characteristics and attributes of users, each partitioning the users into several distinct categories, while the leaf nodes under host node "item" represent characteristics and attributes of items. The leaf nodes under "indicator" represent the degree of matching between user and item, such as the similarity between the user description and the item description, or whether a certain user and item occur in the same sample; a leaf node under "user indicator" represents an indicator of the user alone, such as the user's age or account balance, and a leaf node under "item indicator" represents an indicator of the item itself, such as its price. For the general application scenario with nominal attributes or categorical variables, the leaf nodes under "indicator" usually store the names of the continuous feature fields other than the nominal attributes, and the leaf nodes under "window" store all the nominal-attribute field names. The leaf nodes under "calculation mode" are statistics set according to the current prediction requirement or business background knowledge, such as sum, mean, standard deviation, median, and mode.
Each node of the method is set empirically and the corresponding field names are stored in the nodes; the structure of the whole method is described with a JSON file which, together with the preprocessed data set from step 1), serves as the input of the feature construction process. For the "user-item" application scenario the general frame structure (corresponding to Table 1) is shown in Fig. 2; only three user nominal attributes, "ID", "User attribute A", and "User attribute B", and two item attributes, "Item ID" and "Item attribute C", are shown here, corresponding to Table 1. In actual use, different numbers of leaf nodes are set according to the number of fields.
For the more general application scenario containing nominal attributes, the "user" and "item" host nodes are not distinguished; a single "window" host node is used instead, under which all nominal-attribute fields are included, and its frame structure is shown in Fig. 3.
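A hypothetical sketch of such a JSON framework file, using the trunk/host-node/leaf-node structure described above and the field names of Table 1 (the exact schema is an assumption; the patent's own listing is not reproduced in this text):

```json
{
  "trunks": [
    ["user", "item", "indicator", "calculation mode"],
    ["user", "user indicator", "calculation mode"],
    ["item", "item indicator", "calculation mode"]
  ],
  "host nodes": {
    "user": ["ID", "User attribute A", "User attribute B"],
    "item": ["Item ID", "Item attribute C"],
    "indicator": ["Occurred", "Similarity"],
    "calculation mode": ["sum", "mean", "std"]
  }
}
```

For the general scenario of Fig. 3, the "user" and "item" entries would be replaced by a single "window" entry listing all nominal-attribute fields.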
3) Generate the specific feature construction paths. For the "user-item" application scenario, path combination is performed according to the leaf nodes in the JSON file: first select a trunk, then select the leaf node on each of its host nodes; for example "ID"-"Item attribute C"-"Similarity"-"mean" represents the average similarity of the current user to the items with the current value of item attribute C. In the actual implementation, only the leaf nodes other than the calculation-mode node take part in the path combination here, because the computation of the calculation-mode node is mainly floating-point arithmetic, while the computation of the other nodes is mainly query matching and set intersection. After all possible paths have been combined, unnecessary or unreachable paths can be removed in advance: a feature such as "the total popularity of all users for all items" takes the same value in every sample and is unnecessary, while a feature such as "whether the current user occurred with the current item" is the very target to be predicted or classified and belongs to the unreachable paths.
For the more general application scenario containing nominal attributes, the window size must be determined in advance, i.e. how many leaf nodes a window contains; a leaf-node combination rule is set: the leaf nodes under the "window" host node are combined, and all leaf-node combinations satisfying the window size and the combination rule are traversed; each leaf-node combination under the window is then combined with each leaf node under the "indicator" host node and each leaf node under "calculation mode", finally forming all possible feature construction paths.
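The window/indicator/calculation-mode traversal described above can be sketched as a plain enumeration (the field names and window size are hypothetical, not from the patent):

```python
from itertools import combinations, product

window_leaves = ["color", "brand", "region"]  # nominal-attribute fields under "window"
indicator_leaves = ["price", "similarity"]    # continuous fields under "indicator"
calc_modes = ["sum", "mean", "std"]           # statistics under "calculation mode"
window_size = 2                               # chosen window size

# Traverse all leaf-node combinations that satisfy the window size ...
windows = list(combinations(window_leaves, window_size))

# ... and cross each window with every indicator and calculation mode
# to obtain all possible feature construction paths.
paths = list(product(windows, indicator_leaves, calc_modes))

print(len(paths))  # 3 windows x 2 indicators x 3 modes = 18 paths
```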
4) Construct the corresponding features according to the feature construction paths and produce the training set, in the following steps:
4.1) Determine all nominal-attribute fields contained in the current feature construction path. In the "user-item" case, the nominal-attribute fields of the current path are determined by the leaf nodes selected under host nodes "user" and "item"; in the general application scenario with nominal attributes, they are determined by the nominal-attribute fields contained in the leaf-node combination under the "window" host node.
4.2) Let the set of nominal-attribute fields determined in step 4.1) be C = {A, B, ...}, where A and B denote nominal-attribute field names. In the "user-item" case the size of the set is 1 or 2; in the general application scenario with nominal attributes the set size is at least 1. For each path, the feature finally generated is given by:
F_Cyf,i = f(Y_i)
where y denotes the field of the leaf node selected under the "user indicator", "item indicator", or "indicator" host node in the current path, f denotes the custom calculation mode, and Cyf determines the composition of the path. F_Cyf,i denotes the feature value of the i-th sample on path Cyf, and Y_i denotes the set of values of the indicator field over all samples in the sample index set S_Cyf,i of the i-th sample, i.e.:

Y_i = { y_j | j ∈ S_Cyf,i }
S_Cyf,i is defined by the following expression:

S_Cyf,i = { j ∈ S | C_j = C_i }

where S denotes the index set of all samples, C_i denotes the set of values taken in the i-th sample by every nominal-attribute field of C, and C_j = C_i means that the values of every nominal-attribute field of C in the j-th sample are identical to those in the i-th sample.
If the calculation mode f is defined as summation (sum), mean (average), or standard deviation (std), the features produced under these three calculation modes are given respectively by:

F_Cyf,i = Σ_{j∈S_Cyf,i} y_j (sum)
F_Cyf,i = (1/|S_Cyf,i|) Σ_{j∈S_Cyf,i} y_j (mean)
F_Cyf,i = sqrt( (1/|S_Cyf,i|) Σ_{j∈S_Cyf,i} (y_j − ȳ)² ), with ȳ the mean above (std)
In a practical implementation, the feature computation of the above formulas is done with data query statements: an aggregate function is first determined for every leaf node according to the calculation mode, and the construction of the features is then realized with GROUP BY operations (taking the "user-item" case as an example, the concrete statement is shown in Table 2 below).
Each GROUP BY operation produces features of several dimensions (how many depends on the number of leaf nodes of the calculation-mode node). For the "user-item" application scenario, the field name of each dimension of the feature space is user_field@item_field@indication_field@std_dev or user_field@item_field@indication_field@mean, where user_field, item_field, and indication_field are string variables denoting the field names of the leaf nodes under the user, item, and indicator host nodes. For the more general application scenario, the form attributes1@attributes2@...@indication_field@operation is used instead. Since every query in this step is independent, the step is easy to parallelize.
Table 2: feature construction operation

| Line | SQL |
|---|---|
| 1 | SELECT user_field, item_field, |
| 2 | STD(indication_field) AS user_field@item_field@std_dev, |
| 3 | MEAN(indication_field) AS user_field@item_field@mean |
| 4 | FROM table1 |
| 5 | GROUP BY user_field, item_field |
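As a hedged illustration, the GROUP BY construction of Table 2 could equivalently be done in pandas (the mapping to pandas and the sample values are assumptions for illustration, not part of the patent):

```python
import pandas as pd

# Samples in roughly the shape of Table 1 (values illustrative).
df = pd.DataFrame({
    "user_field": [1, 1, 2, 2],
    "item_field": [2, 2, 1, 1],
    "indication_field": [0.25, 0.45, 0.80, 0.60],
})

# One GROUP BY produces one feature per calculation mode, named after its path.
agg = (df.groupby(["user_field", "item_field"])["indication_field"]
         .agg(["mean", "std"])
         .rename(columns={
             "mean": "user_field@item_field@indication_field@mean",
             "std": "user_field@item_field@indication_field@std_dev",
         })
         .reset_index())

# Copy each group-level feature value back onto every sample of its group.
features = df.merge(agg, on=["user_field", "item_field"], how="left")
print(features)
```

Each path's query is independent of the others, which is what makes the step easy to parallelize.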
4.3) Carry out feature construction for every path according to step 4.2). After the feature construction paths of all samples have produced features, they are placed in the same table as the training set, in which each sample is a row, each field is a feature, and the field name is the name of the path that constructed the feature.
5) Perform feature selection on the training set and build the prediction model. Feature selection is done with, for example, the Pearson correlation coefficient: the correlation coefficient between each feature and the target is computed, and the feature is kept when its correlation coefficient exceeds a specified threshold and removed otherwise. After the feature subset has been obtained by the above steps, the pairwise correlation coefficients between features are computed, and a subset with weak pairwise correlation is picked out of it as the final feature set. Finally, a prediction model with good accuracy is chosen and trained.
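A minimal sketch of this two-stage Pearson selection on synthetic data (the data and both thresholds are assumptions; NumPy only):

```python
import numpy as np

t = np.linspace(0.0, 1.0, 200)
X = np.column_stack([
    np.sin(2 * np.pi * t),                          # feature 0: informative
    np.sin(2 * np.pi * t) + 0.01 * np.cos(20 * t),  # feature 1: near-duplicate of 0
    t,                                              # feature 2: weakly informative
    np.cos(2 * np.pi * t),                          # feature 3: uncorrelated with target
])
y = X[:, 0] + 0.5 * X[:, 2]

def pearson(a, b):
    return np.corrcoef(a, b)[0, 1]

# Stage 1: keep features whose correlation with the target clears a threshold.
keep = [j for j in range(X.shape[1]) if abs(pearson(X[:, j], y)) > 0.15]

# Stage 2: of those, greedily drop any feature highly correlated with one kept before it.
final = []
for j in keep:
    if all(abs(pearson(X[:, j], X[:, k])) < 0.9 for k in final):
        final.append(j)

print(final)  # feature 1 dropped as redundant, feature 3 as irrelevant
```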
6) Before online prediction, the necessary data of offline training must first be saved for use during online prediction, including:
6.1) Save the features of step 5) of offline training into the same table as the fields of step 1) other than the indicator fields, as shown in Table 3 below; the actual field name of feature 1 is a string of the form user_field@item_field@indication_field@mean;
6.2) save the model obtained in step 5) together with its parameters.
Table 3: training sample information

| ID | User attribute A | User attribute B | Item ID | Item attribute C | Feature 1 | Feature 2 | … |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 2 | 1 | 2 | 0 | 0.25 | … |
| 1 | 2 | 1 | 2 | 2 | 1 | 0.45 | … |
| 2 | 2 | 2 | 3 | 1 | 1 | 0.80 | … |
| … | … | … | … | … | … | … | … |
7) Preprocess the sample data to be predicted online and extract its features: take the feature field names out of Table 3 to obtain all path combinations that produced features; for each path, deduplicate Table 3 by the corresponding nominal-attribute fields and then left-join it with the table of samples to be predicted, thereby obtaining the features of the current path.
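The deduplicate-then-left-join extraction of step 7) might look like this in pandas (column names loosely follow Table 3 and are illustrative assumptions):

```python
import pandas as pd

# Saved offline table in the shape of Table 3.
train = pd.DataFrame({
    "User attribute A": [1, 2, 2],
    "Item attribute C": [2, 2, 1],
    "feature_1": [0.10, 0.45, 0.80],
})

# New online samples to be predicted.
online = pd.DataFrame({
    "User attribute A": [2, 1],
    "Item attribute C": [2, 2],
})

path_keys = ["User attribute A", "Item attribute C"]  # nominal fields of this path

# Deduplicate by the path's nominal-attribute fields, then left-join.
lookup = train.drop_duplicates(subset=path_keys)[path_keys + ["feature_1"]]
online = online.merge(lookup, on=path_keys, how="left")
print(online)
```

No feature is recomputed online; the value is simply looked up from the offline table, matching advantage 7 above.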
8) forecast model for being obtained using off-line training is to step 7) in sample to be predicted after feature extraction be predicted.
In summary, with the above scheme the invention provides a new method for constructing features from nominal attributes. It applies not only to "user-item" scenarios, but also to more general classification and regression problems with nominal-attribute or categorical-variable features. Compared with traditional One-Hot and Dummy encoding, the features produced by the construction method of the invention make the differences between samples more pronounced, and the generated features are more interpretable; the method therefore has practical value and is worth popularizing.
The embodiment described above is only a preferred embodiment of the invention and does not limit the scope of its practice; therefore any change made according to the shapes and principles of the present invention shall be covered within the scope of protection of the invention.
Claims (7)
1. A continuous feature construction method based on nominal attributes, characterised by comprising the following steps:
1) data pre-processing, including data table integration, data representation formatting and missing-value treatment;
2) setting a feature construction framework according to business background knowledge;
3) generating specific feature construction paths;
4) constructing the corresponding features according to the feature construction paths and producing a training set;
5) performing feature selection on the training set and building a prediction model;
6) saving the related data sets and the prediction model and ending the off-line training process;
7) pre-processing and extracting features from the sample data to be predicted online;
8) predicting the samples with the prediction model obtained by off-line training.
2. The continuous feature construction method based on nominal attributes according to claim 1, characterised in that: in step 1), the data table integration refers to merging the existing data tables so that all fields of the data set are placed in the same table; the data representation formatting refers to converting the current nominal attribute fields into new nominal attribute fields, the specific representation being determined by the application scenario and the prediction requirements; the missing-value treatment covers two situations, field rejection and missing-data filling: fields with severe missing data are rejected, while for fields whose missing data is not severe, a nominal attribute field has its missing part filled with a new attribute value or filled using the KNN algorithm, and a continuous field is filled with the mean or by other filling methods.
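The missing-value treatment in this claim can be sketched as follows: drop badly missing fields, fill nominal fields with a new attribute value, and fill continuous fields with the mean. The helper name, the `"MISSING"` value and the 0.5 drop threshold are assumptions; the KNN-filling alternative the claim also allows is omitted for brevity.

```python
import pandas as pd

def treat_missing(df: pd.DataFrame, nominal: list, drop_ratio: float = 0.5) -> pd.DataFrame:
    """Sketch of the claim's missing-value treatment, under the
    assumptions stated above."""
    # Reject fields whose fraction of missing values exceeds drop_ratio.
    df = df.loc[:, df.isna().mean() <= drop_ratio].copy()
    for col in df.columns:
        if col in nominal:
            df[col] = df[col].fillna("MISSING")       # new attribute value
        else:
            df[col] = df[col].fillna(df[col].mean())  # mean fill for continuous
    return df
```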
3. The continuous feature construction method based on nominal attributes according to claim 1, characterised in that: in step 2), the feature construction framework of the current prediction or classification problem is determined with reference to business background knowledge, comprising the following steps:
2.1) determining all the trunks of the feature construction framework and the master nodes on each trunk
For the "user-item" application scenario, the trunks are of three kinds: "user-item-index-calculation method", "user-user index-calculation method" and "item-item index-calculation method"; a master node is a node on a trunk, of six kinds: "user", "item", "index", "calculation method", "item index" and "user index". For application scenarios with nominal attributes or categorical variables, there is only one trunk, "window-index-calculation method", with only three master nodes: "window", "index" and "calculation method";
2.2) determining the leaf nodes under each master node
Each master node contains at least one leaf node, and each leaf node stores the name of a nominal attribute field in the data table. For the "user-item" application scenario, the leaf nodes under the master node "user" generally represent characteristics and attributes of users, each dividing the users into several different categories; the leaf nodes under the master node "item" represent characteristics and attributes of items; the leaf nodes under "index" represent the degree of matching between users and items; the leaf nodes under "user index" represent indexes of the user itself; and the leaf nodes under "item index" represent indexes of the item itself. For application scenarios with nominal attributes or categorical variables, the leaf nodes under "index" usually store the names of the continuous feature fields other than the nominal attributes, the leaf nodes under "window" store all the nominal attribute field names, and the leaf nodes under "calculation method" refer to the statistical methods set according to the current prediction requirements or business background knowledge;
In step 3), specific feature construction paths are produced according to the feature construction framework determined in step 2). For the "user-item" application scenario, one feature construction path comprises a trunk and one leaf node selected under each master node on that trunk, and all possible feature construction paths are traversed in the order "trunk-master node-leaf node". For application scenarios with nominal attributes, the following steps are needed:
3.1) determining the size of the window, i.e. how many leaf nodes a window contains;
3.2) setting the leaf-node combination rule: the leaf nodes under the "window" master node are combined, traversing all leaf-node combinations that satisfy the window size and the combination rule;
3.3) for every leaf-node combination under the window, combining it with the different leaf nodes under the master nodes "index" and "calculation method", finally forming all possible feature construction paths;
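Steps 3.1)-3.3) enumerate paths as a cross product of window combinations, index fields and calculation methods. A sketch, assuming a simple combination rule (all subsets of the stated window size) and illustrative field names:

```python
from itertools import combinations, product

nominal_fields = ["A", "B", "C"]       # leaf nodes under the "window" master node
index_fields = ["y"]                   # continuous fields under "index"
calc_methods = ["sum", "mean", "std"]  # leaf nodes under "calculation method"
window_size = 2                        # step 3.1): leaf nodes per window

# Step 3.2) enumerates window combinations; step 3.3) crosses each with
# every index field and calculation method to form a path name.
paths = [
    "@".join(window) + f"@{y}@{f}"
    for window in combinations(nominal_fields, window_size)
    for y, f in product(index_fields, calc_methods)
]
```

With 3 nominal fields, window size 2, 1 index field and 3 calculation methods, this yields 3 × 1 × 3 = 9 paths such as `A@B@y@sum`.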
In step 4), the corresponding features are constructed according to the feature construction paths, comprising the following steps:
4.1) determining all the nominal attribute fields contained in the current feature construction path; in the "user-item" case, the nominal attribute fields of the current path are determined by the leaf nodes selected under the master nodes "user" and "item" of the current path, while for application scenarios with nominal attributes they are determined by the nominal attribute fields contained in the leaf-node combination under the "window" master node;
4.2) let the set of nominal attribute fields determined in step 4.1) be C = {A, B, …}, where A and B denote nominal attribute field names; in the "user-item" case the size of the set is 1 or 2, while for application scenarios with nominal attributes the size of the set is at least 1; the feature finally produced by each path is given by:
FCyf,i=f (Yi)
In formula, y represents " user's index " in current path, the word in the leaf node under " article index " or " index " host node
Section, f represents customized calculation, and Cyf determines the composition structure of each paths, FCyf,iRepresent i-th sample on
The feature value of path Cyf, YiRepresent i-th sample index set S of sampleCyf,iIn all samples index field value
Set, i.e.,:
Yi={ yj|j∈SCyf,i}
SCyf,iDefinition expression formula be shown below:
In formula, S represents the index set of all samples, CiRepresent in i-th sample for each nominal attribute field in set C
All values set, Cj=CiRepresent in i-th sample for all values of each nominal attribute field in set C
Set is identical with j-th sample;
If calculation f is respectively defined as into sue for peace sum, average average and standard deviation std, then under these three calculations
The feature of generation is given by following formula respectively:
4.3) every path undergoes feature construction according to step 4.2); after all the feature construction paths have produced features for all samples, they are placed in the same table as the training set, in which each row is a sample and each field is a feature, the field name being the name of the path that constructed the feature.
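The aggregation F_{Cyf,i} = f(Y_i) in step 4.2) corresponds to a group-by transform over the nominal fields of set C. A pandas sketch with illustrative data and column names:

```python
import pandas as pd

df = pd.DataFrame({
    "user_field": [1, 2, 2, 1],
    "item_field": [1, 2, 3, 1],
    "y": [0.0, 1.0, 1.0, 0.0],    # the index field of the current path
})
C = ["user_field", "item_field"]  # nominal attribute set C

# One new column per calculation method f; F_{Cyf,i} aggregates y over all
# samples whose values on C match sample i. The column name follows the
# path-name convention user_field@item_field@y@f.
for f in ("sum", "mean"):
    df[f"user_field@item_field@y@{f}"] = df.groupby(C)["y"].transform(f)
```

`transform("std")` works the same way, though singleton groups yield NaN under pandas' default ddof=1.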
4. The continuous feature construction method based on nominal attributes according to claim 1, characterised in that: in step 5), the feature subset with the best prediction accuracy is picked out from all the features by a feature selection algorithm, and the prediction model is built.
5. The continuous feature construction method based on nominal attributes according to claim 1, characterised in that: in step 6), the related data sets refer to the training set after feature selection and the data of all the nominal attribute fields involved in building the training set; these data sets will be used to generate the features of online samples, the field names of the features in the training set are still named after the feature construction paths, and all the fields of the two data sets are placed in the same table; the saved prediction model will be used for the prediction of online samples.
6. The continuous feature construction method based on nominal attributes according to claim 1, characterised in that: in step 7), the sample data to be predicted online is pre-processed and its features extracted, comprising the following steps:
7.1) pre-processing the sample data to be predicted online, in correspondence with the pre-processing steps of off-line training: the fields that were removed during off-line training because of severe missing data are rejected from the current sample data, and the fields that were not removed during off-line training but have missing data in the current sample data are processed by the KNN algorithm or by mean filling;
7.2) extracting the features of the sample data to be predicted online, again in correspondence with the off-line training process: first the field name of each feature construction path, i.e. of each feature, is read from the data table obtained in step 6); then, according to the path and its corresponding nominal attribute fields, the feature value of the current path is copied into each sample to be predicted from the training-set samples whose nominal attribute fields take the same values as those of that sample.
7. The continuous feature construction method based on nominal attributes according to claim 1, characterised in that: in step 8), the prediction model obtained by off-line training is used to predict the samples from step 7) after feature extraction.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710034428.1A CN106897776A (en) | 2017-01-17 | 2017-01-17 | A kind of continuous type latent structure method based on nominal attribute |
PCT/CN2017/116131 WO2018133596A1 (en) | 2017-01-17 | 2017-12-14 | Continuous feature construction method based on nominal attribute |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106897776A true CN106897776A (en) | 2017-06-27 |
Family
ID=59197925
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710034428.1A Pending CN106897776A (en) | 2017-01-17 | 2017-01-17 | A kind of continuous type latent structure method based on nominal attribute |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN106897776A (en) |
WO (1) | WO2018133596A1 (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107844560A (en) * | 2017-10-30 | 2018-03-27 | 北京锐安科技有限公司 | A kind of method, apparatus of data access, computer equipment and readable storage medium storing program for executing |
WO2018133596A1 (en) * | 2017-01-17 | 2018-07-26 | 华南理工大学 | Continuous feature construction method based on nominal attribute |
CN108776673A (en) * | 2018-05-23 | 2018-11-09 | 哈尔滨工业大学 | Automatic switching method, device and the storage medium of relation schema |
CN108932647A (en) * | 2017-07-24 | 2018-12-04 | 上海宏原信息科技有限公司 | A kind of method and apparatus for predicting its model of similar article and training |
CN109146083A (en) * | 2018-08-06 | 2019-01-04 | 阿里巴巴集团控股有限公司 | Feature coding method and apparatus |
CN111651524A (en) * | 2020-06-05 | 2020-09-11 | 第四范式(北京)技术有限公司 | Auxiliary implementation method and device for online prediction by using machine learning model |
CN113892939A (en) * | 2021-09-26 | 2022-01-07 | 燕山大学 | Method for monitoring respiratory frequency of human body in resting state based on multi-feature fusion |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101226521A (en) * | 2008-02-18 | 2008-07-23 | 南京大学 | Machine learning method for ambiguity data object estimation modeling |
US7792770B1 (en) * | 2007-08-24 | 2010-09-07 | Louisiana Tech Research Foundation; A Division Of Louisiana Tech University Foundation, Inc. | Method to indentify anomalous data using cascaded K-Means clustering and an ID3 decision tree |
CN104063472A (en) * | 2014-06-30 | 2014-09-24 | 电子科技大学 | KNN text classifying method for optimizing training sample set |
CN104134017A (en) * | 2014-07-18 | 2014-11-05 | 华南理工大学 | Protein interaction relationship pair extraction method based on compact character representation |
CN105550275A (en) * | 2015-12-09 | 2016-05-04 | 中国科学院重庆绿色智能技术研究院 | Microblog forwarding quantity prediction method |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7451065B2 (en) * | 2002-03-11 | 2008-11-11 | International Business Machines Corporation | Method for constructing segmentation-based predictive models |
CN106897776A (en) * | 2017-01-17 | 2017-06-27 | 华南理工大学 | A kind of continuous type latent structure method based on nominal attribute |
- 2017-01-17: CN application CN201710034428.1A (CN106897776A), status Pending
- 2017-12-14: WO application PCT/CN2017/116131 (WO2018133596A1), status Application Filing
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018133596A1 (en) * | 2017-01-17 | 2018-07-26 | 华南理工大学 | Continuous feature construction method based on nominal attribute |
CN108932647A (en) * | 2017-07-24 | 2018-12-04 | 上海宏原信息科技有限公司 | A kind of method and apparatus for predicting its model of similar article and training |
CN107844560A (en) * | 2017-10-30 | 2018-03-27 | 北京锐安科技有限公司 | A kind of method, apparatus of data access, computer equipment and readable storage medium storing program for executing |
CN108776673A (en) * | 2018-05-23 | 2018-11-09 | 哈尔滨工业大学 | Automatic switching method, device and the storage medium of relation schema |
CN108776673B (en) * | 2018-05-23 | 2020-08-18 | 哈尔滨工业大学 | Automatic conversion method and device of relation mode and storage medium |
CN109146083A (en) * | 2018-08-06 | 2019-01-04 | 阿里巴巴集团控股有限公司 | Feature coding method and apparatus |
CN109146083B (en) * | 2018-08-06 | 2021-07-23 | 创新先进技术有限公司 | Feature encoding method and apparatus |
CN111651524A (en) * | 2020-06-05 | 2020-09-11 | 第四范式(北京)技术有限公司 | Auxiliary implementation method and device for online prediction by using machine learning model |
CN111651524B (en) * | 2020-06-05 | 2023-10-03 | 第四范式(北京)技术有限公司 | Auxiliary implementation method and device for on-line prediction by using machine learning model |
CN113892939A (en) * | 2021-09-26 | 2022-01-07 | 燕山大学 | Method for monitoring respiratory frequency of human body in resting state based on multi-feature fusion |
Also Published As
Publication number | Publication date |
---|---|
WO2018133596A1 (en) | 2018-07-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106897776A (en) | A kind of continuous type latent structure method based on nominal attribute | |
Bai et al. | Integrating Fuzzy C-Means and TOPSIS for performance evaluation: An application and comparative analysis | |
CN105975916B (en) | Age estimation method based on multi output convolutional neural networks and ordinal regression | |
CN104008203B (en) | A kind of Users' Interests Mining method for incorporating body situation | |
CN113590900A (en) | Sequence recommendation method fusing dynamic knowledge maps | |
CN112463980A (en) | Intelligent plan recommendation method based on knowledge graph | |
CN108829763A (en) | A kind of attribute forecast method of the film review website user based on deep neural network | |
CN112884551B (en) | Commodity recommendation method based on neighbor users and comment information | |
CN104346440A (en) | Neural-network-based cross-media Hash indexing method | |
CN106600052A (en) | User attribute and social network detection system based on space-time locus | |
CN107357793A (en) | Information recommendation method and device | |
CN107391542A (en) | A kind of open source software community expert recommendation method based on document knowledge collection of illustrative plates | |
CN110110372B (en) | Automatic segmentation prediction method for user time sequence behavior | |
CN102591915A (en) | Recommending method based on label migration learning | |
CN113706251B (en) | Model-based commodity recommendation method, device, computer equipment and storage medium | |
CN111582538A (en) | Community value prediction method and system based on graph neural network | |
CN112801425B (en) | Method and device for determining information click rate, computer equipment and storage medium | |
CN114971784B (en) | Session recommendation method and system based on graph neural network by fusing self-attention mechanism | |
CN110263236A (en) | Social network user multi-tag classification method based on dynamic multi-view learning model | |
CN110516165A (en) | A kind of cross-cutting recommended method of hybrid neural networks based on text UGC | |
Huynh et al. | Joint age estimation and gender classification of Asian faces using wide ResNet | |
CN103440651A (en) | Multi-label image annotation result fusion method based on rank minimization | |
CN114723535A (en) | Supply chain and knowledge graph-based item recommendation method, equipment and medium | |
CN112487305B (en) | GCN-based dynamic social user alignment method | |
CN117573972A (en) | Interest tag learning method based on long-short-term behaviors |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20170627 |