CN106897776A - Continuous feature construction method based on nominal attributes - Google Patents

Continuous feature construction method based on nominal attributes

Info

Publication number
CN106897776A
CN106897776A CN201710034428.1A CN201710034428A CN106897776A CN 106897776 A CN106897776 A CN 106897776A CN 201710034428 A CN201710034428 A CN 201710034428A CN 106897776 A CN106897776 A CN 106897776A
Authority
CN
China
Prior art keywords
feature
field
feature construction
index
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710034428.1A
Other languages
Chinese (zh)
Inventor
董守斌
马雅从
张晶
胡金龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201710034428.1A priority Critical patent/CN106897776A/en
Publication of CN106897776A publication Critical patent/CN106897776A/en
Priority to PCT/CN2017/116131 priority patent/WO2018133596A1/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a continuous feature construction method based on nominal attributes, comprising the steps of: 1) data preprocessing; 2) setting up a feature construction framework according to business background knowledge; 3) generating concrete feature construction paths; 4) constructing the corresponding features according to the feature construction paths and producing a training set; 5) performing feature selection on the training set and building a prediction model; 6) saving the relevant data sets and the prediction model and ending the offline training process; 7) preprocessing and extracting features from the sample data to be predicted online; 8) predicting the samples with the prediction model obtained by offline training. The present invention applies not only to "user-item" scenarios but also to more general classification and regression prediction problems with nominal-attribute or categorical-variable features. Compared with traditional One-Hot and Dummy encoding, the features produced by the invention make the differences between samples more pronounced, and the generated features are more interpretable.

Description

Continuous feature construction method based on nominal attributes
Technical field
The present invention relates to the field of feature engineering in machine learning, and in particular to a continuous feature construction method based on nominal attributes.
Background art
With the arrival of the big data era and the rise of the Internet, various machine learning algorithms are used to mine the commercially valuable information contained in data. Feature engineering is a key step in a machine learning system and determines the upper bound of the system's accuracy, and feature construction is an important component of feature engineering. At present, feature construction is mostly rule-based manual extraction and depends heavily on the engineer's understanding of the business background, so it is difficult to extract a reasonably complete set of features in a short time. In particular, nominal-attribute or categorical-variable features, such as a color feature taking the values "yellow, red, blue", are usually converted into sparse vectors of equal length with equal distances between them, constructed by One-Hot encoding or Dummy encoding. Although each dimension of such an encoding indicates whether a certain nominal attribute or categorical value occurs and therefore carries some physical meaning, the distance between different samples under this representation defaults to the same fixed value, which may contradict reality; moreover, when a nominal attribute takes too many distinct values, this encoding leads to an excessively high feature dimension.
The present invention provides a continuous feature construction method based on nominal attributes. The method realizes semi-automatic feature construction; compared with the currently more common One-Hot encoding of nominal attributes, it makes the differences between the generated samples more pronounced. It also has strong scalability and can be accelerated by parallel computing, allowing the machine learning engineer to focus on the combination of nominal-attribute features without worrying about the concrete construction process. The features constructed by the method are usually linear features with clear physical meaning and strong interpretability; with a specific feature selection procedure and a simple linear prediction model, good prediction results can already be achieved, which makes the method particularly suitable for building machine learning systems in commercial applications.
Summary of the invention
The object of the present invention is to overcome the shortcomings and deficiencies of the prior art and to provide a continuous feature construction method based on nominal attributes. The method is divided into two parts, offline training and online prediction. It applies not only to "user-item" scenarios but also to more general classification and regression prediction problems with nominal-attribute or categorical-variable features. Compared with traditional One-Hot and Dummy encoding, the features produced by the construction method of the invention make the differences between samples more pronounced, the generated features are more interpretable, and problems such as overfitting caused by high-dimensional sparse features can be alleviated to a certain extent.
To achieve the above object, the technical solution provided by the present invention is a continuous feature construction method based on nominal attributes, comprising the following steps:
1) data preprocessing, including data table integration, data representation format and missing value handling;
2) setting up a feature construction framework according to business background knowledge;
3) generating concrete feature construction paths;
4) constructing the corresponding features according to the feature construction paths and producing a training set;
5) performing feature selection on the training set and building a prediction model;
6) saving the relevant data sets and the prediction model and ending the offline training process;
7) preprocessing and extracting features from the sample data to be predicted online;
8) predicting the samples with the prediction model obtained by offline training.
In step 1), data table integration means integrating the existing data tables so that all fields in the data set are placed in the same table. Data representation format means converting the current nominal attribute fields into new nominal attribute fields where necessary; the concrete representation is determined by the application scenario and the prediction requirements. Missing value handling covers two cases, field removal and missing value filling: fields with severe missing data are removed; for a field whose missing data is not severe, a nominal attribute field is filled by assigning the missing part a new attribute value or by using the KNN algorithm, while a continuous field is filled with the mean or another filling method.
In step 2), the feature construction framework of the current prediction or classification problem is determined with reference to business background knowledge, comprising the following steps:
2.1) Determine the trunks of the feature construction framework and all host nodes on the trunks. For the "user-item" application scenario, the trunks are divided into three kinds: "user-item-index-calculation", "user-user index-calculation" and "item-item index-calculation". A host node is a node on a trunk; there are six kinds of host nodes: "user", "item", "index", "calculation", "item index" and "user index". For the general application scenario with nominal attributes or categorical variables, there is only one trunk, "window-index-calculation", and the corresponding host nodes are only "window", "index" and "calculation".
2.2) Determine the leaf nodes under the host nodes. Each host node contains at least one leaf node, and each leaf node stores the name of a nominal attribute field in the data table. For the "user-item" application scenario, the leaf nodes under the host node "user" generally represent characteristics and attributes of the user and divide the users into multiple different categories, while the leaf nodes under the host node "item" represent characteristics and attributes of the item. The leaf nodes under "index" represent the degree of matching between user and item, such as the similarity between the user description and the item description, or whether a certain user and item occur in the same sample; the leaf nodes under "user index" represent an index of the user alone, such as the user's age or account balance; and the leaf nodes under "item index" represent an index of the item itself, such as its price. For the general application scenario with nominal attributes or categorical variables, the leaf nodes under "index" usually store the names of the continuous feature fields other than the nominal attributes, and the leaf nodes under "window" store all nominal attribute field names. The leaf nodes under "calculation" are the statistics set according to the current prediction requirements or business background knowledge, such as sum, mean, standard deviation, median and mode.
In step 3), concrete feature construction paths are generated according to the feature construction framework determined in step 2). For the "user-item" application scenario, a feature construction path consists of one selected leaf node for each host node on a trunk, and all possible feature construction paths are traversed in the order "trunk-host node-leaf node". For the general application scenario with nominal attributes, the following steps are needed:
3.1) Determine the window size, i.e. how many leaf nodes a window contains;
3.2) Set the leaf node combination rule: combine the leaf nodes under the "window" host node and traverse all leaf node combinations satisfying the window size and the combination rule;
3.3) For every leaf node combination under the window, combine it with a different leaf node under the host node "index" and a different leaf node under "calculation", finally forming all possible feature construction paths.
In step 4), the corresponding features are constructed according to the feature construction paths, comprising the following steps:
4.1) Determine all nominal attribute fields contained in the current feature construction path. For the "user-item" case, the nominal attribute fields of the current path are determined by the leaf nodes selected under the host nodes "user" and "item" of the current path; for the general application scenario with nominal attributes, they are determined by the nominal attribute fields contained in the leaf node combination under the "window" host node.
4.2) Let the set of nominal attribute fields determined in step 4.1) be C = {A, B, ...}, where A and B denote nominal attribute field names. In the "user-item" case the size of the set is 1 or 2; for the general application scenario with nominal attributes the set size is at least 1. The feature finally produced by each path is given by:
F_{Cyf,i} = f(Y_i)
where y denotes the field stored in the leaf node under the "user index", "item index" or "index" host node of the current path, f denotes the custom calculation, and Cyf determines the composition of the path. F_{Cyf,i} denotes the feature value of the i-th sample on path Cyf, and Y_i denotes the set of values taken by the index field over all samples in the sample index set S_{Cyf,i} of the i-th sample, i.e.:
Y_i = \{ y_j \mid j \in S_{Cyf,i} \}
S_{Cyf,i} is defined by the following expression:

S_{Cyf,i} = \{ j \mid j \in S,\ C_j = C_i \}, \quad C_j = C_i \Leftrightarrow A_j = A_i \wedge B_j = B_i \wedge \cdots

where S denotes the index set of all samples, C_i denotes the set of values taken in the i-th sample by all nominal attribute fields of set C, and C_j = C_i means that the values of all nominal attribute fields of set C are identical in the i-th and j-th samples.
If the calculation f is defined as the sum (sum), the average (average) or the standard deviation (std), the features produced under these three calculations are respectively given by:

F_{Cy,sum,i} = \mathrm{sum}(Y_i) = \sum_{j \in S_{Cyf,i}} y_j

F_{Cy,average,i} = \mathrm{average}(Y_i) = \frac{\sum_{j \in S_{Cy,average,i}} y_j}{\sum_{j=1}^{n} w_j}, \quad w_j = \begin{cases} 1, & j \in S_{Cy,average,i} \\ 0, & j \notin S_{Cy,average,i} \end{cases}

F_{Cy,std,i} = \mathrm{std}(Y_i) = \sqrt{\frac{\sum_{j \in S_{Cy,std,i}} (y_j - F_{Cy,average,i})^2}{\sum_{j=1}^{n} w_j}}
4.3) Carry out feature construction for every path according to step 4.2). After the feature construction paths of all samples have produced their features, place them in the same table as the training set, in which the samples are rows, the fields are the features, and each field is named after the path that constructs the feature.
In step 5), the feature subset with the best prediction accuracy is selected from all features by a feature selection algorithm and a prediction model is built.
In step 6), the relevant data sets refer to the training set after feature selection and the data of all nominal attribute fields involved in building the training set; these data sets will be used to generate the features of the online samples. The field names of the features in the training set remain named after the feature construction paths, and all fields of the two data sets are placed in the same table. The saved prediction model will be used for predicting the online samples.
In step 7), the sample data to be predicted online are preprocessed and their features are extracted, comprising the following steps:
7.1) Preprocess the sample data to be predicted online, corresponding to the preprocessing steps of offline training: the fields removed during offline training because of severe missing data are also removed from the current sample data; for other fields that were not removed during offline training but have missing data in the current sample data, the KNN algorithm or mean filling is used.
7.2) Extract features from the sample data to be predicted online; this process again corresponds to the offline training process. First read each feature construction path, i.e. each feature field name, from the data table obtained in step 6); then, according to the nominal attribute fields corresponding to the path, copy the feature value of the current path from the training set samples whose nominal attribute field values are identical to those of the sample to be predicted into the sample to be predicted.
In step 8), the prediction model obtained by offline training is used to predict the sample to be predicted after the feature extraction of step 7).
Compared with the prior art, the present invention has the following advantages and beneficial effects:
1. Many prediction or recommendation problems achieve good classification or prediction results with popularity-based features; the invention provides a method that can enumerate all popularity-based features at different granularities.
2. The feature construction method of the invention has good extensibility. In the "user-item" method, users can create custom leaf nodes based on the business background, so that the method automatically constructs features that better fit reality; the general feature construction method is free from the "user-item" limitation, and feature construction for nominal attributes only requires setting the window size.
3. In the feature construction method of the invention, the feature construction paths are implemented independently of one another, which makes the method suitable for parallelization.
4. The features constructed by the "user-item" method of the invention are highly interpretable and have clear practical meaning; for example, "the total number of clicks of the current user on all advertisements" often reflects how likely the user is to click on advertisements.
5. The features produced by the feature construction method of the invention are usually linear features; feature selection can be performed simply with the Pearson correlation coefficient, and good classification or prediction results can already be obtained with a relatively simple linear model.
6. Compared with the currently more common One-Hot encoding of nominal attributes, the generated features make the differences between samples more pronounced: under One-Hot encoding the distance between the feature vectors of different attribute values is usually a fixed value, whereas the features constructed by the method of the invention scale the distance between different attribute values up or down through the fields of the index nodes.
7. During online prediction the features do not need to be reconstructed; they only need to be extracted directly from the offline training data, which avoids the excessive time overhead that high algorithmic complexity would cause during online prediction.
Brief description of the drawings
Fig. 1 shows the feature construction method of the invention and the corresponding overall machine learning system.
Fig. 2 shows the general framework of the feature construction method for the "user-item" application scenario.
Fig. 3 shows the feature construction framework for the general application scenario containing nominal attribute fields.
Detailed description of the embodiments
The present invention is further described below with reference to specific embodiments.
As shown in Fig. 1, the continuous feature construction method based on nominal attributes described in this embodiment is an important part of the overall machine learning system: it is responsible for producing all the features needed to train the model and determines the upper bound of the accuracy of the whole prediction model. The method is divided into offline training and online prediction: the features are constructed offline, and the features of the samples to be predicted online are produced from the existing training set without recalculation. It specifically comprises the following steps:
1) Data preprocessing, including data table integration, data representation format, missing value handling, etc. Data table integration means integrating the existing data tables so that all fields in the data set are placed in the same table. Data representation format means converting the current nominal attribute fields into new nominal attribute fields; the concrete representation is determined by the application scenario and the prediction requirements. Missing value handling covers field removal and missing value filling: fields with severe missing data are removed; for a field whose missing data is not severe, a nominal attribute field is filled by assigning the missing part a new attribute value or by using the KNN algorithm, while a continuous field is filled with the mean or another filling method.
Table 1. Data set representation
ID | User ID | User property A | User property B | Item ID | Item attribute C | Whether occurred | Similarity
1  | 1       | 1               | 2               | 1       | 2                | 0                | 0.25
2  | 1       | 2               | 1               | 2       | 2                | 1                | 0.45
3  | 2       | 2               | 2               | 3       | 1                | 1                | 0.80
The actual effect is shown in Table 1: all the nominal attributes and the related index fields of the current data set are stored in the same data table and preprocessed, where the field "ID" represents the sample label. A minimal preprocessing sketch follows.
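The following sketch illustrates this preprocessing step in pandas (an assumption about the implementation; the patent does not prescribe a library, and the threshold and placeholder value are hypothetical):

import pandas as pd

def preprocess(tables, missing_threshold=0.5):
    # Data table integration: join all source tables into one wide table on their shared key columns.
    data = tables[0]
    for t in tables[1:]:
        data = data.merge(t, how="left")
    # Remove fields whose proportion of missing values is too high.
    data = data[[c for c in data.columns if data[c].isna().mean() <= missing_threshold]]
    # Fill the remaining gaps: a nominal field receives a new attribute value, a continuous field the mean.
    for c in data.columns:
        if data[c].dtype == object:
            data[c] = data[c].fillna("__MISSING__")
        else:
            data[c] = data[c].fillna(data[c].mean())
    return data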
2) Set up the feature construction framework according to business background knowledge, comprising the following steps:
2.1) Determine the trunks of the feature construction framework and all host nodes on the trunks. For the "user-item" application scenario, the trunks are divided into three kinds: "user-item-index-calculation", "user-user index-calculation" and "item-item index-calculation". A host node is a node on a trunk; there are six kinds of host nodes: "user", "item", "index", "calculation", "item index" and "user index". For the general application scenario with nominal attributes or categorical variables, there is only one trunk, "window-index-calculation", and the corresponding host nodes are only "window", "index" and "calculation".
2.2) Determine the leaf nodes under the host nodes. Each host node contains at least one leaf node, and each leaf node stores the name of a nominal attribute field in the data table. For the "user-item" application scenario, the leaf nodes under the host node "user" generally represent characteristics and attributes of the user and divide the users into multiple different categories, while the leaf nodes under the host node "item" represent characteristics and attributes of the item. The leaf nodes under "index" represent the degree of matching between user and item, such as the similarity between the user description and the item description, or whether a certain user and item occur in the same sample; the leaf nodes under "user index" represent an index of the user alone, such as the user's age or account balance; and the leaf nodes under "item index" represent an index of the item itself, such as its price. For the general application scenario with nominal attributes or categorical variables, the leaf nodes under "index" usually store the names of the continuous feature fields other than the nominal attributes, and the leaf nodes under "window" store all nominal attribute field names. The leaf nodes under "calculation" are the statistics set according to the current prediction requirements or business background knowledge, such as sum, mean, standard deviation, median and mode.
Each node of the method is set up from experience and the corresponding field name is stored in the node; the structure of the whole method is described by a JSON file, which, together with the data set preprocessed in step 1), serves as the input of the feature construction process. For the "user-item" application scenario, the JSON file corresponds to Table 1 and the more general frame structure is shown in Fig. 2; only three user nominal attributes ("User ID", "User property A", "User property B") and two item attributes ("Item ID", "Item attribute C") are shown here, corresponding to Table 1. In actual use, different numbers of leaf nodes are set according to the number of fields.
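The JSON content itself is not reproduced in this text; the following is a hypothetical sketch of the structure it encodes, written here as a Python dictionary with node names taken from Table 1 (the exact keys and layout of the real file may differ):

framework = {
    "trunks": [
        ["user", "item", "index", "calculation"],
        ["user", "user index", "calculation"],
        ["item", "item index", "calculation"],
    ],
    "leaf_nodes": {
        "user": ["User ID", "User property A", "User property B"],
        "item": ["Item ID", "Item attribute C"],
        "index": ["Whether occurred", "Similarity"],
        "user index": [],   # e.g. user age, account balance
        "item index": [],   # e.g. item price
        "calculation": ["sum", "mean", "std"],
    },
}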
For the more general application scenario containing nominal attributes, the two host nodes "user" and "item" are not distinguished; a single "window" host node is used instead, and all nominal attribute fields are placed under the window. Its frame structure is shown in Fig. 3.
3) Generate the concrete feature construction paths. For the "user-item" application scenario, path combinations are formed from the leaf nodes in the JSON file: first a trunk is selected, then one leaf node on each of its host nodes; for example, "User ID"-"Item attribute C"-"Similarity"-"mean" represents the average similarity between the current user and the items sharing the current value of item attribute C. In the actual implementation, only the leaf nodes other than the calculation node are combined into paths here, because the calculation node mainly performs floating-point computation while the other nodes mainly perform query matching and set intersection. After all possible paths have been combined, unnecessary or unreachable paths can be removed in advance: a feature such as "the total popularity of all users over all items" takes the same value in every sample, while a feature such as "whether the current user occurs with the current item" is exactly the target to be predicted or classified and therefore belongs to the unreachable paths.
For the more general application scenario containing nominal attributes, the window size must be determined in advance, i.e. how many leaf nodes a window contains. The leaf node combination rule is then set: the leaf nodes under the "window" host node are combined, and all leaf node combinations satisfying the window size and the combination rule are traversed; every leaf node combination under the window is then combined with a different leaf node under the host node "index" and a different leaf node under "calculation", finally forming all possible feature construction paths, as in the sketch below.
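A minimal enumeration sketch for this general scenario (the field lists and window size are illustrative assumptions):

from itertools import combinations, product

def build_paths(nominal_fields, index_fields, calculations, window_size):
    # Pair every leaf-node combination inside the window with every index field
    # and every calculation, giving one feature construction path per triple.
    paths = []
    for window in combinations(nominal_fields, window_size):
        for index_field, calc in product(index_fields, calculations):
            paths.append({"window": window, "index": index_field, "calc": calc})
    return paths

# Example: 3 nominal fields with window size 2, 1 index field and 2 calculations
# give C(3,2) * 1 * 2 = 6 construction paths.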
4) Construct the corresponding features according to the feature construction paths and produce the training set, comprising the following steps:
4.1) Determine all nominal attribute fields contained in the current feature construction path. For the "user-item" case, the nominal attribute fields of the current path are determined by the leaf nodes selected under the host nodes "user" and "item" of the current path; for the general application scenario with nominal attributes, they are determined by the nominal attribute fields contained in the leaf node combination under the "window" host node.
4.2) Let the set of nominal attribute fields determined in step 4.1) be C = {A, B, ...}, where A and B denote nominal attribute field names. In the "user-item" case the size of the set is 1 or 2; for the general application scenario with nominal attributes the set size is at least 1. The feature finally produced by each path is given by:
F_{Cyf,i} = f(Y_i)
where y denotes the field stored in the leaf node under the "user index", "item index" or "index" host node of the current path, f denotes the custom calculation, and Cyf determines the composition of the path. F_{Cyf,i} denotes the feature value of the i-th sample on path Cyf, and Y_i denotes the set of values taken by the index field over all samples in the sample index set S_{Cyf,i} of the i-th sample, i.e.:
Y_i = \{ y_j \mid j \in S_{Cyf,i} \}
S_{Cyf,i} is defined by the following expression:

S_{Cyf,i} = \{ j \mid j \in S,\ C_j = C_i \}, \quad C_j = C_i \Leftrightarrow A_j = A_i \wedge B_j = B_i \wedge \cdots

where S denotes the index set of all samples, C_i denotes the set of values taken in the i-th sample by all nominal attribute fields of set C, and C_j = C_i means that the values of all nominal attribute fields of set C are identical in the i-th and j-th samples.
If the calculation f is defined as the sum (sum), the average (average) or the standard deviation (std), the features produced under these three calculations are respectively given by:

F_{Cy,sum,i} = \mathrm{sum}(Y_i) = \sum_{j \in S_{Cyf,i}} y_j

F_{Cy,average,i} = \mathrm{average}(Y_i) = \frac{\sum_{j \in S_{Cy,average,i}} y_j}{\sum_{j=1}^{n} w_j}, \quad w_j = \begin{cases} 1, & j \in S_{Cy,average,i} \\ 0, & j \notin S_{Cy,average,i} \end{cases}

F_{Cy,std,i} = \mathrm{std}(Y_i) = \sqrt{\frac{\sum_{j \in S_{Cy,std,i}} (y_j - F_{Cy,average,i})^2}{\sum_{j=1}^{n} w_j}}
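As a worked example with the data of Table 1, take the path with C = {User ID, Item attribute C}, y = Similarity and f = average. For sample 1 (User ID = 1, Item attribute C = 2) the matching index set is S_{Cyf,1} = {1, 2}, because sample 2 takes the same values of both nominal fields, so the feature value is (0.25 + 0.45) / 2 = 0.35; for sample 3 (User ID = 2, Item attribute C = 1) only sample 3 itself matches, so the feature value is 0.80.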
In a practical implementation, the computation of the above formulas, i.e. the computation of the features, is completed with data query statements: the aggregation functions are first determined from the leaf nodes of the calculation node, and the construction of the features is then realized with a GROUP BY operation (taking the "user-item" case as an example, the concrete statement is shown in Table 2 below).
Each execution of a GROUP BY operation produces features of several dimensions (how many depends on the number of leaf nodes of the calculation node). For the "user-item" application scenario, the field name of each feature dimension has the form user_field@item_field@indication_field@std_dev or user_field@item_field@indication_field@mean, where user_field, item_field and indication_field are string variables holding the field names of the leaf nodes under the user, item and index host nodes. For the more general application scenario, the form attributes1@attributes2@...@indication_field@operation is used. Since every query in this step is independent, the step can easily be parallelized.
Table 2. Feature construction operation
Line | SQL
1    | SELECT user_field, item_field,
2    |   STD(indication_field) AS user_field@item_field@std_dev,
3    |   MEAN(indication_field) AS user_field@item_field@mean
4    | FROM Table 1
5    | GROUP BY user_field, item_field
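The same construction can also be written without SQL; the following is a sketch of an equivalent pandas aggregation (an assumption about the implementation; the column names are placeholders for the fields selected by the current path, and the output names follow the naming convention described above):

import pandas as pd

def construct_path_features(df, user_field, item_field, indication_field):
    # Group by the nominal fields of the path and aggregate the index field,
    # mirroring the GROUP BY statement of Table 2.
    stats = (df.groupby([user_field, item_field])[indication_field]
               .agg(["std", "mean"])
               .reset_index())
    stats.columns = [
        user_field,
        item_field,
        f"{user_field}@{item_field}@{indication_field}@std_dev",
        f"{user_field}@{item_field}@{indication_field}@mean",
    ]
    # Attach the grouped statistics back to every sample that shares the same nominal values.
    return df.merge(stats, on=[user_field, item_field], how="left")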
4.3) Carry out feature construction for every path according to step 4.2). After the feature construction paths of all samples have produced their features, place them in the same table as the training set, in which the samples are rows, the fields are the features, and each field is named after the path that constructs the feature.
5) Perform feature selection on the training set and build the prediction model. Feature selection is performed on the training set using, for example, the Pearson correlation coefficient: for every feature in the training set, the correlation coefficient with the target is computed; if it exceeds a specified threshold the feature is retained, otherwise it is removed. After this step yields a feature subset, the pairwise correlation coefficients between the remaining features are computed, and a subset with weak pairwise correlation is picked out of it as the final feature set; finally, a prediction model with good accuracy is chosen and trained. A minimal sketch of this selection follows.
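A minimal sketch of the two-stage selection described above (the thresholds and the target column name are illustrative assumptions, not values fixed by the invention):

def select_features(train, target_col, target_threshold=0.1, pairwise_threshold=0.9):
    # `train` is assumed to be a pandas DataFrame holding the constructed features and the target.
    features = [c for c in train.columns if c != target_col]
    # Stage 1: keep features whose Pearson correlation with the target exceeds the threshold.
    kept = [c for c in features
            if abs(train[c].corr(train[target_col])) > target_threshold]
    # Stage 2: drop any feature that is strongly correlated with a feature already selected.
    selected = []
    for c in kept:
        if all(abs(train[c].corr(train[s])) < pairwise_threshold for s in selected):
            selected.append(c)
    return selected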
6) Before online prediction, the necessary data of the offline training process must first be saved for use during online prediction, including:
6.1) Save the features from step 5) of offline training, putting them in the same table as the fields of step 1) other than the index fields, as shown in Table 3 below; the actual field name of Feature 1 is a string of the form user_field@item_field@indication_field@mean;
6.2) Save the model obtained in step 5) and its relevant parameters.
Table 3. Training sample information
User ID | User property A | User property B | Item ID | Item attribute C | Feature 1 | Feature 2
1       | 1               | 2               | 1       | 2                | 0         | 0.25
1       | 2               | 1               | 2       | 2                | 1         | 0.45
2       | 2               | 2               | 3       | 1                | 1         | 0.80
7) Preprocess and extract features from the sample data to be predicted online: take out the feature field names of Table 3 to obtain all path combinations that produce features; for each path, deduplicate Table 3 on the corresponding nominal attribute fields and then left-join the result with the table of samples to be predicted, thereby obtaining the features of the current path, as in the sketch below.
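A sketch of this extraction (the tables and the path-to-fields mapping are assumptions about how the data saved in step 6) might be organized):

def extract_online_features(samples, saved_table, paths):
    # `paths` maps each saved feature field name, e.g.
    # "User ID@Item attribute C@Similarity@mean", to the list of nominal fields of its path.
    for feature_name, nominal_fields in paths.items():
        lookup = saved_table[nominal_fields + [feature_name]].drop_duplicates(subset=nominal_fields)
        # Left join: samples with matching nominal values receive the stored feature value.
        samples = samples.merge(lookup, on=nominal_fields, how="left")
    return samples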
8) Predict the samples obtained after the feature extraction of step 7) with the prediction model obtained by offline training.
In summary, with the above solution the present invention provides a new method for constructing features from nominal attributes. It applies not only to "user-item" scenarios but also to more general classification and regression prediction problems with nominal-attribute or categorical-variable features. Compared with traditional One-Hot and Dummy encoding, the features produced by the construction method of the invention make the differences between samples more pronounced, and the generated features are more interpretable; the method therefore has practical value and is worth popularizing.
The embodiment described above is only a preferred embodiment of the invention and does not limit the scope of practice of the invention; all changes made according to the shape and principle of the invention shall therefore be covered within the scope of protection of the invention.

Claims (7)

1. A continuous feature construction method based on nominal attributes, characterized by comprising the following steps:
1) data preprocessing, including data table integration, data representation format and missing value handling;
2) setting up a feature construction framework according to business background knowledge;
3) generating concrete feature construction paths;
4) constructing the corresponding features according to the feature construction paths and producing a training set;
5) performing feature selection on the training set and building a prediction model;
6) saving the relevant data sets and the prediction model and ending the offline training process;
7) preprocessing and extracting features from the sample data to be predicted online;
8) predicting the samples with the prediction model obtained by offline training.
2. The continuous feature construction method based on nominal attributes according to claim 1, characterized in that in step 1), data table integration means integrating the existing data tables so that all fields in the data set are placed in the same table; data representation format means converting the current nominal attribute fields into new nominal attribute fields, the concrete representation being determined by the application scenario and the prediction requirements; and missing value handling covers two cases, field removal and missing value filling: fields with severe missing data are removed, and for a field whose missing data is not severe, a nominal attribute field is filled by assigning the missing part a new attribute value or by using the KNN algorithm, while a continuous field is filled with the mean or another filling method.
3. The continuous feature construction method based on nominal attributes according to claim 1, characterized in that in step 2), the feature construction framework of the current prediction or classification problem is determined with reference to business background knowledge, comprising the following steps:
2.1) Determine the trunks of the feature construction framework and all host nodes on the trunks
For the "user-item" application scenario, the trunks are divided into three kinds: "user-item-index-calculation", "user-user index-calculation" and "item-item index-calculation"; a host node is a node on a trunk, of which there are six kinds: "user", "item", "index", "calculation", "item index" and "user index"; for the application scenario with nominal attributes or categorical variables, there is only one trunk, "window-index-calculation", and the corresponding host nodes are only "window", "index" and "calculation";
2.2) Determine the leaf nodes under the host nodes
Each host node contains at least one leaf node, and each leaf node stores the name of a nominal attribute field in the data table; for the "user-item" application scenario, the leaf nodes under the host node "user" generally represent characteristics and attributes of the user and divide the users into multiple different categories, the leaf nodes under the host node "item" represent characteristics and attributes of the item, the leaf nodes under "index" represent the degree of matching between user and item, the leaf nodes under "user index" represent an index of the user alone, and the leaf nodes under "item index" represent an index of the item itself; for the application scenario with nominal attributes or categorical variables, the leaf nodes under "index" usually store the names of the continuous feature fields other than the nominal attributes, the leaf nodes under "window" store all nominal attribute field names, and the leaf nodes under "calculation" are the statistics set according to the current prediction requirements or business background knowledge;
In step 3), concrete feature construction paths are generated according to the feature construction framework determined in step 2); for the "user-item" application scenario, a feature construction path consists of one selected leaf node for each host node on a trunk, and all possible feature construction paths are traversed in the order "trunk-host node-leaf node"; for the application scenario with nominal attributes, the following steps are needed:
3.1) Determine the window size, i.e. how many leaf nodes a window contains;
3.2) Set the leaf node combination rule: combine the leaf nodes under the "window" host node and traverse all leaf node combinations satisfying the window size and the combination rule;
3.3) For every leaf node combination under the window, combine it with a different leaf node under the host node "index" and a different leaf node under "calculation", finally forming all possible feature construction paths;
In step 4), the corresponding features are constructed according to the feature construction paths, comprising the following steps:
4.1) Determine all nominal attribute fields contained in the current feature construction path; for the "user-item" case, the nominal attribute fields of the current path are determined by the leaf nodes selected under the host nodes "user" and "item" of the current path, and for the application scenario with nominal attributes, they are determined by the nominal attribute fields contained in the leaf node combination under the "window" host node;
4.2) Let the set of nominal attribute fields determined in step 4.1) be C = {A, B, ...}, where A and B denote nominal attribute field names; in the "user-item" case the size of the set is 1 or 2, and for the application scenario with nominal attributes the set size is at least 1; the feature finally produced by each path is given by:
F_{Cyf,i} = f(Y_i)
where y denotes the field stored in the leaf node under the "user index", "item index" or "index" host node of the current path, f denotes the custom calculation, and Cyf determines the composition of the path; F_{Cyf,i} denotes the feature value of the i-th sample on path Cyf, and Y_i denotes the set of values taken by the index field over all samples in the sample index set S_{Cyf,i} of the i-th sample, i.e.:
Y_i = \{ y_j \mid j \in S_{Cyf,i} \}
S_{Cyf,i} is defined by the following expression:
S_{Cyf,i} = \{ j \mid j \in S,\ C_j = C_i \}, \quad C_j = C_i \Leftrightarrow A_j = A_i \wedge B_j = B_i \wedge \cdots
where S denotes the index set of all samples, C_i denotes the set of values taken in the i-th sample by all nominal attribute fields of set C, and C_j = C_i means that the values of all nominal attribute fields of set C are identical in the i-th and j-th samples;
If the calculation f is defined as the sum (sum), the average (average) or the standard deviation (std), the features produced under these three calculations are respectively given by:
F_{Cy,sum,i} = \mathrm{sum}(Y_i) = \sum_{j \in S_{Cyf,i}} y_j

F_{Cy,average,i} = \mathrm{average}(Y_i) = \frac{\sum_{j \in S_{Cy,average,i}} y_j}{\sum_{j=1}^{n} w_j}, \quad w_j = \begin{cases} 1, & j \in S_{Cy,average,i} \\ 0, & j \notin S_{Cy,average,i} \end{cases}

F_{Cy,std,i} = \mathrm{std}(Y_i) = \sqrt{\frac{\sum_{j \in S_{Cy,std,i}} (y_j - F_{Cy,average,i})^2}{\sum_{j=1}^{n} w_j}}
4.3) Carry out feature construction for every path according to step 4.2); after the feature construction paths of all samples have produced their features, place them in the same table as the training set, in which the samples are rows, the fields are the features, and each field is named after the path that constructs the feature.
4. The continuous feature construction method based on nominal attributes according to claim 1, characterized in that in step 5), the feature subset with the best prediction accuracy is selected from all features by a feature selection algorithm and a prediction model is built.
5. The continuous feature construction method based on nominal attributes according to claim 1, characterized in that in step 6), the relevant data sets refer to the training set after feature selection and the data of all nominal attribute fields involved in building the training set; these data sets will be used to generate the features of the online samples; the field names of the features in the training set remain named after the feature construction paths, all fields of the two data sets are placed in the same table, and the saved prediction model will be used for predicting the online samples.
6. The continuous feature construction method based on nominal attributes according to claim 1, characterized in that in step 7), the sample data to be predicted online are preprocessed and their features are extracted, comprising the following steps:
7.1) Preprocess the sample data to be predicted online, corresponding to the preprocessing steps of offline training: the fields removed during offline training because of severe missing data are also removed from the current sample data, and for other fields that were not removed during offline training but have missing data in the current sample data, the KNN algorithm or mean filling is used;
7.2) Extract features from the sample data to be predicted online; this process again corresponds to the offline training process: first read each feature construction path, i.e. each feature field name, from the data table obtained in step 6), and then, according to the nominal attribute fields corresponding to the path, copy the feature value of the current path from the training set samples whose nominal attribute field values are identical to those of the sample to be predicted into the sample to be predicted.
7. The continuous feature construction method based on nominal attributes according to claim 1, characterized in that in step 8), the prediction model obtained by offline training is used to predict the sample to be predicted after the feature extraction of step 7).
CN201710034428.1A 2017-01-17 2017-01-17 Continuous feature construction method based on nominal attributes Pending CN106897776A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201710034428.1A CN106897776A (en) 2017-01-17 2017-01-17 Continuous feature construction method based on nominal attributes
PCT/CN2017/116131 WO2018133596A1 (en) 2017-01-17 2017-12-14 Continuous feature construction method based on nominal attribute

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710034428.1A CN106897776A (en) 2017-01-17 2017-01-17 Continuous feature construction method based on nominal attributes

Publications (1)

Publication Number Publication Date
CN106897776A true CN106897776A (en) 2017-06-27

Family

ID=59197925

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710034428.1A Pending CN106897776A (en) Continuous feature construction method based on nominal attributes

Country Status (2)

Country Link
CN (1) CN106897776A (en)
WO (1) WO2018133596A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107844560A (en) * 2017-10-30 2018-03-27 北京锐安科技有限公司 A kind of method, apparatus of data access, computer equipment and readable storage medium storing program for executing
WO2018133596A1 (en) * 2017-01-17 2018-07-26 华南理工大学 Continuous feature construction method based on nominal attribute
CN108776673A (en) * 2018-05-23 2018-11-09 哈尔滨工业大学 Automatic switching method, device and the storage medium of relation schema
CN108932647A (en) * 2017-07-24 2018-12-04 上海宏原信息科技有限公司 A kind of method and apparatus for predicting its model of similar article and training
CN109146083A (en) * 2018-08-06 2019-01-04 阿里巴巴集团控股有限公司 Feature coding method and apparatus
CN111651524A (en) * 2020-06-05 2020-09-11 第四范式(北京)技术有限公司 Auxiliary implementation method and device for online prediction by using machine learning model
CN113892939A (en) * 2021-09-26 2022-01-07 燕山大学 Method for monitoring respiratory frequency of human body in resting state based on multi-feature fusion

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101226521A (en) * 2008-02-18 2008-07-23 南京大学 Machine learning method for ambiguity data object estimation modeling
US7792770B1 (en) * 2007-08-24 2010-09-07 Louisiana Tech Research Foundation; A Division Of Louisiana Tech University Foundation, Inc. Method to indentify anomalous data using cascaded K-Means clustering and an ID3 decision tree
CN104063472A (en) * 2014-06-30 2014-09-24 电子科技大学 KNN text classifying method for optimizing training sample set
CN104134017A (en) * 2014-07-18 2014-11-05 华南理工大学 Protein interaction relationship pair extraction method based on compact character representation
CN105550275A (en) * 2015-12-09 2016-05-04 中国科学院重庆绿色智能技术研究院 Microblog forwarding quantity prediction method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7451065B2 (en) * 2002-03-11 2008-11-11 International Business Machines Corporation Method for constructing segmentation-based predictive models
CN106897776A (en) * 2017-01-17 2017-06-27 华南理工大学 Continuous feature construction method based on nominal attributes

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7792770B1 (en) * 2007-08-24 2010-09-07 Louisiana Tech Research Foundation; A Division Of Louisiana Tech University Foundation, Inc. Method to indentify anomalous data using cascaded K-Means clustering and an ID3 decision tree
CN101226521A (en) * 2008-02-18 2008-07-23 南京大学 Machine learning method for ambiguity data object estimation modeling
CN104063472A (en) * 2014-06-30 2014-09-24 电子科技大学 KNN text classifying method for optimizing training sample set
CN104134017A (en) * 2014-07-18 2014-11-05 华南理工大学 Protein interaction relationship pair extraction method based on compact character representation
CN105550275A (en) * 2015-12-09 2016-05-04 中国科学院重庆绿色智能技术研究院 Microblog forwarding quantity prediction method

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018133596A1 (en) * 2017-01-17 2018-07-26 华南理工大学 Continuous feature construction method based on nominal attribute
CN108932647A (en) * 2017-07-24 2018-12-04 上海宏原信息科技有限公司 A kind of method and apparatus for predicting its model of similar article and training
CN107844560A (en) * 2017-10-30 2018-03-27 北京锐安科技有限公司 A kind of method, apparatus of data access, computer equipment and readable storage medium storing program for executing
CN108776673A (en) * 2018-05-23 2018-11-09 哈尔滨工业大学 Automatic switching method, device and the storage medium of relation schema
CN108776673B (en) * 2018-05-23 2020-08-18 哈尔滨工业大学 Automatic conversion method and device of relation mode and storage medium
CN109146083A (en) * 2018-08-06 2019-01-04 阿里巴巴集团控股有限公司 Feature coding method and apparatus
CN109146083B (en) * 2018-08-06 2021-07-23 创新先进技术有限公司 Feature encoding method and apparatus
CN111651524A (en) * 2020-06-05 2020-09-11 第四范式(北京)技术有限公司 Auxiliary implementation method and device for online prediction by using machine learning model
CN111651524B (en) * 2020-06-05 2023-10-03 第四范式(北京)技术有限公司 Auxiliary implementation method and device for on-line prediction by using machine learning model
CN113892939A (en) * 2021-09-26 2022-01-07 燕山大学 Method for monitoring respiratory frequency of human body in resting state based on multi-feature fusion

Also Published As

Publication number Publication date
WO2018133596A1 (en) 2018-07-26

Similar Documents

Publication Publication Date Title
CN106897776A (en) Continuous feature construction method based on nominal attributes
Bai et al. Integrating Fuzzy C-Means and TOPSIS for performance evaluation: An application and comparative analysis
CN105975916B (en) Age estimation method based on multi output convolutional neural networks and ordinal regression
CN104008203B (en) A kind of Users' Interests Mining method for incorporating body situation
CN113590900A (en) Sequence recommendation method fusing dynamic knowledge maps
CN112463980A (en) Intelligent plan recommendation method based on knowledge graph
CN108829763A (en) A kind of attribute forecast method of the film review website user based on deep neural network
CN112884551B (en) Commodity recommendation method based on neighbor users and comment information
CN104346440A (en) Neural-network-based cross-media Hash indexing method
CN106600052A (en) User attribute and social network detection system based on space-time locus
CN107357793A (en) Information recommendation method and device
CN107391542A (en) A kind of open source software community expert recommendation method based on document knowledge collection of illustrative plates
CN110110372B (en) Automatic segmentation prediction method for user time sequence behavior
CN102591915A (en) Recommending method based on label migration learning
CN113706251B (en) Model-based commodity recommendation method, device, computer equipment and storage medium
CN111582538A (en) Community value prediction method and system based on graph neural network
CN112801425B (en) Method and device for determining information click rate, computer equipment and storage medium
CN114971784B (en) Session recommendation method and system based on graph neural network by fusing self-attention mechanism
CN110263236A (en) Social network user multi-tag classification method based on dynamic multi-view learning model
CN110516165A (en) A kind of cross-cutting recommended method of hybrid neural networks based on text UGC
Huynh et al. Joint age estimation and gender classification of Asian faces using wide ResNet
CN103440651A (en) Multi-label image annotation result fusion method based on rank minimization
CN114723535A (en) Supply chain and knowledge graph-based item recommendation method, equipment and medium
CN112487305B (en) GCN-based dynamic social user alignment method
CN117573972A (en) Interest tag learning method based on long-short-term behaviors

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20170627