WO2019015631A1 - Method and system for generating combined features of machine learning samples - Google Patents

Method and system for generating combined features of machine learning samples

Info

Publication number
WO2019015631A1
Authority
WO
WIPO (PCT)
Prior art keywords
binning
feature
features
machine learning
attribute information
Application number
PCT/CN2018/096233
Other languages
English (en)
French (fr)
Inventor
陈雨强
戴文渊
杨强
罗远飞
涂威威
Original Assignee
第四范式(北京)技术有限公司
Application filed by 第四范式(北京)技术有限公司
Publication of WO2019015631A1 publication Critical patent/WO2019015631A1/zh

Classifications

    • G: PHYSICS
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 20/00: Machine learning
        • G06F: ELECTRIC DIGITAL DATA PROCESSING
          • G06F 18/00: Pattern recognition
            • G06F 18/20: Analysing
              • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
                • G06F 18/211: Selection of the most significant subset of features
                • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Definitions

  • the present disclosure relates generally to the field of artificial intelligence and, more particularly, to a method and system for generating combined features of machine learning samples.
  • each data record can be viewed as a description of an event or object, corresponding to an example or sample.
  • each data record includes various items that reflect the performance or nature of the event or object in some respect; these items may be called "attributes".
  • the predictive effect of a machine learning model is related to the choice of model, the available data, and the extraction of features. That is, the prediction effect can be improved by improving the way features are extracted; conversely, inappropriate feature extraction degrades the prediction effect.
  • Exemplary embodiments of the present disclosure are directed to overcoming the deficiencies in the prior art that it is difficult to automatically combine features of machine learning samples.
  • a method of generating combined features of machine learning samples, performed by at least one computing device, comprising:
  • acquiring a data record, wherein the data record includes a plurality of attribute information; performing at least one binning operation for each of at least one continuous feature generated based on the plurality of attribute information to obtain a bin group feature consisting of at least one binning feature, wherein each binning operation corresponds to one binning feature; and generating the combined features of the machine learning samples by performing feature combination among discrete features that include the bin group features and other discrete features generated based on the plurality of attribute information.
  • a system comprising at least one computing device and at least one storage device storing instructions, wherein the instructions, when executed by the at least one computing device, cause the at least one computing device to perform the following steps for generating combined features of machine learning samples:
  • acquiring a data record, wherein the data record includes a plurality of attribute information; performing at least one binning operation for each of at least one continuous feature generated based on the plurality of attribute information to obtain a bin group feature consisting of at least one binning feature, wherein each binning operation corresponds to one binning feature; and generating the combined features of the machine learning samples by performing feature combination among discrete features that include the bin group features and other discrete features generated based on the plurality of attribute information.
  • a computer-readable storage medium storing instructions, wherein the instructions, when executed by at least one computing device, cause the at least one computing device to perform the above-described method of generating combined features of machine learning samples.
  • a system for generating a combined feature of a machine learning sample comprising:
  • a data record acquisition device configured to acquire a data record, wherein the data record includes a plurality of attribute information;
  • a bin group feature generating device configured to perform at least one binning operation for each of at least one continuous feature generated based on the plurality of attribute information to obtain a bin group feature consisting of at least one binning feature, wherein each binning operation corresponds to one binning feature;
  • a feature combining device configured to generate combined features of the machine learning sample by performing feature combination among discrete features that include the bin group features and other discrete features generated based on the plurality of attribute information.
  • one or more binning operations are performed for continuous features, and the obtained bin group features are combined with other features, so that the combined features that make up the machine learning sample are more effective, thereby improving the effect of the machine learning model.
  • FIG. 1 illustrates a block diagram of a system for generating combined features of machine learning samples, in accordance with an exemplary embodiment of the present disclosure
  • FIG. 2 illustrates a block diagram of a training system of a machine learning model in accordance with an exemplary embodiment of the present disclosure
  • FIG. 3 illustrates a block diagram of a prediction system of a machine learning model according to an exemplary embodiment of the present disclosure
  • FIG. 4 illustrates a block diagram of a training and prediction system of a machine learning model, in accordance with an exemplary embodiment of the present disclosure
  • FIG. 5 illustrates a block diagram of a system for generating combined features of machine learning samples, in accordance with another exemplary embodiment of the present disclosure
  • FIG. 6 illustrates a flowchart of a method of generating combined features of machine learning samples, according to an exemplary embodiment of the present disclosure
  • FIG. 7 illustrates an example of a search strategy for generating a combined feature, according to an exemplary embodiment of the present disclosure
  • FIG. 8 illustrates a flowchart of a training method of a machine learning model according to an exemplary embodiment of the present disclosure
  • FIG. 9 illustrates a flowchart of a prediction method of a machine learning model according to an exemplary embodiment of the present disclosure
  • FIG. 10 illustrates a flow chart of a method of generating combined features of machine learning samples, in accordance with another exemplary embodiment of the present disclosure.
  • automatic feature combination is performed by carrying out at least one binning operation on each of at least one continuous feature to generate one or more binning features corresponding to each single continuous feature; combining the bin group features composed of these binning features with other discrete features (e.g., single discrete features and/or other bin group features) may make the generated machine learning samples more suitable for machine learning, so that better prediction results can be obtained.
  • machine learning is an inevitable outcome of the development of artificial intelligence research to a certain stage. It is dedicated to improving the performance of the system itself through computational means and experience.
  • experience usually exists in the form of “data.”
  • machine learning algorithms can generate "models" from data: empirical data are provided to a machine learning algorithm, which produces a model based on these data; when facing a new situation, the model provides a corresponding judgment, that is, a prediction result. Whether training a machine learning model or using a trained machine learning model for prediction, the data must be transformed into machine learning samples that include various features.
  • machine learning may be implemented in the form of "supervised learning", "unsupervised learning", or "semi-supervised learning". It should be noted that the exemplary embodiments of the present disclosure do not limit the specific machine learning algorithm. In addition, it should be noted that other means, such as statistical algorithms, can be combined in the process of training and applying the model.
  • FIG. 1 illustrates a block diagram of a system for generating combined features of machine learning samples, in accordance with an exemplary embodiment of the present disclosure.
  • the system performs at least one binning operation on each continuous feature to be combined, so that a single continuous feature can be converted into a bin group feature composed of at least one binning feature; further, the bin group feature is combined with other discrete features, so that the original data record can be characterized from different angles and at different scales/levels simultaneously.
  • combined features of machine learning samples can be automatically generated, and corresponding machine learning samples can help improve machine learning effects (eg, model stability, model generalization, etc.).
  • the data record obtaining apparatus 100 is configured to acquire a data record, wherein the data record includes a plurality of attribute information.
  • the above data record may be data generated online, data generated in advance and stored, or data received from the outside through an input device or a transmission medium.
  • This data can relate to attribute information of individuals, businesses, or organizations, such as identity, education, occupation, assets, contact information, liabilities, income, profit, taxation, and more.
  • the data may also relate to attribute information of the business related item, for example, information about the transaction amount of the sales contract, the parties to the transaction, the subject matter, the place of the transaction, and the like.
  • the attribute information mentioned in the exemplary embodiments of the present disclosure may relate to the performance or properties of any object or affair in some aspect, and is not limited to defining or describing individuals, objects, organizations, units, institutions, projects, events, and the like.
  • the data record acquisition device 100 can acquire structured or unstructured data from different sources, such as text data or numerical data, and the like.
  • the acquired data records can be used to form machine learning samples and participate in the training/prediction process of machine learning.
  • these data may come from within the entity that expects to obtain the model prediction results, for example, from a bank, enterprise, or school that expects to obtain predictions; such data may also come from outside these entities, for example, from data providers, the Internet (for example, social networking sites), mobile operators, APP operators, courier companies, credit agencies, and so on.
  • the above internal data and external data may be used in combination to form a machine learning sample carrying more information.
  • the above data may be input to the data record acquisition device 100 through an input device, may be automatically generated by the data record acquisition device 100 based on existing data, or may be obtained by the data record acquisition device 100 from a network (for example, a storage medium on the network, such as a data warehouse); in addition, an intermediate data exchange device such as a server can help the data record acquisition device 100 acquire corresponding data from an external data source.
  • the acquired data can be converted into a format that is easy to process by a data conversion module such as a text analysis module in the data record acquisition device 100.
  • the data record acquisition device 100 can be configured as individual modules comprised of software, hardware, and/or firmware, some or all of which can be integrated or co-operated to accomplish a particular function.
  • the bin group feature generating device 200 is configured to perform at least one binning operation for each of the at least one continuous feature generated based on the plurality of attribute information to obtain a bin group feature composed of at least one binning feature, wherein each binning operation corresponds to one binning feature.
  • for at least a part of the attribute information of the data record, a corresponding continuous feature may be generated.
  • a continuous feature is the opposite of a discrete feature (for example, a categorical feature); its value can be any value having a certain continuity, for example, distance, age, or amount.
  • in contrast, the value of a discrete feature does not have continuity.
  • for example, a discrete feature may be an unordered categorical feature such as "from Beijing", "from Shanghai", "from Tianjin", "gender is male", or "gender is female". It can be seen that, from the plurality of attribute information of the data record as a whole, at least one continuous feature can be generated accordingly.
  • the exemplary embodiments of the present disclosure do not limit the specific manner in which each continuous feature is generated (for example, from which attribute information field or fields).
  • as an example, the bin group feature generating device 200 may directly use a continuous-valued attribute in the data record as the corresponding continuous feature in the machine learning sample; for example, distance, age, and amount may each be used directly as a continuous feature. That is, each continuous feature may be formed by the continuous-valued attribute information itself among the plurality of attribute information.
  • as another example, the bin group feature generating device 200 may process certain attribute information in the data record (for example, continuous-valued attribute information and/or discrete-valued attribute information) to obtain a corresponding continuous feature; for example, the ratio of height to weight may be taken as a corresponding continuous feature.
  • the continuous feature may be formed by continuously transforming discrete value attribute information among the plurality of attribute information.
  • the continuous transformation may indicate that the value of the discrete value attribute information is counted.
  • the continuous feature may indicate statistical information of certain discrete value attribute information regarding the predicted target of the machine learning model.
  • for example, the discrete-valued attribute information of a seller merchant ID may be transformed into a continuous statistical feature, such as the probability of historical purchase behavior associated with the corresponding seller merchant ID.
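  • the patent does not fix a particular statistic for this continuous transform; the following is a minimal sketch in Python, assuming a target-mean statistic over historical binary labels, with all names illustrative:

```python
from collections import defaultdict

def target_mean_encode(categories, labels):
    """Continuous transform of a discrete attribute: replace each discrete
    value (e.g., a seller merchant ID) with the historical mean of a binary
    label observed for that value, yielding a continuous statistical feature."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, y in zip(categories, labels):
        sums[c] += y
        counts[c] += 1
    return [sums[c] / counts[c] for c in categories]

# Toy example: merchant IDs and whether a historical purchase occurred
merchants = ["m1", "m2", "m1", "m3", "m2", "m1"]
bought = [1, 0, 0, 1, 1, 1]
print(target_mean_encode(merchants, bought))
# m1 -> 2/3, m2 -> 1/2, m3 -> 1.0
```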
  • the bin group feature generation device 200 can also generate other discrete features of the machine learning samples.
  • alternatively, the above features may be produced by other feature generating means (not shown). According to an exemplary embodiment of the present disclosure, any combination of the above features may be performed, with the continuous features having been converted into bin group features before being combined.
  • binning group feature generation device 200 can perform at least one binning operation to enable simultaneous acquisition of multiple discrete features that characterize certain attributes of the original data record from different angles, scales/levels.
  • the binning operation refers to a specific way of discretizing a continuous feature, that is, dividing the value range of the continuous feature into a plurality of intervals (i.e., a plurality of bins) and determining the corresponding binning feature value based on the divided bins.
  • binning can be roughly divided into supervised binning and unsupervised binning, each of which includes some specific binning methods: for example, supervised binning includes minimum-entropy binning, minimum-description-length binning, etc., while unsupervised binning includes equal-width binning, equal-depth binning, binning based on k-means clustering, and so on.
  • the corresponding binning parameters can be set, for example, width, depth, and so on.
  • the binning operation performed by the bin group feature generating device 200 is not limited in the kind of binning mode or in the parameters of the binning operation, and the specific representation of the generated binning features is likewise not limited.
  • the binning operation performed by the bin group feature generating device 200 may differ in binning mode and/or binning parameters.
  • the at least one binning operation may be a binning operation of the same type but having different operational parameters (eg, depth, width, etc.), or may be a different type of binning operation.
  • each binning operation yields one binning feature, and together these binning features constitute a bin group feature; the bin group feature can reflect the different binning operations, thereby improving the effectiveness of the machine learning material and providing a good foundation for the training/prediction of the machine learning model.
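  • as an illustrative sketch (not the patent's implementation), two unsupervised binning operations of the kinds named above could be realized in Python as follows:

```python
import numpy as np

def equal_width_bins(values, low, high, width):
    """Unsupervised equal-width binning: split [low, high] into fixed-width
    intervals and return each value's bin index."""
    values = np.asarray(values, dtype=float)
    n_bins = int(np.ceil((high - low) / width))
    idx = ((values - low) // width).astype(int)
    return np.clip(idx, 0, n_bins - 1)

def equal_depth_bins(values, depth):
    """Unsupervised equal-depth binning: sort the values and place `depth`
    samples into each successive bin."""
    values = np.asarray(values)
    order = np.argsort(values, kind="stable")
    idx = np.empty(len(values), dtype=int)
    idx[order] = np.arange(len(values)) // depth
    return idx

ages = [3, 18, 25, 40, 62, 80]
print(equal_width_bins(ages, 0, 100, 25))  # [0 0 1 1 2 3]
print(equal_depth_bins(ages, 2))           # [0 0 1 1 2 2]
```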
  • the feature combining device 300 is configured to generate combined features of the machine learning sample by performing feature combination among discrete features that include the bin group features and other discrete features generated based on the plurality of attribute information.
  • here, the feature combining device 300 can perform any combination among the discrete features, whether bin group features or other discrete features, to obtain corresponding combined features.
  • as an example, feature combination can be performed among any number of bin group features, among any number of the other discrete features, or between any number of bin group features and any number of the other discrete features.
  • feature combinations may be performed in accordance with a Cartesian product between the bin group features and/or the other discrete features.
  • the exemplary embodiments of the present disclosure are not limited to the combination of Cartesian products, and any manner in which the above discrete features can be combined can be applied to the exemplary embodiments of the present disclosure.
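  • as an illustrative sketch (assuming tuple-valued crosses; not the patent's representation), a per-sample feature combination whose value space is the Cartesian product of the individual discrete value spaces could look like:

```python
from itertools import product

def cross(*columns):
    """Per-sample feature combination: a sample's combined feature value is
    the tuple of its values on the crossed discrete features, so the space
    of combined values is the Cartesian product of the individual spaces."""
    return [tuple(vals) for vals in zip(*columns)]

city = ["beijing", "shanghai", "beijing"]  # an ordinary discrete feature
age_bin_w10 = [2, 6, 2]                    # bin index under width-10 binning
age_bin_w50 = [0, 1, 0]                    # bin index under width-50 binning
print(cross(city, age_bin_w10, age_bin_w50))
# [('beijing', 2, 0), ('shanghai', 6, 1), ('beijing', 2, 0)]

# Full space of combined values: the Cartesian product of the value sets
print(list(product(sorted(set(city)),
                   sorted(set(age_bin_w10)),
                   sorted(set(age_bin_w50)))))
```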
  • as an example, the feature combining device 300 may generate the combined features of the machine learning samples in an iterative manner according to a search strategy for the combined features; for example, according to a heuristic search strategy such as beam search, at each level of the search tree the nodes are sorted according to a heuristic cost, only a specific number of nodes (the beam width) are kept, only these nodes continue to be expanded at the next level, and the other nodes are pruned.
  • here, the data record acquisition device 100 may be a device capable of receiving and processing data records, or it may simply be a device that provides already-prepared data records.
  • furthermore, the system shown in FIG. 1 can also be integrated, as the part that completes feature processing, into a system for model training and/or model prediction.
  • FIG. 2 illustrates a block diagram of a training system of a machine learning model in accordance with an exemplary embodiment of the present disclosure.
  • in addition to the data record acquisition device 100, the bin group feature generation device 200, and the feature combination device 300, the system of FIG. 2 includes a machine learning sample generation device 400 and a machine learning model training device 500.
  • the data record acquisition device 100, the bin group feature generation device 200, and the feature combination device 300 can operate in the manner of the system shown in FIG. 1, wherein the data record acquisition device 100 can acquire historical data records that have been labeled.
  • the machine learning sample generating device 400 is configured to generate machine learning samples including at least a portion of the generated combined features. That is, the machine learning samples generated by the machine learning sample generating device 400 include some or all of the combined features generated by the feature combining device 300; further, as an option, the machine learning samples may also include any other features generated based on the attribute information of the data records, for example, features taken directly from the attribute information of the data record, features obtained by performing feature processing on the attribute information, and so on. As described above, these other features may be generated by the bin group feature generating device 200, as an example, or may be generated by other devices.
  • here, the machine learning sample generation device 400 can generate machine learning training samples; in particular, as an example, in the case of supervised learning, the machine learning training samples generated by the machine learning sample generation device 400 can include two parts: features and a label.
  • the machine learning model training device 500 is configured to train a machine learning model based on the machine learning training samples.
  • here, the machine learning model training device 500 can employ any suitable machine learning algorithm (for example, logistic regression) to learn an appropriate machine learning model from the machine learning training samples.
  • FIG. 3 illustrates a block diagram of a prediction system of a machine learning model, according to an exemplary embodiment of the present disclosure.
  • in addition to the data record acquisition device 100, the bin group feature generation device 200, and the feature combination device 300, the system of FIG. 3 includes a machine learning sample generating device 400 and a machine learning model prediction device 600.
  • the data record acquisition device 100, the bin group feature generation device 200, and the feature combination device 300 can operate in the manner of the system shown in FIG. 1, wherein the data record acquisition device 100 can acquire data records to be predicted (for example, new data records that do not contain a label, or historical data records used for testing).
  • the machine learning sample generating device 400 can generate machine learning prediction samples, which include only the feature portion, in a manner similar to that shown in FIG. 2.
  • the machine learning model prediction device 600 is configured to use the already-trained machine learning model to provide prediction results corresponding to the machine learning prediction samples.
  • the machine learning model prediction apparatus 600 may provide prediction results in batches for a plurality of machine learning prediction samples.
  • FIG. 4 illustrates a block diagram of a training and prediction system of a machine learning model in accordance with an exemplary embodiment of the present disclosure.
  • the system of FIG. 4 includes the above-described data record acquisition device 100, bin group feature generation device 200, feature combination device 300, machine learning sample generation device 400, machine learning model training device 500, and machine learning model prediction device 600.
  • the data record acquisition device 100, the bin group feature generation device 200, and the feature combination device 300 can operate in the manner of the system shown in FIG. 1, wherein the data record acquisition device 100 can obtain historical data records or data records to be predicted in a targeted manner.
  • the machine learning sample generating device 400 may generate machine learning training samples or machine learning prediction samples depending on the phase; in particular, in the model training phase, the machine learning sample generating device 400 generates machine learning training samples which, in the case of supervised learning, may include both features and labels.
  • in the model prediction phase, the machine learning sample generation device 400 generates machine learning prediction samples; here it should be understood that the feature portion of a machine learning prediction sample is consistent with the feature portion of a machine learning training sample.
  • the machine learning sample generating device 400 supplies the generated machine learning training samples to the machine learning model training device 500 such that the machine learning model training device 500 trains the machine learning model based on the machine learning training samples.
  • after learning the machine learning model, the machine learning model training device 500 provides the trained machine learning model to the machine learning model prediction device 600.
  • accordingly, the machine learning sample generation device 400 provides the generated machine learning prediction samples to the machine learning model prediction device 600, so that the machine learning model prediction device 600 uses the machine learning model to provide prediction results corresponding to the machine learning prediction samples.
  • as described above, at least one binning operation needs to be performed on the continuous features.
  • the at least one binning operation can be determined by any suitable means, for example, by the experience of a technician or a business person, or automatically by technical means.
  • a particular binning operation can be effectively determined based on the importance of binning features.
  • FIG. 5 illustrates a block diagram of a system for generating combined features of machine learning samples, in accordance with another exemplary embodiment of the present disclosure.
  • in comparison with the system shown in FIG. 1, the system of FIG. 5 additionally includes a binning operation selection device 150.
  • the data record acquisition device 100, the bin group feature generation device 200, and the feature combination device 300 can operate in the manner shown in the system shown in FIG. 1.
  • the binning operation selection device 150 is configured to select the at least one binning operation from a predetermined number of binning operations such that the importance of the binning features corresponding to the selected binning operations is not less than the importance of the binning features corresponding to the unselected binning operations. In this way, the effect of machine learning can be ensured while the size of the combined feature space is reduced.
  • a predetermined number of binning operations may indicate a plurality of binning operations that differ in binning mode and/or binning parameters.
  • here, the binning operation selection device 150 can determine the importance of the binning features, and then select the binning operations corresponding to the more important binning features as the at least one binning operation to be performed by the bin group feature generating device 200.
  • the binning operation selection device 150 can automatically determine the importance of the binning feature in any suitable manner.
  • as an example, the binning operation selection device 150 may construct a single-feature machine learning model for each of the binning features corresponding to the predetermined number of binning operations, determine the importance of each binning feature based on the effect of the respective single-feature machine learning model, and select the at least one binning operation based on the importance of each binning feature, wherein each single-feature machine learning model corresponds to one of the binning features.
  • as another example, the binning operation selection device 150 may construct a composite machine learning model for each of the binning features corresponding to the predetermined number of binning operations, determine the importance of each binning feature based on the effect of the respective composite machine learning model, and select the at least one binning operation based on the importance of each binning feature, wherein the composite machine learning model includes a basic sub-model and an additional sub-model based on a boosting framework, the basic sub-model corresponding to a basic feature subset and the additional sub-model corresponding to one of the binning features.
  • here, the basic feature subset may be fixedly applied to the basic sub-model in all related composite machine learning models, and any feature generated based on the attribute information of the data record may be taken as a basic feature.
  • at least a portion of the attribute information of the data record can be directly used as a basic feature.
  • alternatively, in view of the actual machine learning problem, relatively important or fundamental features may be determined as the basic features, based on test calculations or on designation by business personnel.
  • as an example, in the case of iteratively generating combined features, the binning operation selection device 150 may reselect the binning operations in each round of iteration, and the combined features generated in each round of iteration are added, as new discrete features, to the basic feature subset.
  • the binning operation selection device 150 shown in FIG. 5 can be incorporated into the training system and/or prediction system shown in FIGS. 2 through 4.
  • FIG. 6 illustrates a flowchart of a method of generating combined features of machine learning samples according to an exemplary embodiment of the present disclosure, which is described below with reference to FIG. 6.
  • the method illustrated in FIG. 6 may be performed by the system illustrated in FIG. 1, or may be implemented entirely in software by a computer program, and the method illustrated in FIG. 6 may also be performed by a specially configured computing device.
  • the method illustrated in FIG. 6 is performed by the system illustrated in FIG. 1.
  • in step S100, a data record is acquired by the data record acquisition device 100, wherein the data record includes a plurality of attribute information.
  • the data record acquisition apparatus 100 may collect data by manual, semi-automatic, or fully automatic methods, or process the collected raw data such that the processed data record has an appropriate format or form.
  • the data record acquisition device 100 can collect data in batches.
  • the data record acquisition means 100 can receive a data record manually input by the user through an input means (for example, a workstation).
  • as another example, the data record acquisition device 100 can systematically retrieve data records from a data source in a fully automated manner, for example, requesting a data source and obtaining the requested data from a response, through a timer mechanism implemented in software, firmware, hardware, or a combination thereof.
  • the data source can include one or more databases or other servers.
  • the fully automated acquisition of data can be achieved via an internal network and/or an external network, which can include transmitting encrypted data over the Internet in cases where servers, databases, networks, and the like are involved.
  • the semi-automatic mode is between manual mode and fully automatic mode.
  • the difference between the semi-automatic mode and the fully automatic mode is that a trigger mechanism activated by the user replaces, for example, a timer mechanism.
  • a request to extract data is generated only when a specific user input is received.
  • Each time data is acquired, preferably, the captured data can be stored in a non-volatile memory.
  • a data warehouse can be utilized to store raw data collected during acquisition as well as processed data.
  • the data records obtained above may come from the same or different data sources; that is, each data record may also be the result of splicing together different data records.
  • for example, the data record acquisition device 100 may acquire a customer's other data records at the bank, such as loan records and daily transaction data, and these acquired data records can be spliced into a complete data record.
  • in addition, the data record acquisition device 100 can also acquire data from other private or public sources, for example, data from a data provider, data from the Internet (for example, a social networking site), data from a mobile operator, data from an APP operator, data from a courier company, data from a credit institution, and so on.
  • the data record acquisition device 100 may store and/or process the collected data by means of a hardware cluster (such as a Hadoop cluster or a Spark cluster), for example, performing storage, classification, and other offline operations.
  • the data record acquisition device 100 can also perform on-line stream processing on the collected data.
  • as mentioned above, the data record acquisition device 100 may include a data conversion module such as a text analysis module; accordingly, in step S100, the data record acquisition device 100 may convert unstructured data such as text into structured data that is easier to use, for further processing or reference later.
  • Text-based data can include emails, documents, web pages, graphics, spreadsheets, call center logs, transaction reports, and the like.
  • next, in step S200, at least one binning operation is performed by the bin group feature generating device 200 for each of the at least one continuous feature generated based on the plurality of attribute information, to obtain a bin group feature consisting of at least one binning feature, wherein each binning operation corresponds to one binning feature.
  • step S200 is directed to generating binning group features consisting of binning features that can participate in the automatic combination of discrete features in place of the original continuous features.
  • here, for each continuous feature, at least one corresponding binning feature can be obtained by separately performing the at least one binning operation.
  • the continuous feature can be generated from at least a portion of the attribute information of the data record.
  • as an example, continuous-valued attribute information of the data record, such as distance, age, and amount, may directly serve as continuous features; as another example, a continuous feature may be obtained by further processing certain attribute information of the data record, for example, the ratio of height to weight may be taken as a continuous feature; as yet another example, a continuous feature may be formed by continuously transforming discrete-valued attribute information among the attribute information, where, for example, the continuous transform may count the values of the discrete-valued attribute information and use the obtained statistics as the continuous feature.
  • at least one binning operation may be performed on the obtained continuous features by the bin group feature generating device 200, where the bin group feature generating device 200 may perform the binning operations according to various binning modes and/or binning parameters.
  • as an example, suppose a continuous feature has the value interval [0, 100] and the corresponding binning parameter (i.e., the width) is 50; then two bins can be divided, and a continuous feature with the value 61.5 corresponds to the second bin, so that if the two bins are numbered 0 and 1, the bin corresponding to this continuous feature is numbered 1. Alternatively, if the bin width is 10, ten bins can be divided; a continuous feature with the value 61.5 then corresponds to the seventh bin, and if the ten bins are numbered 0 to 9, the bin corresponding to this continuous feature is numbered 6. Alternatively, if the bin width is 2, fifty bins can be divided; a continuous feature with the value 61.5 then corresponds to the 31st bin, and if the fifty bins are numbered 0 to 49, the bin corresponding to this continuous feature is numbered 30.
  • here, after a continuous feature is assigned to a bin, the corresponding binning feature value can be any customized value.
  • as an example, the binning feature may indicate which bin the continuous feature was assigned to under the corresponding binning operation. That is, performing a binning operation generates a multi-dimensional binning feature corresponding to each continuous feature, where, as an example, each dimension may indicate whether the corresponding continuous feature is assigned to the corresponding bin; for example, "1" indicates that the continuous feature is assigned to the corresponding bin, and "0" indicates that it is not.
  • for example, under the width-10 binning described above, the basic binning feature can be a 10-dimensional feature, and the basic binning feature corresponding to a continuous feature with the value 61.5 can be represented as [0, 0, 0, 0, 0, 0, 1, 0, 0, 0].
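  • the worked example above can be reproduced with a short sketch (illustrative code, not the patent's implementation):

```python
import math

def bin_index(value, low, high, width):
    """Equal-width bin index of `value` over [low, high]."""
    n_bins = math.ceil((high - low) / width)
    return min(max(int((value - low) // width), 0), n_bins - 1), n_bins

def one_hot(idx, n_bins):
    """Multi-dimensional binning feature: 1 in the assigned bin, 0 elsewhere."""
    return [1 if i == idx else 0 for i in range(n_bins)]

for width in (50, 10, 2):
    idx, n = bin_index(61.5, 0, 100, width)
    print(f"width={width}: bin {idx} of {n}")
# width=50: bin 1 of 2
# width=10: bin 6 of 10
# width=2:  bin 30 of 50
print(one_hot(*bin_index(61.5, 0, 100, 10)))
# [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
```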
  • noise in the data record can also be reduced by removing possible outliers in the data samples before performing the binning operation. In this way, the effectiveness of machine learning using binning features can be further improved.
  • as an example, an out-of-group bin can additionally be provided, so that continuous features with outlier values are assigned to the out-of-group bin.
  • as an example, a certain number of samples can be selected for pre-binning: for example, bins are first divided with a bin width of 10, and the number of samples within each bin is recorded; bins with a small number of samples (for example, fewer than a threshold) can then be merged into at least one out-of-group bin.
  • that is, the bins with fewer samples can be merged into the out-of-group bin while the remaining bins are retained; for example, assuming that the number of samples in the bins covering [0, 10] is small, these bins are merged into an out-of-group bin, and continuous features with values falling in [0, 10] are uniformly assigned to the out-of-group bin.
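  • a minimal sketch of this pre-binning step (assuming a simple count threshold; names illustrative):

```python
from collections import Counter

def merge_sparse_bins(bin_indices, min_count, outlier_bin=-1):
    """Pre-bin the samples, then reassign any bin holding fewer than
    `min_count` samples to a single out-of-group (outlier) bin."""
    counts = Counter(bin_indices)
    sparse = {b for b, c in counts.items() if c < min_count}
    return [outlier_bin if b in sparse else b for b in bin_indices]

# Suppose width-10 pre-binning over [0, 100] left bins 0 and 9 nearly empty
pre_bins = [0, 3, 3, 4, 4, 4, 5, 5, 9]
print(merge_sparse_bins(pre_bins, min_count=2))
# [-1, 3, 3, 4, 4, 4, 5, 5, -1]
```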
  • as mentioned above, the at least one binning operation may be binning operations whose binning modes are the same but whose binning parameters differ, or binning operations whose binning modes differ.
  • the binning methods here include various binning methods under supervised binning and/or unsupervised binning.
  • supervised binning includes minimum-entropy binning, minimum-description-length binning, etc.,
  • unsupervised binning includes equal-width binning, equal-depth binning, binning based on k-means clustering, and the like.
  • as an example, the at least one binning operation may correspond to equal-width binning operations of different widths. That is, the binning method adopted is the same but the division granularity differs, which allows the generated binning features to better describe the regularities of the original data record, thereby facilitating the training and prediction of the machine learning model.
  • here, as an example, the different widths used in the at least one binning operation may numerically form a geometric progression.
  • for example, the binning operations may perform equal-width binning with widths of 2, 4, 8, 16, and so on.
  • alternatively, as an example, the different widths used in the at least one binning operation may numerically form an arithmetic progression.
  • for example, the binning operations may perform equal-width binning with widths of 2, 4, 6, 8, and so on.
  • as another example, the at least one binning operation may correspond to equal-depth binning operations of different depths. That is, the binning method adopted is the same but the division granularity differs, which allows the generated binning features to better describe the regularities of the original data record and is thus more conducive to the training and prediction of the machine learning model.
  • here, as an example, the different depths used in the binning operations may numerically form a geometric progression.
  • for example, the binning operations may perform equal-depth binning with depths of 10, 100, 1000, 10000, and so on.
  • alternatively, as an example, the different depths used in the binning operations may numerically form an arithmetic progression.
  • for example, the binning operations may perform equal-depth binning with depths of 10, 20, 30, 40, and so on.
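  • as an illustrative sketch (not the patent's implementation), a bin group feature built from equal-width binning operations whose widths follow such progressions could look like:

```python
def bin_group(value, low, high, widths):
    """Bin group feature for one continuous value: one equal-width binning
    operation per width, hence one bin index per constituent binning feature."""
    return [int((value - low) // w) for w in widths]

geometric_widths = [2, 4, 8, 16]   # widths forming a geometric progression
arithmetic_widths = [2, 4, 6, 8]   # widths forming an arithmetic progression
print(bin_group(61.5, 0, 100, geometric_widths))   # [30, 15, 7, 3]
print(bin_group(61.5, 0, 100, arithmetic_widths))  # [30, 15, 10, 7]
```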
  • after performing each binning operation, the bin group feature generating device 200 can obtain the bin group feature by taking each binning feature as one constituent element. It can be seen that the bin group feature here can be regarded as a collection of binning features and therefore also as a discrete feature.
  • next, in step S300, combined features of the machine learning sample are generated by the feature combining device 300 by performing feature combination among discrete features that include the bin group features and other discrete features generated based on the plurality of attribute information.
  • since the continuous features have been converted into bin group features, which are discrete features, any combination among the features including the bin group features and the other discrete features can serve as a combined feature of the machine learning sample.
  • the combination between the features may be implemented by a Cartesian product, however, it should be noted that the combination is not limited thereto, and any manner in which two or more discrete features can be combined with each other can be applied to the present disclosure.
  • here, a single discrete feature can be regarded as a first-order feature, and, according to an exemplary embodiment of the present disclosure, second-order, third-order, and higher-order feature combinations can be performed until a predetermined cutoff condition is satisfied.
  • the combined features of the machine learning samples may be generated in an iterative manner according to a search strategy for the combined features.
  • FIG. 7 illustrates an example of a search tree for generating a combined feature, according to an exemplary embodiment of the present disclosure.
  • here, the search tree may be based on a heuristic search strategy such as beam search, wherein each layer of the search tree may correspond to feature combinations of a particular order.
  • the discrete features that can be combined include feature A, feature B, feature C, feature D, and feature E.
  • among them, feature A, feature B, and feature C may be discrete features formed from the discrete-valued attribute information of the data record itself, while feature D and feature E may be bin group features transformed from continuous features.
  • at the first level, two nodes, feature B and feature E, are selected as first-order features.
  • as an example, feature importance can be used as the index for sorting the nodes, after which a subset of the nodes is selected to continue expanding at the next level.
  • at the next level, feature BA, feature BC, feature BD, feature BE, feature EA, feature EB, feature EC, and feature ED are generated based on feature B and feature E, and selection continues based on the ranking index, with feature BC and feature EA selected among them.
  • feature BE and feature EB can be considered as the same combined feature.
  • the iteration is continued as described above until a specific cutoff condition is met, for example, an order limit or the like.
  • the nodes selected in each layer can be used as combined features for subsequent processing, for example, as finally adopted features or for further importance evaluation, while the remaining features (shown by dashed lines) are pruned.
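  • the pruning logic of FIG. 7 can be sketched as follows (illustrative Python; the toy importance score stands in for whatever heuristic cost is actually used):

```python
def beam_search_combine(base_features, score, beam_width, max_order):
    """Beam search over feature combinations. Each node is a set of base
    features; at each layer the kept nodes are extended by one feature,
    candidates are ranked by `score`, and all but `beam_width` are pruned."""
    beam = sorted((frozenset([f]) for f in base_features),
                  key=score, reverse=True)[:beam_width]
    selected = list(beam)
    for _ in range(2, max_order + 1):
        candidates = {node | {f} for node in beam for f in base_features
                      if f not in node}  # sets deduplicate: BE == EB
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
        selected.extend(beam)
    return selected

# Toy importance score: the mean weight of the combined base features
weights = {"A": 0.1, "B": 0.9, "C": 0.5, "D": 0.3, "E": 0.8}
toy_score = lambda combo: sum(weights[f] for f in combo) / len(combo)
for combo in beam_search_combine("ABCDE", toy_score, beam_width=2, max_order=3):
    print("".join(sorted(combo)))
```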
  • FIG. 8 illustrates a flowchart of a training method of a machine learning model according to an exemplary embodiment of the present disclosure.
  • in addition to the above-described steps S100, S200, and S300, the method further includes step S400 and step S500.
  • specifically, step S100, step S200, and step S300 may be similar to the corresponding steps shown in FIG. 6, wherein labeled historical data records may be acquired in step S100.
  • in step S400, the machine learning sample generating device 400 may generate machine learning training samples including at least a part of the generated combined features; in the case of supervised learning, the machine learning training samples may include both features and labels.
  • in step S500, the machine learning model may be trained by the machine learning model training device 500 based on the machine learning training samples.
  • the machine learning model training device 500 can learn an appropriate machine learning model from the machine learning training samples using an appropriate machine learning algorithm.
  • after the machine learning model has been trained, the trained machine learning model can be used to make predictions.
  • FIG. 9 illustrates a flowchart of a prediction method of a machine learning model according to an exemplary embodiment of the present disclosure.
  • in addition to the above-described steps S100, S200, and S300, the method further includes step S400 and step S600.
  • step S100, step S200, and step S300 may be similar to the corresponding steps shown in FIG. 6, wherein the data record to be predicted may be acquired in step S100.
  • the machine learning sample generating device 400 may generate a machine learning prediction sample including at least a part of the generated combined features, which may include only the feature portion.
  • in step S600, the machine learning model prediction device 600 may use the machine learning model to provide prediction results corresponding to the machine learning prediction samples.
  • the prediction results can be provided in batches for a plurality of machine learning prediction samples.
  • the machine learning model may be generated by a training method according to an exemplary embodiment of the present disclosure, or may be externally received.
  • according to another exemplary embodiment of the present disclosure, an appropriate binning operation can be automatically selected when acquiring the bin group features.
  • in the method of FIG. 10, steps S100, S200, and S300 are similar to the corresponding steps shown in FIG. 6, and details are not repeated here.
  • the method of FIG. 10 further includes step S150, in which, for each continuous feature, the binning operation selection device 150 can select, from a predetermined number of binning operations, the at least one binning operation to be executed for that continuous feature, such that the binning features corresponding to the selected binning operations are not less important than the binning features corresponding to the unselected binning operations.
  • as an example, the binning operation selection device 150 may construct a single-feature machine learning model for each of the binning features corresponding to the predetermined number of binning operations, determine the importance of each binning feature based on the effect of the respective single-feature machine learning model, and select the at least one binning operation based on the importance of each binning feature.
  • for example, the binning operation selection device 150 may use a portion of the historical data records to build M single-feature machine learning models, where M is an integer greater than 1 and each single-feature machine learning model performs machine learning prediction based on a corresponding single binning feature f_m; it may then measure the effect of the M single-feature machine learning models on the same test data set (for example, by AUC, the Area Under the ROC (Receiver Operating Characteristic) Curve) and determine the at least one binning operation that is finally performed based on the AUC ranking.
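  • a minimal sketch of this ranking step (assuming scikit-learn, one-hot binning features, and logistic regression as the single-feature learner; these are illustrative choices, not mandated by the patent):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import OneHotEncoder

def rank_binning_ops(train_bins, y_train, test_bins, y_test):
    """train_bins[m] / test_bins[m]: bin indices produced by the m-th
    candidate binning operation. Fits one single-feature model per
    operation and returns (AUC, m) pairs, best first."""
    results = []
    for m, (b_tr, b_te) in enumerate(zip(train_bins, test_bins)):
        enc = OneHotEncoder(handle_unknown="ignore")
        x_tr = enc.fit_transform(np.asarray(b_tr).reshape(-1, 1))
        x_te = enc.transform(np.asarray(b_te).reshape(-1, 1))
        model = LogisticRegression().fit(x_tr, y_train)
        auc = roc_auc_score(y_test, model.predict_proba(x_te)[:, 1])
        results.append((auc, m))
    return sorted(results, reverse=True)

# Synthetic demo: three candidate binning operations of different granularity
rng = np.random.default_rng(0)
y_tr, y_te = rng.integers(0, 2, 200), rng.integers(0, 2, 100)
train_bins = [rng.integers(0, k, 200) for k in (2, 10, 50)]
test_bins = [rng.integers(0, k, 100) for k in (2, 10, 50)]
print(rank_binning_ops(train_bins, y_tr, test_bins, y_te))
```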
  • as another example, the binning operation selection device 150 may construct a composite machine learning model for each of the binning features corresponding to the predetermined number of binning operations, determine the importance of each binning feature based on the effect of the respective composite machine learning model, and select the at least one binning operation based on the importance of each binning feature, wherein the composite machine learning model includes a basic sub-model and an additional sub-model based on a boosting framework (for example, a gradient boosting framework), the basic sub-model corresponding to a basic feature subset and the additional sub-model corresponding to one of the binning features.
  • for example, the binning operation selection device 150 may use a portion of the historical data records to build M composite machine learning models, where each composite machine learning model performs machine learning prediction, according to the boosting framework, based on the fixed basic feature subset and a corresponding binning feature f_m; it may then measure the effect of the M composite machine learning models on the same test data set (for example, by AUC) and determine the at least one binning operation that is finally performed based on the AUC ranking.
  • as an example, the binning operation selection device 150 may, with the basic sub-model fixed, separately train an additional sub-model for each binning feature f_m, so as to construct each composite machine learning model.
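  • one reading of this composite model, sketched under the assumption of a single gradient-boosting step for logistic loss; this is an interpretation, not the patent's stated implementation:

```python
import numpy as np

class CompositeModel:
    """Fixed, already-trained base sub-model (anything exposing
    decision_function, e.g. a fitted sklearn LogisticRegression) over the
    basic feature subset, plus an additional sub-model fit to one binning
    feature. Following the usual gradient-boosting step for logistic loss,
    the additional sub-model is fit to the pseudo-residual y - p and its
    output is added in log-odds space."""

    def __init__(self, base_model):
        self.base = base_model  # stays fixed across all candidate features

    def fit_additional(self, x_basic, x_bin, y):
        p = 1.0 / (1.0 + np.exp(-self.base.decision_function(x_basic)))
        residual = np.asarray(y) - p
        # one least-squares step on the pseudo-residual, as in gradient boosting
        self.w, *_ = np.linalg.lstsq(x_bin, residual, rcond=None)
        return self

    def decision_function(self, x_basic, x_bin):
        return self.base.decision_function(x_basic) + x_bin @ self.w
```

  • one such composite model would be built per candidate binning feature f_m, and the AUC comparison from the previous sketch could be reused to rank them.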
  • here, the basic feature subset upon which the basic sub-model is based may be updated as the combined features are generated iteratively.
  • as an example, step S150 may be performed in each round of iteration to update the at least one binning operation to be performed, and the combined features generated in each round of iteration are added to the basic feature subset as new discrete features.
  • initially, the basic feature subset of the composite machine learning model may be empty, or may include at least a portion of the first-order features (for example, feature A, feature B, and feature C as discrete features) or all features (for example, feature A, feature B, and feature C as discrete features, together with the original continuous features corresponding to feature D and feature E).
  • referring to the example of FIG. 7, after the first round of iteration, feature B and feature E are added to the basic feature subset;
  • after the second round of iteration, feature BC and feature EA are added to the basic feature subset;
  • after the third round of iteration, feature BCD and feature EAB are added to the basic feature subset, and so on.
  • the number of combinations of features selected in each iteration is not limited to one.
  • in each round of iteration, the composite machine learning models are re-built to determine the binning operations for the continuous features, so that each continuous feature is converted into the corresponding bin group feature according to the determined binning operations and participates, in the next round of iteration, in combination with the other discrete features.
  • step S150 can also be applied to the methods shown in FIGS. 8 and 9, which will not be described again.
  • the devices illustrated in FIGS. 1 through 5 can each be configured as software, hardware, firmware, or any combination thereof to perform a particular function.
  • these devices may correspond to dedicated integrated circuits, may also correspond to pure software code, and may also correspond to units or modules in which software and hardware are combined.
  • in addition, one or more of the functions implemented by these devices can also be performed collectively by components in a single physical device (for example, a processor, a client, or a server).
  • a method and system for generating combined features of machine learning samples and a corresponding machine learning model training/prediction system according to exemplary embodiments of the present disclosure are described above with reference to FIGS. 1 through 10. It should be understood that the above methods may be implemented by a program recorded on a computer-readable medium; for example, according to an exemplary embodiment of the present disclosure, a computer-readable storage medium storing instructions may be provided, wherein, when the instructions are executed by at least one computing device, the at least one computing device is caused to: acquire a data record, wherein the data record includes a plurality of attribute information; perform at least one binning operation for each of at least one continuous feature generated based on the plurality of attribute information to obtain a bin group feature consisting of at least one binning feature, wherein each binning operation corresponds to one binning feature; and generate combined features of the machine learning samples by performing feature combination among discrete features that include the bin group features and other discrete features generated based on the plurality of attribute information.
  • the computer program in the above computer-readable storage medium can be run in an environment deployed on computer devices such as processors, clients, hosts, proxy devices, and servers; for example, it may be executed by at least one computing device in a stand-alone environment or in a distributed cluster environment, where the computing device may be, by way of example, a computer, a processor, a computing unit (or module), a client, a host, a proxy device, a server, or the like.
  • it should be noted that the computer program can also be used to perform additional steps beyond the above steps, or to perform more specific processing when performing the above steps; the content of these additional steps and further processing has been described with reference to FIGS. 1 through 10 and, to avoid repetition, is not described again here.
  • it should be noted that the combined feature generation system and the machine learning model training/prediction system according to exemplary embodiments of the present disclosure may rely entirely on the running of computer programs to implement the corresponding functions; that is, the respective devices correspond to steps in the functional architecture of the computer program, so that the entire system is invoked through a specialized software package (for example, a lib library) to implement the corresponding functions.
  • the respective devices shown in FIGS. 1 through 5 can also be implemented by hardware, software, firmware, middleware, microcode, or any combination thereof.
  • when implemented in software, firmware, middleware, or microcode, the program code or code segments for performing the corresponding operations may be stored in a computer-readable medium, such as a storage medium, so that a processor can read and run the corresponding program code or code segments to perform the corresponding operations.
  • according to an exemplary embodiment of the present disclosure, a system including at least one computing device and at least one storage device storing instructions may be provided, wherein the instructions, when executed by the at least one computing device, cause the at least one computing device to perform the following steps for generating combined features of machine learning samples: acquiring a data record, wherein the data record includes a plurality of attribute information; performing at least one binning operation for each of at least one continuous feature generated based on the plurality of attribute information to obtain a bin group feature consisting of at least one binning feature, wherein each binning operation corresponds to one binning feature; and generating combined features of the machine learning samples by performing feature combination among discrete features that include the bin group features and other discrete features generated based on the plurality of attribute information.
  • the system may constitute a stand-alone computing environment or a distributed computing environment, and includes at least one computing device and at least one storage device.
  • here, the computing device may be a general-purpose or dedicated computer, a processor, or the like; it may be a simple unit that performs processing purely in software, or an entity combining hardware and software. That is, the computing device can be implemented as a computer, a processor, a computing unit (or module), a client, a host, a proxy device, a server, and the like.
  • the storage device can be a physical storage device or a logically partitioned storage unit that can be operatively coupled to the computing device or can communicate with each other, for example, through an I/O port, a network connection, or the like.
  • in addition, an exemplary embodiment of the present disclosure can also be implemented as a computing device including a storage component and a processor, the storage component having stored therein a set of computer-executable instructions that, when executed by the processor, perform the combined feature generation method, the machine learning model training method, and/or the machine learning model prediction method.
  • the computing device can be deployed in a server or client, or can be deployed on a node device in a distributed network environment.
  • the computing device can be a PC computer, tablet device, personal digital assistant, smart phone, web application, or other device capable of executing the set of instructions described above.
  • the computing device does not have to be a single computing device, but can be any collection of devices or circuits capable of executing the above described instructions (or sets of instructions), either alone or in combination.
  • the computing device can also be part of an integrated control system or system manager, or can be configured as a portable electronic device interfaced locally or remotely (e.g., via wireless transmission).
  • the processor can include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor.
  • the processor may also include, by way of example and not limitation, an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
  • some of the operations described in the combined feature generation method and the machine learning model training/prediction method according to an exemplary embodiment of the present disclosure may be implemented by software, some by hardware, and these operations may also be implemented by a combination of software and hardware.
  • the processor can execute instructions or code stored in one of the storage components, wherein the storage component can also store data.
  • the instructions and data may also be transmitted and received over a network via a network interface device, which may employ any known transmission protocol.
  • the storage component can be integrated with the processor, for example, by arranging the RAM or flash memory within an integrated circuit microprocessor or the like.
  • the storage components can include separate devices such as external disk drives, storage arrays, or other storage devices that can be used with any database system.
  • the storage component and processor may be operatively coupled or may be in communication with one another, such as through an I/O port, a network connection, etc., such that the processor can read the file stored in the storage component.
  • the computing device can also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the computing device can be connected to each other via a bus and/or a network.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Provided are a method and system, performed by at least one computing device, for generating combined features of machine learning samples. The method includes: acquiring a data record, wherein the data record includes a plurality of attribute information; for each continuous feature among at least one continuous feature generated based on the plurality of attribute information, performing at least one binning operation to obtain a binning group feature composed of at least one binning feature, wherein each binning operation corresponds to one binning feature; and generating combined features of machine learning samples by performing feature combination among at least one discrete feature among discrete features that include the binning group features and other discrete features generated based on the plurality of attribute information. According to the method and system, the obtained binning group features are combined with other features, so that the combined features making up the machine learning samples are more effective, thereby improving the effect of the machine learning model.

Description

Method and system for generating combined features of machine learning samples — Technical Field
The present disclosure relates generally to the field of artificial intelligence and, more particularly, to a method and system for generating combined features of machine learning samples.
Background
With the emergence of massive data, artificial intelligence technology has developed rapidly. To mine value from large amounts of data, samples suitable for machine learning need to be generated based on data records.
Here, each data record can be viewed as a description of an event or object, corresponding to an example or sample. A data record includes various items that reflect the performance or nature of the event or object in some aspect, and these items may be called "attributes".
How the attributes of raw data records are transformed into features of machine learning samples has a great influence on the effect of a machine learning model. In fact, the predictive effect of a machine learning model is related to the choice of the model, the available data, and the extraction of features. That is, on the one hand, the prediction effect can be improved by improving the feature extraction method; conversely, inappropriate feature extraction will degrade the prediction effect.
However, determining a feature extraction scheme often requires technicians not only to master machine learning knowledge but also to have a deep understanding of the actual prediction problem, and prediction problems are often bound up with different practical experience in different industries, making satisfactory results hard to achieve. In particular, when combining continuous features with other features, it is difficult, on the one hand, to judge from the standpoint of prediction effect which features should be combined, and on the other hand, to determine an effective combination scheme from a computational standpoint. In summary, it is difficult in the prior art to combine features automatically.
Summary
Exemplary embodiments of the present disclosure are directed to overcoming the deficiency in the prior art that it is difficult to automatically combine features of machine learning samples.
According to an exemplary embodiment of the present disclosure, there is provided a method, performed by at least one computing device, of generating combined features of machine learning samples, including:
acquiring a data record, wherein the data record includes a plurality of attribute information;
for each continuous feature among at least one continuous feature generated based on the plurality of attribute information, performing at least one binning operation to obtain a binning group feature composed of at least one binning feature, wherein each binning operation corresponds to one binning feature; and
generating combined features of machine learning samples by performing feature combination among at least one discrete feature among discrete features that include binning group features and other discrete features generated based on the plurality of attribute information.
According to another exemplary embodiment of the present disclosure, there is provided a system including at least one computing device and at least one storage device storing instructions, wherein the instructions, when executed by the at least one computing device, cause the at least one computing device to perform the following steps for generating combined features of machine learning samples:
acquiring a data record, wherein the data record includes a plurality of attribute information;
for each continuous feature among at least one continuous feature generated based on the plurality of attribute information, performing at least one binning operation to obtain a binning group feature composed of at least one binning feature, wherein each binning operation corresponds to one binning feature; and
generating combined features of machine learning samples by performing feature combination among at least one discrete feature among discrete features that include binning group features and other discrete features generated based on the plurality of attribute information.
According to another exemplary embodiment of the present disclosure, there is provided a computer-readable storage medium storing instructions, wherein the instructions, when executed by at least one computing device, cause the at least one computing device to perform the method of generating combined features of machine learning samples as described above.
According to another exemplary embodiment of the present disclosure, there is provided a system for generating combined features of machine learning samples, including:
a data record acquisition device configured to acquire a data record, wherein the data record includes a plurality of attribute information;
a binning group feature generation device configured to, for each continuous feature among at least one continuous feature generated based on the plurality of attribute information, perform at least one binning operation to obtain a binning group feature composed of at least one binning feature, wherein each binning operation corresponds to one binning feature; and
a feature combination device configured to generate combined features of machine learning samples by performing feature combination among at least one discrete feature among discrete features that include binning group features and other discrete features generated based on the plurality of attribute information.
In the method and system for generating combined features of machine learning samples according to exemplary embodiments of the present disclosure, one or more binning operations are performed on continuous features, and the obtained binning group features are combined with other features, so that the combined features making up the machine learning samples are more effective, thereby improving the effect of the machine learning model.
Brief Description of the Drawings
These and/or other aspects and advantages of the present disclosure will become clearer and easier to understand from the following detailed description of embodiments of the present disclosure in conjunction with the accompanying drawings, in which:
FIG. 1 is a block diagram of a system for generating combined features of machine learning samples according to an exemplary embodiment of the present disclosure;
FIG. 2 is a block diagram of a training system for a machine learning model according to an exemplary embodiment of the present disclosure;
FIG. 3 is a block diagram of a prediction system for a machine learning model according to an exemplary embodiment of the present disclosure;
FIG. 4 is a block diagram of a training and prediction system for a machine learning model according to an exemplary embodiment of the present disclosure;
FIG. 5 is a block diagram of a system for generating combined features of machine learning samples according to another exemplary embodiment of the present disclosure;
FIG. 6 is a flowchart of a method of generating combined features of machine learning samples according to an exemplary embodiment of the present disclosure;
FIG. 7 shows an example of a search strategy for generating combined features according to an exemplary embodiment of the present disclosure;
FIG. 8 is a flowchart of a training method for a machine learning model according to an exemplary embodiment of the present disclosure;
FIG. 9 is a flowchart of a prediction method for a machine learning model according to an exemplary embodiment of the present disclosure; and
FIG. 10 is a flowchart of a method of generating combined features of machine learning samples according to another exemplary embodiment of the present disclosure.
Detailed Description
To help those skilled in the art better understand the present disclosure, exemplary embodiments of the present disclosure are described in further detail below with reference to the accompanying drawings and specific embodiments.
In exemplary embodiments of the present disclosure, automatic feature combination is performed as follows: for each single continuous feature among at least one continuous feature, at least one binning operation is performed to generate one or more binning features corresponding to that single continuous feature; the binning group feature composed of these binning features is then combined with other discrete features (for example, single discrete features and/or other binning group features), which can make the generated machine learning samples better suited to machine learning and thereby yield better prediction results.
Here, machine learning is an inevitable product of the development of artificial intelligence research to a certain stage; it is devoted to improving the performance of a system itself by computational means, using experience. In a computer system, "experience" usually exists in the form of "data", and a "model" can be produced from data through machine learning algorithms. That is, providing empirical data to a machine learning algorithm produces a model based on that empirical data; when facing a new situation, the model provides a corresponding judgment, i.e., a prediction result. Whether training a machine learning model or making predictions with a trained one, the data needs to be converted into machine learning samples that include various features. Machine learning may be implemented in the form of "supervised learning", "unsupervised learning", or "semi-supervised learning"; it should be noted that exemplary embodiments of the present disclosure place no particular restriction on the specific machine learning algorithm. It should also be noted that other means such as statistical algorithms may be combined in the process of training and applying the model.
FIG. 1 is a block diagram of a system for generating combined features of machine learning samples according to an exemplary embodiment of the present disclosure. Specifically, the system performs at least one binning operation on each continuous feature to be combined, so that a single continuous feature can be converted into a binning group feature composed of the corresponding at least one binning feature; further, the binning group feature is combined with other discrete features, so that the original data records can be characterized simultaneously from different angles, scales, and levels. With the system, combined features of machine learning samples can be generated automatically, and the corresponding machine learning samples help improve the machine learning effect (for example, model stability and model generalization).
As shown in FIG. 1, the data record acquisition device 100 is configured to acquire a data record, wherein the data record includes a plurality of attribute information.
The above data record may be data generated online, data generated and stored in advance, or data received from outside through an input device or transmission medium. The data may involve attribute information of individuals, enterprises, or organizations, for example, information on identity, education, occupation, assets, contact details, liabilities, income, profit, and tax. Alternatively, the data may involve attribute information of business-related items, for example, information on the transaction amount of a sales contract, the parties to the transaction, the subject matter, and the transaction location. It should be noted that the attribute information content mentioned in exemplary embodiments of the present disclosure may involve the performance or nature of any object or transaction in some aspect, and is not limited to defining or describing individuals, objects, organizations, units, institutions, projects, events, and the like.
The data record acquisition device 100 may acquire structured or unstructured data from different sources, for example, text data or numerical data. The acquired data records may be used to form machine learning samples and take part in the training/prediction process of machine learning. The data may come from inside the entity that expects to obtain the model prediction results, for example, the bank, enterprise, or school expecting the prediction results; the data may also come from outside such entities, for example, from data providers, the Internet (for example, social networking sites), mobile operators, APP operators, courier companies, credit institutions, and the like. Optionally, the above internal data and external data may be used in combination to form machine learning samples carrying more information.
The above data may be input to the data record acquisition device 100 through an input device, or generated automatically by the data record acquisition device 100 from existing data, or obtained by the data record acquisition device 100 from a network (for example, a storage medium on the network, such as a data warehouse); in addition, an intermediate data exchange device such as a server may help the data record acquisition device 100 acquire corresponding data from an external data source. Here, the acquired data may be converted into an easily processed format by a data conversion module, such as a text analysis module, in the data record acquisition device 100. It should be noted that the data record acquisition device 100 may be configured as various modules composed of software, hardware, and/or firmware, and some or all of these modules may be integrated into one body or cooperate jointly to accomplish a specific function.
The binning group feature generation device 200 is configured to, for each continuous feature among at least one continuous feature generated based on the plurality of attribute information, perform at least one binning operation to obtain a binning group feature composed of at least one binning feature, wherein each binning operation corresponds to one binning feature.
Here, corresponding continuous features may be generated for at least a part of the attribute information of the data record. A continuous feature is a feature opposed to a discrete feature (for example, a category feature); its value may be a numerical value with a certain continuity, such as distance, age, or amount. By contrast, as an example, the values of a discrete feature lack continuity; for example, unordered categorical features such as "from Beijing", "from Shanghai", "from Tianjin", "gender male", or "gender female". It can thus be seen that, for the plurality of attribute information of a data record as a whole, at least one continuous feature may be generated accordingly. Here, exemplary embodiments of the present disclosure do not restrict the specific way in which each continuous feature is generated (for example, from which attribute information field or fields it is generated).
For example, the binning group feature generation device 200 may directly take a certain continuous-valued attribute in a data record as the corresponding continuous feature in a machine learning sample; for example, attributes such as distance, age, and amount may directly serve as corresponding continuous features. That is, each continuous feature may be formed by the continuous-valued attribute information itself among the plurality of attribute information.
Alternatively, the binning group feature generation device 200 may process certain attribute information (for example, continuous-valued attribute information and/or discrete-valued attribute information) in the data record to obtain a corresponding continuous feature; for example, the ratio of height to weight may serve as a corresponding continuous feature. In particular, a continuous feature may be formed by performing a continuous transformation on discrete-valued attribute information among the plurality of attribute information. As an example, the continuous transformation may indicate computing statistics over the values of the discrete-valued attribute information. For example, a continuous feature may indicate statistical information of certain discrete-valued attribute information with respect to the prediction target of the machine learning model. For instance, in an example of predicting purchase probability, the discrete-valued attribute information of seller merchant ID may be transformed into a probability statistic of historical purchase behavior for the corresponding seller merchant ID.
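For illustration only, the following is a minimal sketch of such a continuous transformation; the column names seller_id and purchased are hypothetical and do not appear in the disclosure. A discrete seller ID is replaced by the historical purchase rate observed for that seller, yielding a continuous feature:

```python
import pandas as pd

# Hypothetical records: a discrete seller ID and a binary purchase label.
records = pd.DataFrame({
    "seller_id": ["s1", "s2", "s1", "s3", "s2", "s1"],
    "purchased": [1, 0, 1, 0, 1, 0],
})

# Continuous transformation: statistics over the values of the discrete
# attribute with respect to the prediction target, i.e., the historical
# purchase rate per seller, used as a continuous feature.
purchase_rate = records.groupby("seller_id")["purchased"].mean()
records["seller_purchase_rate"] = records["seller_id"].map(purchase_rate)
print(records)
```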
In addition, besides the continuous features on which binning operations will be performed, the binning group feature generation device 200 may also generate other discrete features of the machine learning sample. Optionally, such features may instead be generated by another feature generation device (not shown). According to exemplary embodiments of the present disclosure, the above features may be combined arbitrarily, with continuous features having been converted into binning group features by the time of combination.
For each continuous feature, the binning group feature generation device 200 may perform at least one binning operation, so as to simultaneously obtain multiple discrete features that characterize certain attributes of the original data records from different angles, scales, and levels.
Here, a binning operation is a particular way of discretizing a continuous feature, i.e., dividing the value range of the continuous feature into multiple intervals (i.e., multiple bins) and determining the corresponding binning feature value based on the divided bins. Binning operations can be roughly divided into supervised binning and unsupervised binning, each of which includes some specific binning modes; for example, supervised binning includes minimum-entropy binning, minimum description length binning, and the like, while unsupervised binning includes equal-width binning, equal-depth binning, binning based on k-means clustering, and the like. Under each binning mode, corresponding binning parameters, such as width and depth, may be set. It should be noted that, according to exemplary embodiments of the present disclosure, the binning operations performed by the binning group feature generation device 200 are not restricted in the kind of binning mode or in the parameters of the binning operation, nor is the specific representation of the resulting binning features restricted.
The binning operations performed by the binning group feature generation device 200 may differ in binning mode and/or binning parameters. For example, the at least one binning operation may be binning operations of the same kind but with different operation parameters (for example, depth, width), or binning operations of different kinds. Accordingly, each binning operation yields one binning feature, and these binning features together form one binning group feature. The binning group feature can reflect the different binning operations, thereby improving the effectiveness of the machine learning material and providing a better basis for the training/prediction of the machine learning model.
The feature combination device 300 is configured to generate combined features of machine learning samples by performing feature combination (feature crosses) among at least one discrete feature among discrete features that include binning group features and other discrete features generated based on the plurality of attribute information.
As described above, continuous features are converted into discrete features in the form of binning groups, and one or more other discrete features may also be generated based on the attribute information. Accordingly, the feature combination device 300 may cause arbitrary combination among discrete features, whether binning group features or other discrete features, to obtain corresponding combined features. Specifically, any number of binning group features may be feature-combined with one another, any number of the other discrete features may be feature-combined with one another, or any number of binning group features may be feature-combined with any number of the other discrete features. Here, as an example, binning group features and/or the other discrete features may be feature-combined according to the Cartesian product. It should be understood, however, that exemplary embodiments of the present disclosure are not limited to Cartesian-product combination; any way of combining the above discrete features may be applied to exemplary embodiments of the present disclosure.
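A minimal sketch of Cartesian-product feature crossing between two discrete features follows; the feature names and values are hypothetical, and a binning group feature would enter the cross in the same way through its bin labels:

```python
from itertools import product

# Hypothetical discrete features: one category feature and the bin labels
# of one binning operation taken from a binning group feature.
city = ["Beijing", "Shanghai", "Tianjin"]
age_bins = ["bin0", "bin1", "bin2"]

# Cartesian product: every pair of values becomes one value of the
# combined (second-order) feature.
city_x_age = [f"{c}&{a}" for c, a in product(city, age_bins)]
print(city_x_age)  # ['Beijing&bin0', 'Beijing&bin1', ..., 'Tianjin&bin2']
```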
As an example, the feature combination device 300 may generate the combined features of machine learning samples iteratively according to a search strategy over combined features. For example, according to a heuristic search strategy such as beam search, at each level of the search tree the nodes are sorted by heuristic cost and only a specific number of nodes (the beam width) are retained; only these nodes continue to be expanded at the next level, while the other nodes are pruned.
The system shown in FIG. 1 is intended to generate combined features of machine learning samples, and may exist independently. Here, it should be noted that the way the system acquires data records is not restricted; that is, as an example, the data record acquisition device 100 may be a device capable of receiving and processing data records, or merely a device that provides already prepared data records.
In addition, the system shown in FIG. 1 may also be integrated into a model training and/or model prediction system as a component that accomplishes feature processing.
FIG. 2 is a block diagram of a training system for a machine learning model according to an exemplary embodiment of the present disclosure. The system shown in FIG. 2 includes, in addition to the above data record acquisition device 100, binning group feature generation device 200, and feature combination device 300, a machine learning sample generation device 400 and a machine learning model training device 500.
Specifically, in the system shown in FIG. 2, the data record acquisition device 100, the binning group feature generation device 200, and the feature combination device 300 may operate in the manner of the system shown in FIG. 1, wherein the data record acquisition device 100 may acquire labeled historical data records.
Furthermore, the machine learning sample generation device 400 is configured to generate machine learning samples including at least a part of the generated combined features. That is, the machine learning samples generated by the machine learning sample generation device 400 include part or all of the combined features generated by the feature combination device 300; optionally, the machine learning samples may additionally include any other features generated based on the attribute information of the data records, for example, features directly constituted by the attribute information of the data records themselves, or features obtained by performing feature processing on the attribute information. As described above, as an example, these other features may be generated by the binning group feature generation device 200 or by another device.
Specifically, the machine learning sample generation device 400 may generate machine learning training samples; in particular, as an example, in the case of supervised learning, a machine learning training sample generated by the machine learning sample generation device 400 may include two parts: features and a label.
The machine learning model training device 500 is configured to train a machine learning model based on the machine learning training samples. Here, the machine learning model training device 500 may employ any appropriate machine learning algorithm (for example, logistic regression) to learn an appropriate machine learning model from the machine learning training samples.
In the above example, a machine learning model that is relatively stable and has good prediction performance can be trained.
FIG. 3 is a block diagram of a prediction system for a machine learning model according to an exemplary embodiment of the present disclosure. Compared with the system shown in FIG. 1, the system of FIG. 3 includes, in addition to the data record acquisition device 100, the binning group feature generation device 200, and the feature combination device 300, a machine learning sample generation device 400 and a machine learning model prediction device 600.
Specifically, in the system shown in FIG. 3, the data record acquisition device 100, the binning group feature generation device 200, and the feature combination device 300 may operate in the manner of the system shown in FIG. 1, wherein the data record acquisition device 100 may acquire data records to be predicted (for example, new data records without labels, or historical data records used for testing). Accordingly, the machine learning sample generation device 400 may, in a manner similar to that shown in FIG. 2, generate machine learning prediction samples that include only the feature part.
The machine learning model prediction device 600 is configured to use a trained machine learning model to provide prediction results corresponding to the machine learning prediction samples. Here, the machine learning model prediction device 600 may provide prediction results for multiple machine learning prediction samples in batches.
It should be noted here that the systems of FIGS. 2 and 3 may also be effectively merged to form a system capable of accomplishing both the training and the prediction of a machine learning model.
Specifically, FIG. 4 is a block diagram of a training and prediction system for a machine learning model according to an exemplary embodiment of the present disclosure. The system shown in FIG. 4 includes the above data record acquisition device 100, binning group feature generation device 200, feature combination device 300, machine learning sample generation device 400, machine learning model training device 500, and machine learning model prediction device 600.
Here, in the system shown in FIG. 4, the data record acquisition device 100, the binning group feature generation device 200, and the feature combination device 300 may operate in the manner of the system shown in FIG. 1, wherein the data record acquisition device 100 may acquire historical data records or data records to be predicted, as the case requires. In addition, the machine learning sample generation device 400 may generate machine learning training samples or machine learning prediction samples depending on the situation. Specifically, in the model training phase, the machine learning sample generation device 400 may generate machine learning training samples; in particular, as an example, in the case of supervised learning, a machine learning training sample may include two parts: features and a label. In the model prediction phase, the machine learning sample generation device 400 may generate machine learning prediction samples; it should be understood here that the feature part of a machine learning prediction sample is consistent with the feature part of a machine learning training sample.
Furthermore, in the model training phase, the machine learning sample generation device 400 provides the generated machine learning training samples to the machine learning model training device 500, so that the machine learning model training device 500 trains the machine learning model based on the machine learning training samples. After the machine learning model training device 500 has learned the machine learning model, it provides the trained machine learning model to the machine learning model prediction device 600. Accordingly, in the model prediction phase, the machine learning sample generation device 400 provides the generated machine learning prediction samples to the machine learning model prediction device 600, so that the machine learning model prediction device 600 uses the machine learning model to provide prediction results for the machine learning prediction samples.
According to exemplary embodiments of the present disclosure, at least one binning operation needs to be performed on a continuous feature. Here, the at least one binning operation may be determined in any appropriate way; for example, it may be determined with the help of the experience of technicians or business personnel, or automatically by technical means. As an example, the specific binning operations may be determined effectively based on the importance of the binning features.
FIG. 5 is a block diagram of a system for generating combined features of machine learning samples according to another exemplary embodiment of the present disclosure. Compared with the system shown in FIG. 1, the system of FIG. 5 includes, in addition to the data record acquisition device 100, the binning group feature generation device 200, and the feature combination device 300, a binning operation selection device 150.
In the system shown in FIG. 5, the data record acquisition device 100, the binning group feature generation device 200, and the feature combination device 300 may operate in the manner of the system shown in FIG. 1. In addition, the binning operation selection device 150 is configured to select the at least one binning operation from a predetermined number of binning operations such that the importance of the binning features corresponding to the selected binning operations is not lower than the importance of the binning features corresponding to the unselected binning operations. In this way, the effect of machine learning can be ensured while reducing the size of the combined feature space.
Specifically, the predetermined number of binning operations may indicate multiple binning operations that differ in binning mode and/or binning parameters. Here, performing each binning operation yields one corresponding binning feature; accordingly, the binning operation selection device 150 may determine the importance of these binning features and then select the binning operations corresponding to the more important binning features as the at least one binning operation to be performed by the binning group feature generation device 200.
Here, the binning operation selection device 150 may automatically determine the importance of binning features in any appropriate way.
For example, the binning operation selection device 150 may, for each binning feature among the binning features corresponding to the predetermined number of binning operations, build a single-feature machine learning model, determine the importance of each binning feature based on the effect of each single-feature machine learning model, and select the at least one binning operation based on the importance of each binning feature, wherein each single-feature machine learning model corresponds to one of the binning features.
As another example, the binning operation selection device 150 may, for each binning feature among the binning features corresponding to the predetermined number of binning operations, build a composite machine learning model, determine the importance of each binning feature based on the effect of each composite machine learning model, and select the at least one binning operation based on the importance of each binning feature, wherein the composite machine learning model includes a basic submodel and an additional submodel based on a boosting framework; the basic submodel corresponds to a basic feature subset, and the additional submodel corresponds to the respective binning feature. According to exemplary embodiments of the present disclosure, the basic feature subset may be fixedly applied to the basic submodels in all related composite machine learning models; here, any feature generated based on the attribute information of the data records may serve as a basic feature. For example, at least a part of the attribute information of the data records may directly serve as basic features. Furthermore, as an example, relatively important or basic features may be determined as basic features in view of the actual machine learning problem, based on test computation or as designated by business personnel. Here, where combined features are generated iteratively, the binning operation selection device 150 may select binning operations for each round of iteration, and the combined features generated in each round of iteration are added to the basic feature subset as new discrete features.
It should be understood that the binning operation selection device 150 shown in FIG. 5 may be incorporated into the training systems and/or prediction systems shown in FIGS. 2 to 4.
A flowchart of a method of generating combined features of machine learning samples according to an exemplary embodiment of the present disclosure is described below with reference to FIG. 6. Here, as an example, the method shown in FIG. 6 may be performed by the system shown in FIG. 1, may be implemented entirely in software by a computer program, or may be performed by a specifically configured computing device. For convenience of description, it is assumed that the method shown in FIG. 6 is performed by the system shown in FIG. 1.
As shown in the figure, in step S100, a data record is acquired by the data record acquisition device 100, wherein the data record includes a plurality of attribute information.
Here, as an example, the data record acquisition device 100 may collect data manually, semi-automatically, or fully automatically, or process the collected raw data so that the processed data records have an appropriate format or form. As an example, the data record acquisition device 100 may collect data in batches.
Here, the data record acquisition device 100 may receive data records manually input by a user through an input device (for example, a workstation). In addition, the data record acquisition device 100 may systematically fetch data records from a data source in a fully automatic manner, for example, by systematically requesting the data source and obtaining the requested data from the response through a timer mechanism implemented in software, firmware, hardware, or a combination thereof. The data source may include one or more databases or other servers. Fully automatic data acquisition may be implemented via an internal network and/or an external network, which may include transmitting encrypted data over the Internet. Where servers, databases, networks, and the like are configured to communicate with one another, data collection may proceed automatically without manual intervention, but it should be noted that certain user input operations may still exist in this mode. The semi-automatic mode lies between the manual mode and the fully automatic mode; it differs from the fully automatic mode in that a trigger mechanism activated by the user replaces, for example, the timer mechanism, so that a request to extract data is generated only upon receiving specific user input. Each time data is acquired, the captured data may preferably be stored in non-volatile memory. As an example, a data warehouse may be used to store the raw data collected during acquisition as well as the processed data.
The acquired data records may come from the same or different data sources; that is, each data record may also be the result of splicing different data records. For example, in addition to acquiring the information data record filled in by a customer when applying to a bank for a credit card (which includes attribute information fields such as income, education, position, and asset status), the data record acquisition device 100 may, as an example, also acquire the customer's other data records at the bank, such as loan records and daily transaction data, and these acquired data records may be spliced into a complete data record. In addition, the data record acquisition device 100 may also acquire data from other private or public sources, for example, data from data providers, from the Internet (for example, social networking sites), from mobile operators, from APP operators, from courier companies, from credit institutions, and so on.
Optionally, the data record acquisition device 100 may store and/or process the collected data with the help of a hardware cluster (such as a Hadoop cluster or a Spark cluster), for example, storage, classification, and other offline operations. In addition, the data record acquisition device 100 may also perform online stream processing on the collected data.
As an example, the data record acquisition device 100 may include a data conversion module such as a text analysis module; accordingly, in step S100, the data record acquisition device 100 may convert unstructured data such as text into more usable structured data for further processing or reference later. Text-based data may include emails, documents, web pages, graphics, spreadsheets, call center logs, transaction reports, and the like.
Next, in step S200, the binning group feature generation device 200 performs, for each continuous feature among at least one continuous feature generated based on the plurality of attribute information, at least one binning operation to obtain a binning group feature composed of at least one binning feature, wherein each binning operation corresponds to one binning feature.
Specifically, step S200 aims to generate a binning group feature composed of binning features; such a binning group feature can take the place of the original continuous feature in the automatic combination among discrete features. To this end, for each continuous feature, at least one corresponding binning feature can be obtained by performing at least one binning operation respectively.
Continuous features may be generated from at least part of the attribute information of a data record. As an example, continuously valued attribute information of a data record, such as distance, age, and amount, may directly serve as continuous features; as another example, continuous features may be obtained by further processing certain attribute information of the data record, for example, the ratio of height to weight may serve as a continuous feature; as yet another example, a continuous feature may be formed by performing a continuous transformation on discrete-valued attribute information among the attribute information, where, for instance, the continuous transformation may indicate computing statistics over the values of the discrete-valued attribute information, with the resulting statistical information serving as the continuous feature.
After the continuous features are obtained, the binning group feature generation device 200 may perform at least one binning operation on them; here, the binning group feature generation device 200 may perform binning operations according to various binning modes and/or binning parameters.
Taking unsupervised equal-width binning as an example, suppose the value interval of a continuous feature is [0, 100] and the corresponding binning parameter (i.e., the width) is 50; then 2 bins can be produced, in which case a continuous feature valued 61.5 corresponds to the 2nd bin, and if the two bins are labeled 0 and 1, the bin label corresponding to the continuous feature is 1. Alternatively, suppose the binning width is 10; then 10 bins can be produced, in which case a continuous feature valued 61.5 corresponds to the 7th bin, and if the ten bins are labeled 0 to 9, the bin label corresponding to the continuous feature is 6. Alternatively, suppose the binning width is 2; then 50 bins can be produced, in which case a continuous feature valued 61.5 corresponds to the 31st bin, and if the fifty bins are labeled 0 to 49, the bin label corresponding to the continuous feature is 30.
After a continuous feature is mapped to multiple bins, the corresponding feature value may be any custom-defined value. Here, a binning feature may indicate which bin the continuous feature was assigned to under the corresponding binning operation. That is, a binning operation is performed to generate a multi-dimensional binning feature corresponding to each continuous feature, where, as an example, each dimension may indicate whether the corresponding bin received the continuous feature, for example, with "1" indicating that the continuous feature was assigned to the corresponding bin and "0" indicating that it was not. Accordingly, in the above example, assuming 10 bins were produced, the basic binning feature may be a 10-dimensional feature, and the basic binning feature corresponding to a continuous feature valued 61.5 may be represented as [0,0,0,0,0,0,1,0,0,0].
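The mapping just described can be sketched as follows; this is a simplified illustration of equal-width binning with a one-hot bin representation, not the only encoding the disclosure permits:

```python
import numpy as np

def equal_width_onehot(value, low, high, width):
    """Map a continuous value to an equal-width bin over [low, high) and
    return the bin label plus a one-hot vector over the bins."""
    n_bins = int(np.ceil((high - low) / width))
    label = min(int((value - low) // width), n_bins - 1)  # clamp right edge
    onehot = np.zeros(n_bins, dtype=int)
    onehot[label] = 1
    return label, onehot

label, vec = equal_width_onehot(61.5, 0, 100, 10)
print(label)  # 6
print(vec)    # [0 0 0 0 0 0 1 0 0 0]
```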
In addition, as an example, before performing the binning operation, noise in the data records may be reduced by removing possible outliers from the data samples. In this way, the effectiveness of machine learning using binning features can be further improved.
Specifically, an additional outlier bin may be set up so that continuous features with outlier values are assigned to the outlier bin. For example, for a continuous feature with value interval [0, 1000], a certain number of samples may be selected for pre-binning: first perform equal-width binning with a binning width of 10, then record the number of samples in each bin; bins with few samples (for example, fewer than a threshold) may be merged into at least one outlier bin. As an example, if the bins at the two ends contain few samples, the bins with fewer samples may be merged into an outlier bin while the remaining bins are retained; supposing bins 0-10 contain few samples, bins 0-10 may be merged into an outlier bin, so that continuous features valued in [0, 100] are uniformly assigned to the outlier bin.
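One possible realization of the outlier-bin idea is sketched below: a sample is pre-binned with equal width, and bins whose sample count falls below a threshold are merged into a single outlier bin (labeled -1 here purely by convention; the data is synthetic):

```python
import numpy as np

def merge_sparse_bins(values, width, min_count):
    """Pre-bin with equal width, then relabel bins holding fewer than
    min_count samples as one outlier bin (label -1)."""
    edges = np.arange(values.min(), values.max() + width, width)
    idx = np.clip(np.digitize(values, edges) - 1, 0, len(edges) - 2)
    counts = np.bincount(idx, minlength=len(edges) - 1)
    return np.where(counts[idx] < min_count, -1, idx)

rng = np.random.default_rng(0)
sample = rng.normal(500, 80, size=2000).clip(0, 1000)
labels = merge_sparse_bins(sample, width=10, min_count=5)
print((labels == -1).sum(), "samples fell into the outlier bin")
```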
According to exemplary embodiments of the present disclosure, the at least one binning operation may be binning operations with the same binning mode but different binning parameters; alternatively, the at least one binning operation may be binning operations with different binning modes.
The binning modes here include various binning modes under supervised binning and/or unsupervised binning. For example, supervised binning includes minimum-entropy binning, minimum description length binning, and the like, while unsupervised binning includes equal-width binning, equal-depth binning, binning based on k-means clustering, and the like.
As an example, the at least one binning operation may respectively correspond to equal-width binning operations of different widths. That is, the binning mode employed is the same but the granularity of division differs, which enables the resulting binning features to better characterize the regularities of the original data records and is thus more conducive to the training and prediction of the machine learning model. In particular, the different widths employed by the at least one binning operation may numerically form a geometric sequence; for example, equal-width binning may be performed with widths of 2, 4, 8, 16, and so on. Alternatively, the different widths may numerically form an arithmetic sequence; for example, equal-width binning may be performed with widths of 2, 4, 6, 8, and so on.
As another example, the at least one binning operation may respectively correspond to equal-depth binning operations of different depths. That is, the binning mode employed is the same but the granularity of division differs, which enables the resulting binning features to better characterize the regularities of the original data records and is thus more conducive to the training and prediction of the machine learning model. In particular, the different depths may numerically form a geometric sequence; for example, equal-depth binning may be performed with depths of 10, 100, 1000, 10000, and so on. Alternatively, the different depths may numerically form an arithmetic sequence; for example, equal-depth binning may be performed with depths of 10, 20, 30, 40, and so on.
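The two families of operations can be sketched together as below; the widths 2, 4, 8, 16 follow the geometric-sequence example above, while the ten-quantile equal-depth binning is an illustrative choice:

```python
import numpy as np

values = np.random.default_rng(1).uniform(0, 100, size=10_000)

# Equal-width binnings whose widths form a geometric sequence: each
# operation contributes one binning feature to the binning group feature.
binning_group = {}
for width in (2, 4, 8, 16):
    binning_group[f"width{width}"] = (values // width).astype(int)

# An equal-depth binning: edges are quantiles, so each bin receives
# roughly the same number of samples.
edges = np.quantile(values, np.linspace(0, 1, 11))
binning_group["depth10"] = np.clip(np.digitize(values, edges) - 1, 0, 9)

for name, bins in binning_group.items():
    print(name, "->", len(np.unique(bins)), "bins")
```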
For each continuous feature, after the corresponding at least one binning feature has been obtained by performing the binning operations, the binning group feature generation device 200 may obtain the binning group feature by taking each binning feature as a constituent element. It can be seen that the binning group feature here can be regarded as a set of binning features and is therefore also used as a discrete feature.
In step S300, the feature combination device 300 generates combined features of machine learning samples by performing feature combination among at least one discrete feature among discrete features that include binning group features and other discrete features generated based on the plurality of attribute information. Here, since continuous features have been converted into binning group features serving as discrete features, arbitrary combinations may be made among the features including binning group features and other discrete features to serve as combined features of the machine learning samples. As an example, combination among features may be implemented via the Cartesian product; however, it should be noted that the combination manner is not limited thereto, and any manner capable of combining two or more discrete features with one another may be applied to exemplary embodiments of the present disclosure.
Here, a single discrete feature may be regarded as a first-order feature. According to exemplary embodiments of the present disclosure, second-order, third-order, and higher-order feature combinations may be performed until a predetermined cutoff condition is met. As an example, the combined features of machine learning samples may be generated iteratively according to a search strategy over combined features.
FIG. 7 shows an example of a search tree for generating combined features according to an exemplary embodiment of the present disclosure. According to exemplary embodiments of the present disclosure, for example, the search tree may be based on a heuristic search strategy such as beam search, where one level of the search tree may correspond to feature combinations of a particular order.
Referring to FIG. 7, suppose the discrete features available for combination include feature A, feature B, feature C, feature D, and feature E. As an example, features A, B, and C may be discrete features formed by the discrete-valued attribute information of the data records themselves, while features D and E may be binning group features converted from continuous features.
According to the search strategy, in the first round of iteration, the two nodes feature B and feature E are selected as first-order features. Here, an indicator such as feature importance may be used to sort the nodes, and a subset of nodes is then selected to continue expanding at the next level.
In the next round of iteration, the second-order combined features BA, BC, BD, BE, EA, EB, EC, and ED are generated based on features B and E, and features BC and EA among them are selected, again based on the sorting indicator. As an example, features BE and EB may be regarded as the same combined feature.
Iteration continues in the above manner until a specific cutoff condition, such as an order limit, is met. Here, the nodes selected at each level (shown in solid lines) may serve as combined features for subsequent processing, for example, as finally adopted features or for further importance evaluation, while the remaining features (shown in dashed lines) are pruned.
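A schematic beam search over feature crosses, in the spirit of FIG. 7, might look like the following; the scoring function is a deterministic stand-in for whatever importance indicator is actually used, and the beam width of 2 matches the two nodes kept per level in the example:

```python
def beam_search(features, score, beam_width=2, max_order=3):
    """Keep the beam_width best nodes per level; only those are expanded
    to the next order, and all other nodes are pruned."""
    beam = sorted(((f,) for f in features), key=score, reverse=True)[:beam_width]
    selected = list(beam)
    for _ in range(max_order - 1):
        # BE and EB denote the same cross, so children are stored as
        # sorted tuples to deduplicate them.
        children = {tuple(sorted(set(node) | {f}))
                    for node in beam for f in features if f not in node}
        beam = sorted(children, key=score, reverse=True)[:beam_width]
        selected.extend(beam)
    return selected

# Stand-in importance score (purely illustrative).
score = lambda combo: sum(ord(c) for c in combo) % 11
print(beam_search(["A", "B", "C", "D", "E"], score))
```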
FIG. 8 is a flowchart of a training method for a machine learning model according to an exemplary embodiment of the present disclosure. In the method shown in FIG. 8, in addition to the above steps S100, S200, and S300, the method further includes steps S400 and S500.
Specifically, in the method shown in FIG. 8, steps S100, S200, and S300 may be similar to the corresponding steps shown in FIG. 6, where labeled historical data records may be acquired in step S100.
Furthermore, in step S400, the machine learning sample generation device 400 may generate machine learning training samples including at least a part of the generated combined features; in the case of supervised learning, a machine learning training sample may include two parts: features and a label.
In step S500, the machine learning model training device 500 may train a machine learning model based on the machine learning training samples. Here, the machine learning model training device 500 may use an appropriate machine learning algorithm to learn an appropriate machine learning model from the machine learning training samples.
After the machine learning model is trained, predictions may be made using the trained machine learning model.
FIG. 9 is a flowchart of a prediction method for a machine learning model according to an exemplary embodiment of the present disclosure. In the method shown in FIG. 9, in addition to the above steps S100, S200, and S300, the method further includes steps S400 and S600.
Specifically, in the method shown in FIG. 9, steps S100, S200, and S300 may be similar to the corresponding steps shown in FIG. 6, where the data records to be predicted may be acquired in step S100.
Furthermore, in step S400, the machine learning sample generation device 400 may generate machine learning prediction samples including at least a part of the generated combined features; a machine learning prediction sample may include only the feature part.
In step S600, the machine learning model prediction device 600 may use the machine learning model to provide prediction results corresponding to the machine learning prediction samples. Here, prediction results may be provided for multiple machine learning prediction samples in batches. Furthermore, the machine learning model may be generated by a training method according to an exemplary embodiment of the present disclosure, or received from outside.
As described above, according to exemplary embodiments of the present disclosure, appropriate binning operations may be selected automatically when obtaining binning group features. A flowchart of a method of generating combined features of machine learning samples according to another exemplary embodiment of the present disclosure is described below with reference to FIG. 10.
Referring to FIG. 10, steps S100, S200, and S300 therein are similar to the corresponding steps shown in FIG. 6, and the details are not repeated here. Compared with the method of FIG. 6, the method of FIG. 10 further includes step S150, in which, for each continuous feature, the binning operation selection device 150 may select, from a predetermined number of binning operations, the at least one binning operation to be performed on that continuous feature, such that the importance of the binning features corresponding to the selected binning operations is not lower than the importance of the binning features corresponding to the unselected binning operations.
As an example, the binning operation selection device 150 may, for each binning feature among the binning features corresponding to the predetermined number of binning operations, build a single-feature machine learning model, determine the importance of each binning feature based on the effect of each single-feature machine learning model, and select the at least one binning operation based on the importance of each binning feature.
For example, suppose that for a continuous feature F there are a predetermined number M (M being an integer greater than 1) of binning operations, corresponding to M binning features f_m, where m ∈ [1, M]. Accordingly, the binning operation selection device 150 may use a portion of the historical data records to build M single-feature machine learning models (where each single-feature machine learning model makes predictions for the machine learning problem based on its corresponding single binning feature f_m), then measure the effect of these M single-feature machine learning models on the same test data set (for example, the AUC, i.e., the Area Under the ROC (Receiver Operating Characteristic) Curve), and determine the at least one binning operation to finally perform based on the AUC ranking.
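The following sketch illustrates this selection procedure with synthetic placeholder data (random features and labels, so the AUC values themselves carry no meaning); each candidate binning is represented by a pre-computed one-hot matrix:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
# Synthetic one-hot matrices, one per candidate binning operation of the
# same continuous feature F (here indexed by width 2, 4, 8).
candidates = {w: rng.integers(0, 2, size=(500, 50)) for w in (2, 4, 8)}

aucs = {}
for w, X in candidates.items():
    # Single-feature model: trained on this binning feature alone.
    model = LogisticRegression(max_iter=1000).fit(X[:400], y[:400])
    aucs[w] = roc_auc_score(y[400:], model.predict_proba(X[400:])[:, 1])

# Binning operations whose features rank highest by AUC are retained.
print(sorted(aucs, key=aucs.get, reverse=True))
```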
As another example, the binning operation selection device 150 may, for each binning feature among the binning features corresponding to the predetermined number of binning operations, build a composite machine learning model, determine the importance of each binning feature based on the effect of each composite machine learning model, and select the at least one binning operation based on the importance of each binning feature, wherein the composite machine learning model includes a basic submodel and an additional submodel based on a boosting framework (for example, a gradient boosting framework); the basic submodel corresponds to a basic feature subset, and the additional submodel corresponds to the respective binning feature.
For example, suppose that for a continuous feature F there are a predetermined number M of binning operations, corresponding to M binning features f_m, where m ∈ [1, M]. Accordingly, the binning operation selection device 150 may use a portion of the historical data records to build M composite machine learning models (where each composite machine learning model makes predictions for the machine learning problem under the boosting framework, based on the fixed basic feature subset and the corresponding binning feature f_m), then measure the effect of these M composite machine learning models on the same test data set (for example, AUC), and determine the at least one binning operation to finally perform based on the AUC ranking. Preferably, to further improve computational efficiency and reduce resource consumption, the binning operation selection device 150 may build each composite machine learning model by training the additional submodel separately for each binning feature f_m while keeping the basic submodel fixed. Here, the basic feature subset on which the basic submodel depends may be updated as the iterations of generating combined features proceed.
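A rough sketch of this composite-model comparison follows. The base submodel is trained once on the fixed basic feature subset; for each candidate binning feature, an additional submodel is fit to the base model's residuals, which is a simplification of one boosting step rather than the exact framework contemplated by the disclosure; all data here is synthetic:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
X_base = rng.normal(size=(600, 5))               # fixed basic feature subset
y = (X_base[:, 0] + rng.normal(size=600) > 0).astype(int)
candidates = {w: rng.integers(0, 2, size=(600, 20)) for w in (2, 4, 8)}

# Basic submodel: trained once and then held fixed for every candidate.
base = LogisticRegression(max_iter=1000).fit(X_base[:400], y[:400])
margin = base.decision_function(X_base)
residual = y[:400] - base.predict_proba(X_base[:400])[:, 1]

aucs = {}
for w, X_bin in candidates.items():
    # Additional submodel: fits the residual signal left by the base model,
    # one simplified boosting step per candidate binning feature.
    extra = LinearRegression().fit(X_bin[:400], residual)
    score = margin[400:] + extra.predict(X_bin[400:])
    aucs[w] = roc_auc_score(y[400:], score)
print(sorted(aucs, key=aucs.get, reverse=True))
```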
In an example such as that of FIG. 7, where combined features of machine learning samples are generated iteratively according to a search strategy over combined features, step S150 may be performed for each round of iteration to update the at least one binning operation, and the combined features generated in each round of iteration are added to the basic feature subset as new discrete features. For example, in the example of FIG. 7, in the first round of iteration the basic feature subset of the composite machine learning model may be empty, or may include at least a part of the first-order features (for example, features A, B, and C as discrete features) or all of them (for example, features A, B, and C as discrete features, together with the original continuous features corresponding to features D and E). After the first round of iteration, features B and E are added to the basic feature subset. Then, after the second round of iteration, features BC and EA are added to the basic feature subset; after the third round of iteration, features BCD and EAB are added, and so on. It should be noted that the number of feature combinations selected in each round of iteration is not limited to one. Meanwhile, for each round of iteration, the binning operations for the continuous features are re-determined by building composite machine learning models, so that the continuous features are converted into corresponding binning group features according to the determined binning operations, to be combined with other discrete features in the immediately following round of iteration.
It should be noted that the above step S150 may equally be applied to the methods shown in FIGS. 8 and 9, which will not be repeated here.
The devices shown in FIGS. 1 to 5 may each be configured as software, hardware, firmware, or any combination thereof that performs a specific function. For example, these devices may correspond to dedicated integrated circuits, to pure software code, or to units or modules combining software and hardware. Furthermore, one or more functions implemented by these devices may also be performed collectively by components in a physical entity device (for example, a processor, a client, or a server).
The method and system for generating combined features of machine learning samples according to exemplary embodiments of the present disclosure, and the corresponding machine learning model training/prediction systems, have been described above with reference to FIGS. 1 to 10. It should be understood that the above methods may be implemented by a program recorded on a computer-readable medium; for example, according to an exemplary embodiment of the present disclosure, a computer-readable storage medium storing instructions may be provided, wherein the instructions, when executed by at least one computing device, cause the at least one computing device to: acquire a data record, wherein the data record includes a plurality of attribute information; for each continuous feature among at least one continuous feature generated based on the plurality of attribute information, perform at least one binning operation to obtain a binning group feature composed of at least one binning feature, wherein each binning operation corresponds to one binning feature; and generate combined features of machine learning samples by performing feature combination among at least one discrete feature among discrete features that include binning group features and other discrete features generated based on the plurality of attribute information.
The computer program in the above computer-readable storage medium may run in an environment deployed on computer equipment such as processors, clients, hosts, proxy devices, and servers, for example, run by at least one computing device located in a stand-alone environment or a distributed cluster environment; as an example, the computing device here may serve as a computer, processor, computing unit (or module), client, host, proxy device, server, and the like. It should be noted that the computer program may also be used to perform additional steps beyond the above steps, or to perform more specific processing when performing the above steps; the content of these additional steps and further processing has been described with reference to FIGS. 1 to 10 and is not repeated here to avoid redundancy.
It should be noted that the combined feature generation system and the machine learning model training/prediction systems according to exemplary embodiments of the present disclosure may rely entirely on the running of computer programs to implement the corresponding functions; that is, the respective devices correspond to steps in the functional architecture of the computer program, so that the entire system is invoked through a dedicated software package (for example, a lib library) to implement the corresponding functions.
On the other hand, the devices shown in FIGS. 1 to 5 may also be implemented through hardware, software, firmware, middleware, microcode, or any combination thereof. When implemented in software, firmware, middleware, or microcode, the program code or code segments for performing the corresponding operations may be stored in a computer-readable medium such as a storage medium, so that a processor can perform the corresponding operations by reading and running the corresponding program code or code segments.
For example, according to an exemplary embodiment of the present disclosure, a system may be provided that includes at least one computing device and at least one storage device storing instructions, wherein the instructions, when executed by the at least one computing device, cause the at least one computing device to perform the following steps for generating combined features of machine learning samples: acquiring a data record, wherein the data record includes a plurality of attribute information; for each continuous feature among at least one continuous feature generated based on the plurality of attribute information, performing at least one binning operation to obtain a binning group feature composed of at least one binning feature, wherein each binning operation corresponds to one binning feature; and generating combined features of machine learning samples by performing feature combination among at least one discrete feature among discrete features that include binning group features and other discrete features generated based on the plurality of attribute information.
Here, the system may constitute a stand-alone computing environment or a distributed computing environment, and includes at least one computing device and at least one storage device. As an example, a computing device may be a general-purpose or dedicated computer, a processor, and the like; it may be a unit that performs processing purely through software, or an entity combining software and hardware. That is, the computing device may be implemented as a computer, processor, computing unit (or module), client, host, proxy device, server, and the like. Moreover, the storage device may be a physical storage device or a logically partitioned storage unit, which may be operatively coupled to the computing device or may communicate with it, for example, through an I/O port or network connection.
Furthermore, for example, an exemplary embodiment of the present disclosure may also be implemented as a computing device including a storage component and a processor, the storage component storing a set of computer-executable instructions which, when executed by the processor, perform the combined feature generation method, the machine learning model training method, and/or the machine learning model prediction method.
Specifically, the computing device may be deployed in a server or client, or on a node device in a distributed network environment. Moreover, the computing device may be a PC computer, tablet device, personal digital assistant, smartphone, web application, or other device capable of executing the above instruction set.
Here, the computing device need not be a single computing device; it may be any assembly of devices or circuits capable of executing the above instructions (or instruction set) individually or jointly. The computing device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device interconnected through an interface locally or remotely (for example, via wireless transmission).
In the computing device, the processor may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a dedicated processor system, a microcontroller, or a microprocessor. By way of example and not limitation, the processor may also include an analog processor, digital processor, microprocessor, multi-core processor, processor array, network processor, and the like.
Some of the operations described in the combined feature generation method and the machine learning model training/prediction methods according to exemplary embodiments of the present disclosure may be implemented in software, some in hardware, and these operations may also be implemented through a combination of software and hardware.
The processor may run instructions or code stored in one of the storage components, which may also store data. Instructions and data may also be sent and received over a network via a network interface device, which may employ any known transmission protocol.
The storage component may be integrated with the processor, for example, by arranging RAM or flash memory within an integrated circuit microprocessor or the like. Furthermore, the storage component may include an independent device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The storage component and the processor may be operatively coupled, or may communicate with each other, for example, through an I/O port or network connection, so that the processor can read files stored in the storage component.
In addition, the computing device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, or touch input device). All components of the computing device may be connected to one another via a bus and/or a network.
The operations involved in the combined feature generation method and the corresponding machine learning model training/prediction methods according to exemplary embodiments of the present disclosure may be described as various interconnected or coupled functional blocks or functional diagrams. However, these functional blocks or functional diagrams may equally be integrated into a single logical device or operate along non-exact boundaries.
The exemplary embodiments of the present disclosure have been described above. It should be understood that the above description is merely exemplary and not exhaustive, and the present disclosure is not limited to the disclosed exemplary embodiments. Many modifications and changes will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the present disclosure. Therefore, the protection scope of the present disclosure should be defined by the scope of the claims.

Claims (28)

  1. A method, performed by at least one computing device, of generating combined features of machine learning samples, comprising:
    acquiring a data record, wherein the data record includes a plurality of attribute information;
    for each continuous feature among at least one continuous feature generated based on the plurality of attribute information, performing at least one binning operation to obtain a binning group feature composed of at least one binning feature, wherein each binning operation corresponds to one binning feature; and
    generating combined features of machine learning samples by performing feature combination among at least one discrete feature among discrete features that include binning group features and other discrete features generated based on the plurality of attribute information.
  2. The method of claim 1, wherein, before the step of performing at least one binning operation for each continuous feature among the at least one continuous feature generated based on the plurality of attribute information, the method further comprises: selecting the at least one binning operation from a predetermined number of binning operations such that the importance of the binning features corresponding to the selected binning operations is not lower than the importance of the binning features corresponding to the unselected binning operations.
  3. The method of claim 2, wherein the step of selecting the at least one binning operation from a predetermined number of binning operations comprises: for each binning feature among the binning features corresponding to the predetermined number of binning operations, building a single-feature machine learning model, determining the importance of each binning feature based on the effect of each single-feature machine learning model, and selecting the at least one binning operation based on the importance of each binning feature.
  4. The method of claim 2, wherein the step of selecting the at least one binning operation from a predetermined number of binning operations comprises: for each binning feature among the binning features corresponding to the predetermined number of binning operations, building a composite machine learning model, determining the importance of each binning feature based on the effect of each composite machine learning model, and selecting the at least one binning operation based on the importance of each binning feature,
    wherein the composite machine learning model includes a basic submodel and an additional submodel based on a boosting framework, the basic submodel corresponding to a basic feature subset and the additional submodel corresponding to the respective binning feature.
  5. The method of claim 4, wherein the combined features of machine learning samples are generated iteratively according to a search strategy over combined features.
  6. The method of claim 5, wherein the step of selecting the at least one binning operation from a predetermined number of binning operations is performed for each round of iteration to update the at least one binning operation, and the combined features generated in each round of iteration are added to the basic feature subset as new discrete features.
  7. The method of any one of claims 1 to 6, wherein feature combination among the at least one discrete feature is performed according to the Cartesian product.
  8. The method of claim 1, wherein the at least one binning operation respectively corresponds to equal-width binning operations of different widths or equal-depth binning operations of different depths.
  9. The method of claim 8, wherein the different widths or different depths numerically form a geometric sequence or an arithmetic sequence.
  10. The method of claim 1, wherein a binning feature indicates which bin a continuous feature was assigned to under the corresponding binning operation.
  11. The method of claim 1, wherein each continuous feature is formed by continuous-valued attribute information itself among the plurality of attribute information, or is formed by performing a continuous transformation on discrete-valued attribute information among the plurality of attribute information.
  12. The method of claim 11, wherein the continuous transformation indicates computing statistics over the values of the discrete-valued attribute information.
  13. The method of claim 4, wherein each composite machine learning model is built by training the additional submodel separately while keeping the basic submodel fixed.
  14. A system comprising at least one computing device and at least one storage device storing instructions, wherein the instructions, when executed by the at least one computing device, cause the at least one computing device to perform the following steps for generating combined features of machine learning samples:
    acquiring a data record, wherein the data record includes a plurality of attribute information;
    for each continuous feature among at least one continuous feature generated based on the plurality of attribute information, performing at least one binning operation to obtain a binning group feature composed of at least one binning feature, wherein each binning operation corresponds to one binning feature; and
    generating combined features of machine learning samples by performing feature combination among at least one discrete feature among discrete features that include binning group features and other discrete features generated based on the plurality of attribute information.
  15. The system of claim 14, wherein the instructions, when executed by the at least one computing device, further cause the at least one computing device to perform the following step: selecting the at least one binning operation from a predetermined number of binning operations such that the importance of the binning features corresponding to the selected binning operations is not lower than the importance of the binning features corresponding to the unselected binning operations.
  16. The system of claim 15, wherein the step of selecting the at least one binning operation from a predetermined number of binning operations comprises: for each binning feature among the binning features corresponding to the predetermined number of binning operations, building a single-feature machine learning model, determining the importance of each binning feature based on the effect of each single-feature machine learning model, and selecting the at least one binning operation based on the importance of each binning feature.
  17. The system of claim 15, wherein the step of selecting the at least one binning operation from a predetermined number of binning operations comprises: for each binning feature among the binning features corresponding to the predetermined number of binning operations, building a composite machine learning model, determining the importance of each binning feature based on the effect of each composite machine learning model, and selecting the at least one binning operation based on the importance of each binning feature,
    wherein the composite machine learning model includes a basic submodel and an additional submodel based on a boosting framework, the basic submodel corresponding to a basic feature subset and the additional submodel corresponding to the respective binning feature.
  18. The system of claim 17, wherein the combined features of machine learning samples are generated iteratively according to a search strategy over combined features.
  19. The system of claim 18, wherein the at least one binning operation is reselected for each round of iteration, and the combined features generated in each round of iteration are added to the basic feature subset as new discrete features.
  20. The system of any one of claims 14 to 19, wherein feature combination among the at least one discrete feature is performed according to the Cartesian product.
  21. The system of claim 14, wherein the at least one binning operation respectively corresponds to equal-width binning operations of different widths or equal-depth binning operations of different depths.
  22. The system of claim 21, wherein the different widths or different depths numerically form a geometric sequence or an arithmetic sequence.
  23. The system of claim 14, wherein a binning feature indicates which bin a continuous feature was assigned to under the corresponding binning operation.
  24. The system of claim 14, wherein each continuous feature is formed by continuous-valued attribute information itself among the plurality of attribute information, or is formed by performing a continuous transformation on discrete-valued attribute information among the plurality of attribute information.
  25. The system of claim 24, wherein the continuous transformation indicates computing statistics over the values of the discrete-valued attribute information.
  26. The system of claim 17, wherein a binning operation selection device builds each composite machine learning model by training the additional submodel separately while keeping the basic submodel fixed.
  27. A computer-readable storage medium storing instructions, wherein the instructions, when executed by at least one computing device, cause the at least one computing device to perform the method of generating combined features of machine learning samples of any one of claims 1 to 13.
  28. A system for generating combined features of machine learning samples, comprising:
    a data record acquisition device configured to acquire a data record, wherein the data record includes a plurality of attribute information;
    a binning group feature generation device configured to, for each continuous feature among at least one continuous feature generated based on the plurality of attribute information, perform at least one binning operation to obtain a binning group feature composed of at least one binning feature, wherein each binning operation corresponds to one binning feature; and
    a feature combination device configured to generate combined features of machine learning samples by performing feature combination among at least one discrete feature among discrete features that include binning group features and other discrete features generated based on the plurality of attribute information.
PCT/CN2018/096233 2017-07-20 2018-07-19 Method and system for generating combined features of machine learning samples WO2019015631A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710595326.7A 2017-07-20 Method and system for generating combined features of machine learning samples
CN201710595326.7 2017-07-20

Publications (1)

Publication Number Publication Date
WO2019015631A1 (zh) 2019-01-24

Family

ID=60337203

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/096233 WO2019015631A1 (zh) 2017-07-20 2018-07-19 Method and system for generating combined features of machine learning samples

Country Status (2)

Country Link
CN (2) CN107392319A (zh)
WO (1) WO2019015631A1 (zh)


Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
  • CN107392319A (zh) * 2017-07-20 2017-11-24 第四范式(北京)技术有限公司 Method and system for generating combined features of machine learning samples
  • CN109840726B (zh) * 2017-11-28 2021-05-14 华为技术有限公司 Article binning method and apparatus, and computer-readable storage medium
  • CN108090516A (zh) * 2017-12-27 2018-05-29 第四范式(北京)技术有限公司 Method and system for automatically generating features of machine learning samples
  • CN113065101B (zh) * 2018-01-03 2024-04-02 第四范式(北京)技术有限公司 Visual interpretation method and apparatus for logistic regression models
  • CN108510003A (zh) * 2018-03-30 2018-09-07 深圳广联赛讯有限公司 Method, apparatus and storage medium for extracting combined risk-control features from Internet-of-Vehicles big data
  • CN109213833A (zh) * 2018-09-10 2019-01-15 成都四方伟业软件股份有限公司 Binary classification model training method, data classification method, and corresponding apparatus
  • CN110968887B (zh) * 2018-09-28 2022-04-05 第四范式(北京)技术有限公司 Method and system for performing machine learning under data privacy protection
  • CN112101562B (zh) * 2019-06-18 2024-01-30 第四范式(北京)技术有限公司 Method and system for implementing a machine learning modeling process
  • CN110956272B (zh) * 2019-11-01 2023-08-08 第四范式(北京)技术有限公司 Method and system for implementing data processing
  • US11301351B2 (en) 2020-03-27 2022-04-12 International Business Machines Corporation Machine learning based data monitoring
  • CN112001452B (zh) * 2020-08-27 2021-08-27 深圳前海微众银行股份有限公司 Feature selection method, apparatus, device, and readable storage medium
  • CN112163704B (zh) * 2020-09-29 2021-05-14 筑客网络技术(上海)有限公司 High-quality supplier prediction method for a building materials bidding platform
  • TW202226054 2020-12-17 緯創資通股份有限公司 Object recognition apparatus and object recognition method


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
  • CN1864153A (zh) * 2002-04-19 2006-11-15 计算机联合思想公司 Method and apparatus for discovering evolution in a system
  • CN106095942A (zh) * 2016-06-12 2016-11-09 腾讯科技(深圳)有限公司 Strong variable extraction method and apparatus
  • CN106407999A (zh) * 2016-08-25 2017-02-15 北京物思创想科技有限公司 Method and system for performing machine learning in combination with rules
  • CN107392319A (zh) * 2017-07-20 2017-11-24 第四范式(北京)技术有限公司 Method and system for generating combined features of machine learning samples

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
  • CN111506575A (zh) * 2020-03-26 2020-08-07 第四范式(北京)技术有限公司 Training method, apparatus and system for a branch service volume prediction model
  • CN111506575B (zh) * 2020-03-26 2023-10-24 第四范式(北京)技术有限公司 Training method, apparatus and system for a branch service volume prediction model
  • US11514369B2 (en) * 2020-06-16 2022-11-29 DataRobot, Inc. Systems and methods for machine learning model interpretation
  • CN112380215A (zh) * 2020-11-17 2021-02-19 北京融七牛信息技术有限公司 Automatic feature generation method based on cross aggregation
  • CN115130619A (zh) * 2022-08-04 2022-09-30 中建电子商务有限责任公司 Risk control method based on a cluster-selection ensemble

Also Published As

Publication number Publication date
CN107392319A (zh) 2017-11-24
CN112990486A (zh) 2021-06-18

Similar Documents

Publication Publication Date Title
WO2019015631A1 (zh) Method and system for generating combined features of machine learning samples
Bilal et al. Big Data in the construction industry: A review of present status, opportunities, and future trends
US10417528B2 (en) Analytic system for machine learning prediction model selection
WO2019047790A1 (zh) Method and system for generating combined features of machine learning samples
WO2018059016A1 (zh) Feature processing method and feature processing system for machine learning
Venkatram et al. Review on big data & analytics–concepts, philosophy, process and applications
US10452992B2 (en) Interactive interfaces for machine learning model evaluations
KR20230070272A (ko) 머신 러닝 모델에서 동적 이상값 편향 감소를 구현하도록 구성된 컴퓨터 기반 시스템, 컴퓨팅 구성요소 및 컴퓨팅 객체
US11645548B1 (en) Automated cloud data and technology solution delivery using machine learning and artificial intelligence modeling
CN111783893A (zh) Method and system for generating combined features of machine learning samples
CN107273979B (zh) Method and system for performing machine learning prediction based on service level
CN116757297A (zh) Method and system for selecting features of machine learning samples
US11514369B2 (en) Systems and methods for machine learning model interpretation
CN111797927A (zh) Method and system for determining important features of machine learning samples
CN114298323A (zh) Method and system for generating combined features of machine learning samples
CN116882520A (zh) Prediction method and system for a predetermined prediction problem
Babu et al. Framework for Predictive Analytics as a Service using ensemble model
CN114579584A (zh) Data table processing method and apparatus, computer device, and storage medium
US11853657B2 (en) Machine-learned model selection network planning
AU2020101842A4 (en) DAI- Dataset Discovery: DATASET DISCOVERY IN DATA ANALYTICS USING AI- BASED PROGRAMMING.
Sharma et al. Deep learning in big data and data mining
Poornima et al. Prediction of Water Consumption Using Machine Learning Algorithm
Dass et al. Amelioration of big data analytics by employing big data tools and techniques
Liu Apache spark machine learning blueprints
Ghosh et al. Understanding Machine Learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18834978

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18834978

Country of ref document: EP

Kind code of ref document: A1