CN112990486A - Method and system for generating combined features of machine learning samples - Google Patents


Info

Publication number: CN112990486A
Application number: CN202110446590.0A
Authority: CN (China)
Legal status: Pending
Original language: Chinese (zh)
Prior art keywords: binning, feature, features, machine learning, attribute information
Inventors: 陈雨强, 戴文渊, 杨强, 罗远飞, 涂威威
Assignee (original and current): 4Paradigm Beijing Technology Co Ltd
Application filed by 4Paradigm Beijing Technology Co Ltd; priority to CN202110446590.0A

Classifications

    • G06N 20/00 — Machine learning (Physics; Computing; computing arrangements based on specific computational models)
    • G06F 18/211 — Selection of the most significant subset of features (Electric digital data processing; pattern recognition)
    • G06F 18/214 — Generating training patterns; bootstrap methods, e.g. bagging or boosting


Abstract

A method and system for generating combined features of machine learning samples are provided. The method comprises the following steps: (A) obtaining a data record, wherein the data record comprises a plurality of attribute information; (B) performing at least one binning operation on each continuous feature generated based on the plurality of attribute information to obtain a bin-group feature composed of at least one binning feature, wherein each binning operation corresponds to one binning feature; and (C) generating combined features of the machine learning samples through feature combinations between the binning features and/or other discrete features generated based on the plurality of attribute information. By combining the obtained bin-group features with other features, the method and the system make the combined features that constitute the machine learning samples more effective, thereby improving the effect of the machine learning model.

Description

Method and system for generating combined features of machine learning samples
The present application is a divisional application of the patent application entitled "Method and system for generating combined features of machine learning samples", filed on July 20, 2017 under application No. 201710595326.7.
Technical Field
The present invention relates generally to the field of artificial intelligence, and more particularly to a method and system for generating combined features of machine learning samples.
Background
With the advent of massive amounts of data, artificial intelligence techniques have evolved rapidly. To extract value from massive data, it is necessary to generate samples suitable for machine learning based on data records.
Here, each data record may be considered as a description of an event or object, corresponding to an example or sample. In a data record, various items are included that reflect the performance or nature of an event or object in some respect, and these items may be referred to as "attributes".
How each attribute of an original data record is converted into a feature of a machine learning sample can greatly influence the effect of the machine learning model. In fact, the predictive performance of a machine learning model depends on the choice of model, on the available data, on the extracted features, and so on. Thus, on the one hand, the prediction effect can be improved by improving the feature extraction manner; conversely, inappropriate feature extraction degrades the prediction effect.
However, determining a feature extraction manner often requires technicians not only to master machine learning knowledge but also to deeply understand the actual prediction problem, which is in turn bound up with practical experience that differs from industry to industry, so satisfactory results are difficult to achieve. In particular, when combining a continuous feature with other features, it is difficult to judge which features should be combined from the viewpoint of prediction effect, and also difficult to specify an efficient combination scheme from the viewpoint of computation. In short, it is difficult in the prior art to combine features automatically.
Disclosure of Invention
Exemplary embodiments of the present invention aim to overcome the drawback of the prior art that it is difficult to automatically combine features of machine-learned samples.
According to an exemplary embodiment of the invention, there is provided a method of generating combined features of machine learning samples, comprising: (A) obtaining a data record, wherein the data record comprises a plurality of attribute information; (B) executing at least one binning operation for each continuous feature generated based on the plurality of attribute information to obtain a binning group feature consisting of at least one binning feature, wherein each binning operation corresponds to one binning feature; and (C) generating combined features of the machine-learned samples by combining features between the binned features and/or other discrete features produced based on the plurality of attribute information.
Optionally, in the method, before step (B), the method further includes: (D) selecting the at least one binning operation from a predetermined number of binning operations such that the importance of the binning features corresponding to the selected binning operations is not lower than the importance of the binning features corresponding to the unselected binning operations.
Optionally, in the method, in step (D), for each of the binning features corresponding to the predetermined number of binning operations, a single-feature machine learning model is constructed, the importance of each binning feature is determined based on the effect of each single-feature machine learning model, and the at least one binning operation is selected based on the importance of each binning feature, wherein a single-feature machine learning model corresponds to each binning feature.
Optionally, in the method, in step (D), a composite machine learning model is constructed for each of the binning features corresponding to the predetermined number of binning operations, the importance of each binning feature is determined based on the effect of each composite machine learning model, and the at least one binning operation is selected based on the importance of each binning feature, wherein the composite machine learning model includes a basic sub-model and an additional sub-model under a boosting framework, the basic sub-model corresponding to a basic feature subset and the additional sub-model corresponding to each binning feature.
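The selection in step (D) can be sketched in pure Python. This is a minimal illustration, not the patent's implementation: the per-bin majority-label accuracy below is a simple stand-in for building an actual single-feature machine learning model per binning feature and measuring its effect (e.g., AUC); all function and variable names are illustrative.

```python
def equal_width_bins(values, width):
    """Map each continuous value to an equal-width bin index."""
    lo = min(values)
    return [int((v - lo) // width) for v in values]

def single_feature_score(bin_ids, labels):
    """Proxy for a single-feature model's effect: accuracy of
    predicting the majority label within each bin."""
    from collections import defaultdict
    counts = defaultdict(lambda: defaultdict(int))
    for b, y in zip(bin_ids, labels):
        counts[b][y] += 1
    correct = sum(max(c.values()) for c in counts.values())
    return correct / len(labels)

def select_binning_ops(values, labels, widths, keep=1):
    """Step (D): keep the binning operations whose corresponding
    binning features score highest (i.e., are most important)."""
    scored = [(single_feature_score(equal_width_bins(values, w), labels), w)
              for w in widths]
    scored.sort(reverse=True)
    return [w for _, w in scored[:keep]]

ages = [18, 22, 25, 33, 41, 52, 60, 64]
labels = [0, 0, 0, 1, 1, 1, 1, 1]
print(select_binning_ops(ages, labels, widths=[5, 10, 20, 40], keep=2))
```

With this toy data, the narrower widths separate the two label groups cleanly and are therefore kept, while the unselected operations never score higher than the selected ones, as the claim requires.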
Optionally, in the method, the combined features of the machine-learned samples are generated in an iterative manner according to a search strategy for the combined features.
Optionally, in the method, step (D) is performed for each iteration round to update the at least one binning operation, and the combined features generated in each iteration round are added as new discrete features to the basic feature subset.
Optionally, in the method, in step (C), the features are combined according to a cartesian product between the binned features and/or the other discrete features.
Optionally, in the method, the at least one binning operation corresponds to an equal-width binning operation of different widths or an equal-depth binning operation of different depths, respectively.
Optionally, in the method, the different widths or different depths numerically constitute a geometric series or an arithmetic series.
Optionally, in the method, the binning feature indicates the bin into which the continuous feature is binned according to the corresponding binning operation.
Alternatively, in the method, each of the continuous features is formed by continuous-value attribute information itself among the plurality of attribute information, or each of the continuous features is formed by continuously transforming discrete-value attribute information among the plurality of attribute information.
Optionally, in the method, the continuous transformation indicates that statistics are computed over the values of the discrete-value attribute information.
Optionally, in the method, each composite machine learning model is constructed by separately training the additional sub-model while keeping the basic sub-model fixed.
According to another exemplary embodiment of the invention, there is provided a system for generating combined features of machine learning samples, comprising: data record obtaining means for obtaining a data record, wherein the data record includes a plurality of attribute information; binning feature generating means for performing at least one binning operation on each continuous feature generated based on the plurality of attribute information to obtain a bin-group feature composed of at least one binning feature, where each binning operation corresponds to one binning feature; and feature combining means for generating combined features of the machine learning samples by feature combination between the binning features and/or other discrete features generated based on the plurality of attribute information.
Optionally, the system further comprises: binning operation selection means for selecting the at least one binning operation from a predetermined number of binning operations such that the importance of the binning characteristics corresponding to the selected binning operation is not lower than the importance of the binning characteristics corresponding to unselected binning operations.
Alternatively, in the system, the binning operation selection means constructs a single-feature machine learning model for each of the binning features corresponding to the predetermined number of binning operations, determines the importance of each of the binning features based on the effect of each of the single-feature machine learning models, and selects the at least one binning operation based on the importance of each of the binning features, wherein a single-feature machine learning model corresponds to each of the binning features.
Optionally, in the system, the binning operation selecting means constructs a composite machine learning model for each of the binning features corresponding to the predetermined number of binning operations, determines the importance of each binning feature based on the effect of each composite machine learning model, and selects the at least one binning operation based on the importance of each binning feature, wherein the composite machine learning model includes a basic sub-model and an additional sub-model under a boosting framework, the basic sub-model corresponding to a basic feature subset and the additional sub-model corresponding to each binning feature.
Optionally, in the system, the feature combining means generates the combined features of the machine learning samples in an iterative manner according to a search strategy for the combined features.
Optionally, in the system, the binning operation selection means reselects the at least one binning operation for each iteration, and the combined features generated in each iteration are added to the basic feature subset as new discrete features.
Optionally, in the system, the feature combining means performs feature combination between the binning features and/or the other discrete features according to a Cartesian product.
Optionally, in the system, the at least one binning operation corresponds to an equal-width binning operation of different widths or an equal-depth binning operation of different depths, respectively.
Optionally, in the system, the different widths or different depths numerically constitute a geometric series or an arithmetic series.
Optionally, in the system, the binning feature indicates the bin into which the continuous feature is binned according to the corresponding binning operation.
Alternatively, in the system, each of the continuous features is formed by continuous-value attribute information itself among the plurality of attribute information, or each of the continuous features is formed by continuously transforming discrete-value attribute information among the plurality of attribute information.
Optionally, in the system, the continuous transformation indicates that statistics are computed over the values of the discrete-value attribute information.
Optionally, in the system, the binning operation selection means constructs each composite machine learning model by training the additional submodels separately with the basic submodel fixed.
According to another exemplary embodiment of the present invention, a computer-readable medium for generating combined features of machine learning samples is provided, wherein a computer program for performing the above-mentioned method is recorded on the computer-readable medium.
According to another exemplary embodiment of the present invention, a computing apparatus for generating combined features of machine learning samples is provided, comprising a storage component and a processor, wherein the storage component has stored therein a set of computer-executable instructions which, when executed by the processor, perform the above method.
In the method and system for generating combined features of machine learning samples according to the exemplary embodiments of the present invention, one or more binning operations are performed on each continuous feature, and the obtained bin-group features are combined with other features, so that the combined features constituting the machine learning samples are more effective and the effect of the machine learning model is improved.
Drawings
These and/or other aspects and advantages of the present invention will become more apparent and more readily appreciated from the following detailed description of the embodiments of the invention, taken in conjunction with the accompanying drawings of which:
FIG. 1 shows a block diagram of a system for generating combined features of machine learning samples, according to an example embodiment of the present invention;
FIG. 2 illustrates a block diagram of a training system for a machine learning model according to an exemplary embodiment of the present invention;
FIG. 3 illustrates a block diagram of a prediction system of a machine learning model according to an exemplary embodiment of the present invention;
FIG. 4 illustrates a block diagram of a training and prediction system of a machine learning model according to an exemplary embodiment of the present invention;
FIG. 5 shows a block diagram of a system for generating combined features of machine learning samples according to another example embodiment of the present invention;
FIG. 6 illustrates a flow diagram of a method of generating combined features of machine learning samples according to an exemplary embodiment of the invention;
FIG. 7 illustrates an example of a search strategy for generating combined features according to an exemplary embodiment of the present invention;
FIG. 8 illustrates a flow chart of a method of training a machine learning model according to an exemplary embodiment of the invention;
FIG. 9 illustrates a flow diagram of a prediction method of a machine learning model according to an exemplary embodiment of the invention; and
fig. 10 shows a flowchart of a method of generating combined features of machine learning samples according to another exemplary embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, exemplary embodiments thereof will be described in further detail below with reference to the accompanying drawings and detailed description.
In an exemplary embodiment of the present invention, automatic feature combination is performed as follows: at least one binning operation is performed on a single continuous feature to generate one or more binning features corresponding to that continuous feature, and the bin-group feature composed of these binning features is combined with other discrete features (e.g., single discrete features and/or other bin-group features). This makes the generated machine learning samples better suited to machine learning, so that better prediction results can be obtained.
Here, machine learning is a natural product of the development of artificial intelligence research, aiming to improve the performance of a system itself by computational means, using experience. In a computer system, "experience" usually exists in the form of "data", from which a "model" can be generated by a machine learning algorithm: providing empirical data to a machine learning algorithm produces a model based on these data, and the model provides a corresponding judgment, i.e., a prediction, when faced with a new situation. Whether a machine learning model is being trained, or a trained machine learning model is being used for prediction, the data needs to be converted into machine learning samples comprising various features. Machine learning may be implemented in the form of "supervised learning", "unsupervised learning", or "semi-supervised learning"; it should be noted that the exemplary embodiments of the present invention impose no particular limitation on the specific machine learning algorithm. It should also be noted that other means, such as statistical algorithms, may be incorporated during the training and application of the model.
FIG. 1 shows a block diagram of a system for generating combined features of machine learning samples according to an exemplary embodiment of the invention. Specifically, the system performs at least one binning operation on each continuous feature to be combined, so that a single continuous feature is converted into a bin-group feature composed of the corresponding at least one binning feature; it then combines the bin-group feature with other discrete features, so that the original data record can be characterized simultaneously from different angles and at different scales/levels. With the system, the combined features of the machine learning samples can be generated automatically, and the corresponding machine learning samples help to improve the machine learning effect (e.g., model stability, model generalization, etc.).
As shown in fig. 1, the data record obtaining apparatus 100 is configured to obtain a data record, wherein the data record includes a plurality of attribute information.
The data record may be data generated on-line, data generated and stored in advance, or data received from the outside through an input device or a transmission medium. Such data may relate to attribute information of an individual, business, or organization, such as identity, education background, occupation, assets, contact details, liabilities, income, profit, tax, and the like. Alternatively, the data may relate to attribute information of business-related items, such as the transaction amount, the two parties to the transaction, the subject matter, and the transaction location of a sales contract. It should be noted that the attribute information mentioned in the exemplary embodiments of the present invention may relate to the performance or nature of any object or matter in some respect, and is not limited to defining or describing individuals, objects, organizations, units, institutions, items, events, and so forth.
The data record acquisition device 100 may acquire structured or unstructured data from different sources, such as text data or numerical data. The acquired data records can be used to form machine learning samples to participate in the training/prediction process of machine learning. Such data may originate from within the entity desiring to obtain the model predictions, e.g., from a bank, business, school, etc. desiring to obtain the predictions; such data may also originate from other than the aforementioned entities, such as from data providers, the internet (e.g., social networking sites), mobile operators, APP operators, courier companies, credit agencies, and so forth. Optionally, the internal data and the external data can be used in combination to form a machine learning sample carrying more information.
The data may be input to the data record obtaining apparatus 100 through an input device, or automatically generated by the data record obtaining apparatus 100 according to the existing data, or may be obtained by the data record obtaining apparatus 100 from a network (e.g., a storage medium (e.g., a data warehouse) on the network), and furthermore, an intermediate data exchange device such as a server may facilitate the data record obtaining apparatus 100 to obtain the corresponding data from an external data source. Here, the acquired data may be converted into a format that is easy to handle by a data conversion module such as a text analysis module in the data record acquisition apparatus 100. It should be noted that the data record acquisition apparatus 100 may be configured as various modules composed of software, hardware, and/or firmware, and some or all of these modules may be integrated or cooperate together to accomplish a specific function.
The binning feature generation apparatus 200 is configured to perform at least one binning operation on each continuous feature generated based on the plurality of attribute information, so as to obtain a bin-group feature composed of at least one binning feature, where each binning operation corresponds to one binning feature.
Here, for at least part of the attribute information of the data record, corresponding continuous features may be generated. A continuous feature is a feature as opposed to a discrete feature (e.g., a categorical feature); its value may be a numerical value with a certain continuity, such as a distance, an age, or an amount. In contrast, as an example, the values of a discrete feature have no continuity; discrete features may be, for example, unordered categorical features such as "from Beijing", "from Shanghai", or "from Tianjin", or "gender is male" and "gender is female".
For example, the binning feature generation apparatus 200 may directly use some continuous value attribute in the data record as the corresponding continuous feature in the machine learning sample, e.g., may directly use the attributes of distance, age, amount, etc. as the corresponding continuous feature. That is, each of the continuous features may be formed of continuous-value attribute information itself among the plurality of attribute information.
Alternatively, the binned feature generation apparatus 200 may also process some attribute information (e.g., continuous value attribute and/or discrete value attribute information) in the data record to obtain a corresponding continuous feature, for example, a ratio of height to weight as the corresponding continuous feature. In particular, the continuous feature may be formed by continuously transforming discrete-value attribute information among the plurality of attribute information. As an example, the continuous transformation may indicate counting values of the discrete-value attribute information. For example, the continuous features may indicate statistical information that certain discrete-value attribute information relates to a prediction objective of the machine learning model. For example, in an example of predicting purchase probabilities, the discrete value attribute information of the seller merchant number may be transformed into a probabilistic statistical feature about the historical purchasing behavior of the corresponding seller merchant code.
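The continuous transformation of a discrete attribute described above can be sketched as follows. This is a minimal, assumption-laden illustration of the seller-merchant example: the field names (`seller_id`, the 0/1 purchase flag) and the function name are hypothetical, and the statistic is simply each seller's historical purchase rate.

```python
from collections import defaultdict

def purchase_rate_transform(records):
    """Continuously transform a discrete attribute (a seller id) into a
    continuous feature: the seller's historical purchase rate."""
    totals = defaultdict(int)  # impressions per seller
    buys = defaultdict(int)    # purchases per seller
    for seller_id, bought in records:
        totals[seller_id] += 1
        buys[seller_id] += bought
    return {sid: buys[sid] / totals[sid] for sid in totals}

# (seller_id, bought?) pairs from hypothetical historical behavior
history = [("s1", 1), ("s1", 0), ("s1", 1), ("s2", 0), ("s2", 0)]
rates = purchase_rate_transform(history)
print(rates)  # s1 -> 2/3, s2 -> 0.0
```

The resulting continuous values can then be binned like any other continuous feature.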
Furthermore, the binned feature generation apparatus 200 may generate other discrete features of the machine-learned samples in addition to the continuous features to be binned. Alternatively, the above-described features may be generated by other feature generation means (not shown). According to an exemplary embodiment of the present invention, any combination between the above features is possible, wherein consecutive features have been converted into binning features at the time of combination.
For each successive feature, the binned feature generation apparatus 200 may perform at least one binning operation, thereby enabling simultaneous acquisition of multiple discrete features characterizing certain attributes of the original data record from different angles, scales/layers.
Here, the binning operation is a specific method of discretizing a continuous feature: the value range of the continuous feature is divided into a plurality of sections (i.e., a plurality of bins), and the corresponding binning feature value is determined based on the divided bins. Binning operations can be broadly divided into supervised binning and unsupervised binning, each of which includes specific binning manners; for example, supervised binning includes minimum-entropy binning, minimum-description-length binning, etc., and unsupervised binning includes equal-width binning, equal-depth binning, k-means-cluster-based binning, etc. For each binning manner, corresponding binning parameters, such as width or depth, may be set. It should be noted that, according to the exemplary embodiments of the present invention, the binning operation performed by the binning feature generation apparatus 200 is limited neither in the kind of binning manner nor in the parameters of the binning operation, and the specific representation of the resulting binning features is likewise not limited.
The binning operations performed by the binning feature generation apparatus 200 may differ in binning manner and/or binning parameters. For example, the at least one binning operation may be of the same kind but with different operation parameters (e.g., depth, width, etc.), or may be of different kinds. Correspondingly, each binning operation yields one binning feature, and these binning features jointly form a bin-group feature. The bin-group feature can thus embody different binning operations, which improves the effectiveness of the machine learning material and provides a better basis for the training/prediction of the machine learning model.
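A bin-group feature built from several binning operations can be sketched as below. This is an illustrative sketch, not the patent's implementation: it applies equal-width binning with widths forming a geometric series, plus one equal-depth (equal-frequency) binning; all names are hypothetical.

```python
def equal_width_bin(value, lo, width):
    """Equal-width binning: index of the bin the value falls into."""
    return int((value - lo) // width)

def equal_depth_bins(values, depth):
    """Equal-depth binning: each bin holds `depth` values in sorted order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank // depth
    return bins

ages = [23, 37, 41, 58, 62, 75]
widths = [10, 20, 40]  # widths forming a geometric series
lo = min(ages)

# One binning feature per binning operation; together, a bin-group feature.
bin_group = {
    f"age_w{w}": [equal_width_bin(a, lo, w) for a in ages] for w in widths
}
bin_group["age_d3"] = equal_depth_bins(ages, depth=3)
print(bin_group)
```

Each column of `bin_group` characterizes the same continuous attribute at a different scale, which is what makes the subsequent combinations informative.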
The feature combination means 300 is configured to generate a combined feature of the machine-learned sample by feature combination between the binned features and/or other discrete features generated based on the plurality of attribute information.
As described above, the continuous features are converted into discrete features in the form of bin-group features, and one or more other discrete features may also be generated based on the attribute information. Accordingly, the feature combination apparatus 300 may perform any combination among the bin-group features and/or the other discrete features to obtain corresponding combined features. Here, as an example, the feature combinations between the binning features and/or the other discrete features may be performed according to a Cartesian product. However, it should be understood that the exemplary embodiments of the present invention are not limited to combination by Cartesian product; any manner of combining the above discrete features may be applied to the exemplary embodiments of the present invention.
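The Cartesian-product combination of two discrete features can be sketched as follows; the naming scheme for the combined values is an assumption made for illustration only.

```python
def cartesian_combine(name_a, feat_a, name_b, feat_b):
    """Combine two discrete features sample-wise; the vocabulary of the
    combined feature is (a subset of) the Cartesian product of the two
    features' vocabularies."""
    return [f"{name_a}={a}&{name_b}={b}" for a, b in zip(feat_a, feat_b)]

age_bin = [0, 0, 1, 2]           # a binning feature derived from "age"
city = ["bj", "sh", "bj", "tj"]  # another discrete feature
combined = cartesian_combine("age_bin", age_bin, "city", city)
print(combined)
```

Applying this to every binning feature in a bin-group yields one combined feature per binning operation, i.e., the same pair of attributes combined at several scales.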
As an example, the feature combination apparatus 300 may generate the combined features of the machine learning samples in an iterative manner according to a search strategy for the combined features. For example, under a heuristic search strategy such as beam search, at each level of the search tree the nodes are sorted by heuristic cost and only a certain number of nodes (the beam width) are kept; only these nodes are expanded at the next level, while the other nodes are pruned.
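A beam-search-style iteration over candidate feature combinations can be sketched as below. This is a hedged sketch: the scoring function is a placeholder for whatever importance measure the embodiment uses (e.g., the effect of a model built on the candidate), and all names are illustrative.

```python
def beam_search_combinations(base_feats, score, beam_width=2, depth=2):
    """Beam search over feature combinations: at each level, expand the
    kept nodes by one more base feature, score them, and keep only the
    best `beam_width` nodes; the rest are pruned."""
    beam = sorted((frozenset([f]) for f in base_feats),
                  key=score, reverse=True)[:beam_width]
    for _ in range(depth - 1):
        expanded = {node | {f} for node in beam
                    for f in base_feats if f not in node}
        beam = sorted(expanded, key=score, reverse=True)[:beam_width]
    return beam

# Toy score: prefer combinations containing "age_bin", then larger ones.
score = lambda combo: (("age_bin" in combo), len(combo))
best = beam_search_combinations(["age_bin", "city", "gender"], score)
print([sorted(c) for c in best])
```

With the toy score, every surviving node at depth 2 is a pair containing "age_bin"; in the embodiment, the kept nodes of one round would become candidate combined features for the next.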
The system shown in fig. 1, which generates the combined features of machine learning samples, may exist independently. It should be noted here that the manner in which the system acquires data records is not limited; that is, as an example, the data record acquisition device 100 may be a device capable of receiving and processing data records, or may simply be a device providing data records that have already been prepared.
In addition, the system shown in FIG. 1 may also be integrated into a system for model training and/or model prediction as part of performing feature processing.
FIG. 2 illustrates a block diagram of a training system for a machine learning model according to an exemplary embodiment of the present invention. The system shown in fig. 2 includes a machine learning sample generation device 400 and a machine learning model training device 500, in addition to the data record acquisition device 100, the binning feature generation device 200, and the feature combination device 300.
Specifically, in the system shown in fig. 2, the data record obtaining means 100, the binning feature generation means 200, and the feature combination means 300 may operate as in the system shown in fig. 1, wherein the data record obtaining means 100 may obtain historical data records that have already been labeled.
Further, the machine learning sample generation apparatus 400 is configured to generate a machine learning sample including at least a portion of the generated combined features. That is, the machine learning sample produced by the machine learning sample generation means 400 includes some or all of the combined features produced by the feature combination means 300, and may optionally include any other features generated based on the attribute information of the data record, for example, features directly formed by the attribute information itself, features obtained by performing feature processing on the attribute information, and the like. As described above, these other features may be generated by the binning feature generation apparatus 200, as an example, or by other means.
Specifically, the machine learning sample generation apparatus 400 may generate machine learning training samples; in particular, as an example, in the case of supervised learning, each machine learning training sample generated by the machine learning sample generation apparatus 400 may include two parts: features and a label.
The machine learning model training apparatus 500 is used to train a machine learning model based on the machine learning training samples. Here, the machine learning model training apparatus 500 may use any suitable machine learning algorithm (e.g., logistic regression) to learn an appropriate machine learning model from the machine learning training samples.
In the above example, a more stable and predictive machine learning model may be trained.
FIG. 3 illustrates a block diagram of a prediction system of a machine learning model according to an exemplary embodiment of the present invention. Compared with the system shown in fig. 1, the system of fig. 3 includes a machine learning sample generation device 400 and a machine learning model prediction device 600 in addition to the data record acquisition device 100, the bin feature generation device 200 and the feature combination device 300.
Specifically, in the system shown in fig. 3, the data record acquisition device 100, the binning feature generation device 200, and the feature combination device 300 may operate as in the system shown in fig. 1, where the data record acquisition device 100 may acquire data records to be predicted (e.g., new data records without labels, or historical data records used for testing). Accordingly, the machine learning sample generation apparatus 400 may produce machine learning prediction samples, which include only the feature portion, in a manner similar to that shown in fig. 2.
The machine learning model prediction apparatus 600 is configured to provide a prediction result corresponding to a machine learning prediction sample by using a trained machine learning model. Here, the machine learning model prediction apparatus 600 may provide a prediction result for a plurality of machine learning prediction samples in a batch.
Here, it should be noted that: the systems of fig. 2 and 3 may also be effectively fused to form a system capable of accomplishing both training and prediction of machine learning models.
In particular, FIG. 4 illustrates a block diagram of a training and prediction system for a machine learning model according to an exemplary embodiment of the present invention. The system shown in fig. 4 includes the data record acquisition device 100, the binned feature generation device 200, the feature combination device 300, the machine learning sample generation device 400, the machine learning model training device 500, and the machine learning model prediction device 600.
Here, in the system shown in fig. 4, the data record acquisition device 100, the binning feature generation device 200, and the feature combination device 300 may operate as in the system shown in fig. 1, where the data record acquisition device 100 may acquire, as appropriate, either historical data records or data records to be predicted. Further, the machine learning sample generation apparatus 400 may generate machine learning training samples or machine learning prediction samples according to the situation. Specifically, in the model training stage, the machine learning sample generation apparatus 400 may generate machine learning training samples; in particular, as an example, in the case of supervised learning, each machine learning training sample may include two parts: features and a label. In the model prediction stage, the machine learning sample generation apparatus 400 may generate machine learning prediction samples, where it should be understood that the feature portion of a machine learning prediction sample is consistent with the feature portion of the machine learning training samples.
Further, in the model training phase, the machine learning sample generation apparatus 400 supplies the generated machine learning training samples to the machine learning model training apparatus 500, so that the machine learning model training apparatus 500 trains the machine learning model based on the machine learning training samples. After the machine learning model training device 500 learns the machine learning model, the machine learning model training device 500 supplies the trained machine learning model to the machine learning model prediction device 600. Accordingly, in the model prediction stage, the machine learning sample generation apparatus 400 provides the generated machine learning prediction samples to the machine learning model prediction apparatus 600, so that the machine learning model prediction apparatus 600 provides prediction results for the machine learning prediction samples using the machine learning model.
According to an exemplary embodiment of the present invention, at least one binning operation needs to be performed on each continuous feature. Here, the at least one binning operation may be determined in any suitable manner, for example from the experience of technical or business personnel, or automatically via technical means. As an example, the specific binning operations may be determined efficiently based on the importance of the corresponding binning features.
Fig. 5 shows a block diagram of a system for generating combined features of machine learning samples according to another exemplary embodiment of the present invention. Compared to the system shown in fig. 1, the system of fig. 5 includes a binning operation selection device 150 in addition to the data record acquisition device 100, the binning feature generation device 200 and the feature combination device 300.
In the system shown in fig. 5, the data record acquisition device 100, the binning feature generation device 200 and the feature combination device 300 may operate as in the system shown in fig. 1. Furthermore, the binning operation selection device 150 is configured to select the at least one binning operation from a predetermined number of binning operations such that the importance of the binning features corresponding to the selected binning operations is not lower than the importance of the binning features corresponding to the unselected binning operations. In this way, the effectiveness of machine learning can be preserved while the size of the combined feature space is reduced.
In particular, the predetermined number of binning operations may comprise a variety of binning operations that differ in binning mode and/or binning parameters. Here, performing each binning operation yields one corresponding binning feature; accordingly, the binning operation selection device 150 may determine the importance of these binning features and then select the binning operations corresponding to the more important binning features as the at least one binning operation to be performed by the binning feature generation device 200.
Here, the binning operation selection device 150 may automatically determine the importance of the binning feature in any suitable manner.
For example, the binning operation selection device 150 may construct a single-feature machine learning model for each binning feature among the binning features corresponding to the predetermined number of binning operations (one single-feature machine learning model per binning feature), determine the importance of each binning feature based on the effect of the corresponding single-feature machine learning model, and select the at least one binning operation based on the importance of each binning feature.
For another example, the binning operation selection device 150 may construct a composite machine learning model for each binning feature among the binning features corresponding to the predetermined number of binning operations, where each composite machine learning model includes, under a boosting framework, a basic sub-model corresponding to a basic feature subset and an additional sub-model corresponding to the respective binning feature. The importance of each binning feature is then determined based on the effect of the corresponding composite machine learning model, and the at least one binning operation is selected based on these importances. According to an exemplary embodiment of the present invention, the basic feature subset may be applied, fixed, to the basic sub-models of all relevant composite machine learning models, where any feature generated based on the attribute information of the data records may serve as a basic feature. For example, at least a part of the attribute information of the data records may be used directly as basic features. Further, as an example, in view of the actual machine learning problem, relatively important or fundamental features may be determined as basic features based on test calculations or on designations by business personnel. Here, in the case where combined features are generated in an iterative manner, the binning operation selection device 150 may select the binning operations for each iteration, and the combined features generated in each iteration are added as new discrete features to the basic feature subset.
It should be understood that the binning operation selection device 150 shown in fig. 5 may be incorporated into the training system and/or the prediction system shown in fig. 2-4.
A flowchart of a method of generating combined features of machine learning samples according to an exemplary embodiment of the present invention is described below with reference to fig. 6. Here, the method shown in fig. 6 may be performed by the system shown in fig. 1; as examples, it may be implemented entirely in software via a computer program, or performed by a specially configured computing device. For convenience of description, it is assumed that the method shown in fig. 6 is performed by the system shown in fig. 1.
As shown, in step S100, a data record is acquired by the data record acquisition apparatus 100, wherein the data record includes a plurality of attribute information.
Here, as an example, the data record obtaining apparatus 100 may collect data in a manual, semi-automatic or fully automatic manner, or process the collected raw data so that the processed data record has an appropriate format or form. As an example, the data record acquisition device 100 may collect data in batches.
Here, the data record obtaining apparatus 100 may receive the data record manually input by the user through an input device (e.g., a workstation). Further, the data record acquisition device 100 can systematically retrieve data records from a data source in a fully automated manner, for example, by systematically requesting a data source and obtaining the requested data from a response via a timer mechanism implemented in software, firmware, hardware, or a combination thereof. The data sources may include one or more databases or other servers. The manner in which the data is obtained in a fully automated manner may be implemented via an internal network and/or an external network, which may include transmitting encrypted data over the internet. Where servers, databases, networks, etc. are configured to communicate with one another, data collection may be automated without human intervention, but it should be noted that certain user input operations may still exist in this manner. The semi-automatic mode is between the manual mode and the full-automatic mode. The semi-automatic mode differs from the fully automatic mode in that a trigger mechanism activated by the user replaces, for example, a timer mechanism. In this case, the request for extracting data is generated only in the case where a specific user input is received. Each time data is acquired, the captured data may preferably be stored in non-volatile memory. As an example, a data warehouse may be utilized to store raw data collected during acquisition as well as processed data.
The data records obtained above may originate from the same or different data sources, that is, each data record may also be the result of a concatenation of different data records. For example, in addition to obtaining information data records (which include attribute information fields of income, academic history, post, property condition, and the like) filled by a customer when applying for opening a credit card to a bank, the data record obtaining apparatus 100 may also obtain other data records of the customer at the bank, such as loan records, daily transaction data, and the like, and these obtained data records may be spliced into a complete data record. Furthermore, the data record acquisition device 100 may also acquire data originating from other private or public sources, such as data originating from a data provider, data originating from the internet (e.g., social networking sites), data originating from a mobile operator, data originating from an APP operator, data originating from an express company, data originating from a credit agency, and so forth.
Optionally, the data record acquiring apparatus 100 may store and/or process the acquired data by means of a hardware cluster (such as a Hadoop cluster, a Spark cluster, etc.), for example, store, sort, and perform other offline operations. In addition, the data record acquisition device 100 may perform online streaming processing on the acquired data.
As an example, a data conversion module such as a text analysis module may be included in the data record obtaining device 100, and accordingly, in step S100, the data record obtaining device 100 may convert unstructured data such as text into more easily usable structured data for further processing or reference later. Text-based data may include emails, documents, web pages, graphics, spreadsheets, call center logs, transaction reports, and the like.
Next, in step S200, at least one binning operation is performed by the binning feature generation device 200 on each continuous feature generated based on the plurality of attribute information, to obtain a bin group feature composed of at least one binning feature, where each binning operation corresponds to one binning feature.
Specifically, step S200 is directed to generating bin group features composed of binning features, which can replace the original continuous features and participate in automatic combination among discrete features. To this end, for each continuous feature, a respective at least one binning feature may be obtained by performing at least one binning operation.
A continuous feature may be generated from at least a portion of the attribute information of the data record. As an example, continuous-valued attribute information such as distance, age, and amount in a data record can be used directly as a continuous feature. As another example, a continuous feature may be obtained by further processing certain attribute information of the data record; e.g., the ratio of height to weight may serve as a continuous feature. As a further example, a continuous feature may be formed by continuously transforming discrete-valued attribute information, for example by counting the values of the discrete-valued attribute information and using the resulting statistic as the continuous feature.
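The three derivations just described might look as follows; the attribute names and the precomputed city statistic are illustrative assumptions, not part of the patent:

```python
CITY_COUNTS = {"Beijing": 1200, "Shenzhen": 800}  # assumed precomputed statistic

def continuous_features(record):
    """Derive continuous features from a data record in the three ways above."""
    return {
        "age": float(record["age"]),                                 # attribute used directly
        "height_weight_ratio": record["height"] / record["weight"],  # derived from two attributes
        "city_count": float(CITY_COUNTS[record["city"]]),            # statistic of a discrete attribute
    }

feats = continuous_features({"age": 30, "height": 180, "weight": 72, "city": "Beijing"})
assert feats["height_weight_ratio"] == 2.5 and feats["city_count"] == 1200.0
```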
After the continuous features are obtained, at least one binning operation may be performed on the obtained continuous features by the binning group feature generating device 200, where the binning group feature generating device 200 may perform the binning operation in various binning manners and/or binning parameters.
Taking unsupervised equal-width binning as an example, assume that the value interval of a continuous feature is [0,100]. If the corresponding binning parameter (i.e., the width) is 50, 2 bins can be obtained; in this case a continuous feature with the value 61.5 falls into the 2nd bin, and if the two bins are numbered 0 and 1, the bin corresponding to the continuous feature is numbered 1. Alternatively, assuming a bin width of 10, 10 bins are obtained; in this case the continuous feature with the value 61.5 falls into the 7th bin, and if the ten bins are numbered 0 to 9, the corresponding bin is numbered 6. Alternatively, assuming a bin width of 2, 50 bins are obtained; in this case the continuous feature with the value 61.5 falls into the 31st bin, and if the fifty bins are numbered 0 to 49, the corresponding bin is numbered 30.
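The bin numbers in this example follow from a one-line calculation; `equal_width_bin` below is a hypothetical helper, not a component of the described system:

```python
import math

def equal_width_bin(value, low, width):
    """0-based number of the equal-width bin that `value` falls into,
    for bins [low, low+width), [low+width, low+2*width), ..."""
    return int(math.floor((value - low) / width))

# The three granularities from the example above, applied to the value 61.5 on [0, 100]:
assert equal_width_bin(61.5, 0, 50) == 1    # 2 bins  -> bin numbered 1
assert equal_width_bin(61.5, 0, 10) == 6    # 10 bins -> bin numbered 6
assert equal_width_bin(61.5, 0, 2) == 30    # 50 bins -> bin numbered 30
```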
After mapping a continuous feature to multiple bins, the corresponding feature value may be any custom-defined value. Here, the binning feature may indicate which bin the continuous feature is sorted into under the corresponding binning operation. That is, performing a binning operation generates, for each continuous feature, a multi-dimensional binning feature in which each dimension indicates whether the continuous feature falls into the corresponding bin; for example, "1" may indicate that the continuous feature is sorted into the corresponding bin and "0" that it is not. Accordingly, in the above example, assuming 10 bins, the binning feature is a 10-dimensional feature, and the binning feature corresponding to the continuous feature with the value 61.5 may be represented as [0,0,0,0,0,0,1,0,0,0].
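The one-hot representation above can be sketched as follows (again a hypothetical helper):

```python
def one_hot_bin_feature(value, low, high, width):
    """Multi-dimensional binning feature: one dimension per bin, with 1 in the
    dimension of the bin the continuous value is sorted into and 0 elsewhere."""
    n_bins = int((high - low) / width)
    idx = min(int((value - low) / width), n_bins - 1)  # clamp the upper boundary
    return [1 if i == idx else 0 for i in range(n_bins)]

# 10 equal-width bins on [0, 100]; the value 61.5 falls into the bin numbered 6:
assert one_hot_bin_feature(61.5, 0, 100, 10) == [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]
```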
Further, as an example, noise in the data records may also be reduced by removing possible outliers in the data samples prior to performing the binning operation. In this way, the effectiveness of machine learning using binning features can be further improved.
Specifically, an outlier bin may be additionally set such that continuous features with outlier values are sorted into the outlier bin. For example, for a continuous feature with the value interval [0,1000], a certain number of samples may be selected for pre-binning, for example equal-width binning with a bin width of 10; the number of samples in each bin is then recorded, and bins with a small number of samples (e.g., fewer than a threshold) may be merged into at least one outlier bin. As an example, if the bins at both ends contain few samples, those sparsely populated bins may be merged into an outlier bin while the remaining bins are kept; assuming that bins 0 to 10 each contain few samples, bins 0 to 10 may be merged into one outlier bin, so that all continuous features whose values fall in those bins are uniformly sorted into the outlier bin.
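A minimal sketch of this pre-binning and merging step, assuming the outlier bin is denoted by the id -1 (a convention chosen here for illustration):

```python
from collections import Counter

def merge_sparse_bins(values, width, min_count):
    """Pre-bin sample values with equal width, then map every bin whose sample
    count falls below `min_count` to a single outlier bin (id -1)."""
    counts = Counter(int(v // width) for v in values)
    sparse = {b for b, c in counts.items() if c < min_count}
    return {b: (-1 if b in sparse else b) for b in counts}

# Bin 0 holds only one sample, so it is folded into the outlier bin:
mapping = merge_sparse_bins([5.0] + [15.0] * 100 + [25.0] * 100, 10, 10)
assert mapping == {0: -1, 1: 1, 2: 2}
```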
According to an exemplary embodiment of the present invention, the at least one binning operation may be a binning operation with the same binning mode but different binning parameters; alternatively, the at least one binning operation may be a binning operation with different binning modes.
Binning modes include various modes of supervised binning and/or unsupervised binning. For example, supervised binning includes minimum-entropy binning, minimum description length binning, and the like, while unsupervised binning includes equal-width binning, equal-depth binning, k-means-clustering-based binning, and the like.
As an example, the at least one binning operation may correspond to equal-width binning operations of different widths. That is, the binning mode is the same but the binning granularity differs, so that the generated binning features better capture the regularities of the original data records, which benefits the training and prediction of the machine learning model. In particular, the different widths employed by the at least one binning operation may numerically form a geometric series; e.g., equal-width binning may be performed with widths of 2, 4, 8, 16, and so on. Alternatively, the different widths may numerically form an arithmetic series; e.g., equal-width binning may be performed with widths of 2, 4, 6, 8, and so on.
As another example, the at least one binning operation may correspond to equal-depth binning operations of different depths. Again, the binning mode is the same but the binning granularity differs, so that the generated binning features better capture the regularities of the original data records, which further benefits the training and prediction of the machine learning model. In particular, the different depths may numerically form a geometric series; e.g., binning may be performed with depths of 10, 100, 1000, 10000, and so on. Alternatively, the different depths may numerically form an arithmetic series; e.g., binning may be performed with depths of 10, 20, 30, 40, and so on.
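A bin group built from equal-width operations whose widths form a geometric series, as in the first example above, might be sketched as follows (helper name assumed):

```python
def bin_group_feature(value, low, widths):
    """One bin number per binning operation; here each operation is equal-width
    binning with a different width, so the widths set the granularities."""
    return [int((value - low) // w) for w in widths]

# Widths 2, 4, 8, 16 (a geometric series) applied to the value 61.5:
assert bin_group_feature(61.5, 0, [2, 4, 8, 16]) == [30, 15, 7, 3]
```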
For each continuous feature, after the corresponding at least one binning feature has been obtained by performing the binning operations, the binning feature generation device 200 may obtain the bin group feature by taking each binning feature as one constituent element. It can be seen that the bin group feature can be viewed as a collection of binning features and is therefore likewise used as a discrete feature.
In step S300, combined features of the machine learning samples are generated by the feature combination device 300 by performing feature combination among the bin group features and/or other discrete features generated based on the plurality of attribute information. Here, since the continuous features have been converted into bin group features serving as discrete features, arbitrary combinations can be made among the features, including the bin group features and other discrete features, to form the combined features of the machine learning samples. As an example, the combination between features may be realized via a Cartesian product; however, it should be noted that the combination is not limited thereto, and any manner of combining two or more discrete features with each other may be applied in the exemplary embodiments of the present invention.
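Realizing the combination via a Cartesian product can be sketched as follows, where each discrete feature is represented by its set of possible values (an illustrative simplification):

```python
from itertools import product

def combine(*value_sets):
    """Value space of a combined feature: the Cartesian product of the value
    spaces of the discrete features being combined."""
    return list(product(*value_sets))

# Combining a 2-valued discrete feature with a 3-valued bin-derived feature
# yields a combined feature with 6 possible values:
combos = combine(["M", "F"], [0, 1, 2])
assert len(combos) == 6 and ("F", 2) in combos
```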
Here, a single discrete feature may be regarded as a first-order feature, and according to an exemplary embodiment of the present invention, higher-order feature combinations of two-order, three-order, and the like may be performed until a predetermined cutoff condition is satisfied. As an example, the combined features of the machine-learned samples may be generated in an iterative manner according to a search strategy for the combined features.
Fig. 7 illustrates an example of a search tree for generating combined features according to an exemplary embodiment of the present invention. According to an exemplary embodiment of the invention, the search tree may be based on a heuristic search strategy such as a beam search, for example, where one layer of the search tree may correspond to a particular order of feature combinations.
Referring to fig. 7, it is assumed that the discrete features that can be combined include a feature a, a feature B, a feature C, a feature D, and a feature E, and as an example, the feature a, the feature B, and the feature C may be discrete features formed from discrete-value attribute information of data records itself, and the feature D and the feature E may be bin group features converted from continuous features.
According to the search strategy, in the first iteration, two first-order feature nodes, feature B and feature E, are selected; here, the nodes may be ranked by an index such as feature importance, and a subset of nodes is then selected for further expansion at the next layer.
In the next iteration, the second-order combined features BA, BC, BD, BE, EA, EB, EC and ED are generated based on feature B and feature E, and features BC and EA are then selected based on the ranking index. As an example, feature BE and feature EB can be regarded as the same combined feature.
The iteration continues in the manner described above until a certain cutoff condition, e.g., an order limit, is met. Here, the nodes (shown in solid lines) selected in each layer may be used as combined features for subsequent processing, e.g., as final adopted features or for further importance evaluation, while the remaining features (shown in dashed lines) are pruned.
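The layer-by-layer expansion and pruning described above can be sketched as a beam search; the importance function `score` is an assumed stand-in for whatever ranking index is used:

```python
def beam_search_combinations(features, score, beam_width, max_order):
    """Keep the `beam_width` best nodes per layer and expand each by one more
    base feature; every kept node is a selected combined feature."""
    beam = sorted(((f,) for f in features), key=score, reverse=True)[:beam_width]
    selected = list(beam)
    for _ in range(max_order - 1):
        candidates = {tuple(sorted(c + (f,)))      # BE and EB count as one combo
                      for c in beam for f in features if f not in c}
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
        selected.extend(beam)
    return selected

# Toy importance: each feature has a weight, a combination scores their sum.
weights = {"A": 1, "B": 5, "C": 3, "D": 2, "E": 4}
sel = beam_search_combinations("ABCDE", lambda c: sum(weights[f] for f in c), 2, 2)
assert sel[:2] == [("B",), ("E",)]                 # first-order picks, as in fig. 7
assert set(sel[2:]) == {("B", "E"), ("B", "C")}    # second-order picks
```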
FIG. 8 illustrates a flowchart of a training method of a machine learning model according to an exemplary embodiment of the present invention. The method shown in fig. 8 includes steps S400 and S500 in addition to the above-described steps S100, S200, and S300.
Specifically, in the method shown in fig. 8, steps S100, S200 and S300 may be similar to the corresponding steps shown in fig. 6, where labeled historical data records may be acquired in step S100.
Further, in step S400, a machine learning training sample including at least a portion of the generated combined features may be generated by the machine learning sample generation apparatus 400, and in the case of supervised learning, the machine learning training sample may include both features and labels.
In step S500, a machine learning model may be trained by the machine learning model training apparatus 500 based on machine learning training samples. Here, the machine learning model training apparatus 500 may learn an appropriate machine learning model from the machine learning training samples using an appropriate machine learning algorithm.
After the machine learning model is trained, the trained machine learning model can be utilized to make predictions.
Fig. 9 illustrates a flowchart of a prediction method of a machine learning model according to an exemplary embodiment of the present invention. The method shown in fig. 9 includes steps S400 and S600 in addition to the above-described steps S100, S200, and S300.
Specifically, in the method shown in fig. 9, steps S100, S200 and S300 may be similar to the corresponding steps shown in fig. 6, wherein in step S100 a data record to be predicted may be obtained.
Further, in step S400, a machine learning prediction sample including at least a portion of the generated combined features may be generated by the machine learning sample generation apparatus 400, and the machine learning prediction sample may include only the feature portion.
In step S600, the machine learning model prediction apparatus 600 may provide a prediction result corresponding to the machine learning prediction sample using the machine learning model. Here, the prediction results may be provided for a plurality of machine learning prediction samples in a batch. Further, the machine learning model may be generated by a training method according to an exemplary embodiment of the present invention, and may also be received from the outside.
As described above, according to the exemplary embodiments of the present invention, when the binning characteristics are acquired, an appropriate binning operation may be automatically selected. A flowchart of a method of generating combined features of machine-learned samples according to another exemplary embodiment of the present invention will be described below with reference to fig. 10.
Referring to fig. 10, steps S100, S200 and S300 are similar to the corresponding steps shown in fig. 6, and details will not be repeated here. Compared to the method of fig. 6, the method of fig. 10 further comprises a step S150 in which, for each successive feature, at least one binning operation to be performed for the successive feature may be selected by the binning operation selection means 150 from a predetermined number of binning operations such that the importance of the binning feature corresponding to the selected binning operation is not lower than the importance of the binning feature corresponding to the non-selected binning operation.
As an example, the binning operation selection device 150 may construct a single-feature machine learning model for each binning feature among the binning features corresponding to the predetermined number of binning operations (one single-feature machine learning model per binning feature), determine the importance of each binning feature based on the effect of the corresponding single-feature machine learning model, and select the at least one binning operation based on the importance of each binning feature.
For example, assume that for a continuous feature F there is a predetermined number M (M being an integer greater than 1) of binning operations, corresponding to M binning features fm, where m ∈ [1, M]. Accordingly, the binning operation selection device 150 may use a portion of the historical data records to construct M single-feature machine learning models (where each single-feature machine learning model makes predictions for the machine learning problem based on a single binning feature fm), then measure the effect of the M single-feature machine learning models on the same test dataset, e.g., the AUC (Area Under the ROC Curve, where ROC denotes the Receiver Operating Characteristic), and determine the at least one binning operation to be finally performed based on the AUC ranking.
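One way to realize this selection without an external ML library is to compute the AUC directly from its rank definition and keep the best-scoring operations; model training itself is elided, and the effect values below are illustrative assumptions:

```python
def auc(labels, scores):
    """AUC as the probability that a random positive sample is scored above a
    random negative one (ties count 0.5) -- the rank definition of the area
    under the ROC curve."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def select_binning_ops(effects, k):
    """Keep the k binning operations whose single-feature models performed best."""
    return sorted(effects, key=effects.get, reverse=True)[:k]

assert auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]) == 1.0   # perfectly ranked test set
effects = {"width=2": 0.61, "width=4": 0.74, "width=8": 0.69}  # illustrative AUCs
assert select_binning_ops(effects, 2) == ["width=4", "width=8"]
```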
As another example, the binning operation selection device 150 may construct a composite machine learning model for each binning feature among the binning features corresponding to the predetermined number of binning operations, where each composite machine learning model includes, under a boosting framework (e.g., a gradient boosting framework), a basic sub-model corresponding to the basic feature subset and an additional sub-model corresponding to the respective binning feature; the importance of each binning feature is then determined based on the effect of the corresponding composite machine learning model, and the at least one binning operation is selected based on these importances.
For example, assume that for a continuous feature F there is a predetermined number M of binning operations, corresponding to M binning features fm, where m ∈ [1, M]. Accordingly, the binning operation selection device 150 may use a portion of the historical data records to construct M composite machine learning models (where each composite machine learning model makes predictions for the machine learning problem under a boosting framework, based on a fixed basic feature subset together with the corresponding binning feature fm), then measure the effects (e.g., AUC) of the M composite machine learning models on the same test dataset, and determine the at least one binning operation to be finally performed based on the AUC ranking. Preferably, in order to further improve operation efficiency and reduce resource consumption, the binning operation selection device 150 may fix the basic sub-model and train only the additional sub-model for each binning feature fm to build the respective composite machine learning model. Here, the basic feature subset on which the basic sub-model depends may be updated across the iterations of generating combined features.
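The efficiency trick of fixing the basic sub-model and fitting only the additional sub-model can be sketched for a squared-loss regression setting (a simplification of the boosting framework; the patent does not prescribe this particular sub-model form):

```python
def fit_additional_submodel(bin_ids, residuals):
    """Additional sub-model: mean residual per bin (a one-feature regressor
    fitted on the fixed base model's residuals)."""
    sums, counts = {}, {}
    for b, r in zip(bin_ids, residuals):
        sums[b] = sums.get(b, 0.0) + r
        counts[b] = counts.get(b, 0) + 1
    return {b: sums[b] / counts[b] for b in sums}

def composite_predict(base_pred, submodel, bin_ids):
    """Composite model = fixed basic sub-model output + additional sub-model output."""
    return [p + submodel.get(b, 0.0) for p, b in zip(base_pred, bin_ids)]

# The base model underfits; the additional sub-model built on one candidate
# binning feature corrects it via the residuals:
y         = [1.0, 1.0, 3.0, 3.0]
base_pred = [2.0, 2.0, 2.0, 2.0]          # fixed basic sub-model output
bins      = [0, 0, 1, 1]                  # candidate binning feature
residuals = [t - p for t, p in zip(y, base_pred)]
sub = fit_additional_submodel(bins, residuals)
assert composite_predict(base_pred, sub, bins) == y
```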
In an example where the combined features of the machine learning samples are generated in an iterative manner according to a search strategy over combined features, such as the one shown in fig. 7, step S150 may be performed in each iteration round to update the at least one binning operation, and the combined features generated in each round are added to the basic feature subset as new discrete features. For example, in the example of fig. 7, in the first iteration round, the basic feature subset of the composite machine learning model may be empty, or may include at least a portion of the first-order features (e.g., feature A, feature B and feature C as discrete features) or all of them (e.g., feature A, feature B and feature C as discrete features, along with the original continuous features corresponding to feature D and feature E). After the first round, features B and E are added to the basic feature subset; after the second round, the combined features BC and EA are added; after the third round, the combined features BCD and EAB are added, and so on. It should be noted that the number of feature combinations selected in each round is not limited to one. Meanwhile, for each round, the binning operations are determined anew by constructing the composite machine learning models again, so that the continuous features are converted into the corresponding binned-group features according to the newly determined binning operations and are combined with other discrete features in the next round.
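The iterative search can be sketched as follows (a greedy toy version under assumed data: three discrete features A, B, C, a pairwise crossing scheme, and a training-set AUC as a stand-in "effect" measure for the composite-model evaluation described above; none of these specifics come from the patent):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(2)
n = 2000
features = {                                   # current basic feature subset
    "A": rng.integers(0, 3, n),
    "B": rng.integers(0, 3, n),
    "C": rng.integers(0, 3, n),
}
y = ((features["A"] == 1) & (features["B"] == 2)).astype(int)  # depends on the A x B interaction

def cross(u, v):
    # encode each value pair (u, v) as one discrete combined feature
    _, codes = np.unique(np.column_stack([u, v]), axis=0, return_inverse=True)
    return codes

def effect(columns):
    # "effect" of a model trained on the given feature columns (training AUC here)
    X = OneHotEncoder(handle_unknown="ignore").fit_transform(np.column_stack(columns))
    model = LogisticRegression(max_iter=1000).fit(X, y)
    return roc_auc_score(y, model.predict_proba(X)[:, 1])

for _ in range(2):                             # two search iteration rounds
    names = list(features)
    candidates = {}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            a, b = names[i], names[j]
            if a + b not in features:          # skip combinations already selected
                candidates[a + b] = cross(features[a], features[b])
    best = max(candidates, key=lambda k: effect(list(features.values()) + [candidates[k]]))
    features[best] = candidates[best]          # add the winner as a new discrete feature
```

Each round mirrors the text: candidate combinations are generated from the current basic feature subset, their effect is evaluated, and the selected combination joins the subset as a new discrete feature for the next round.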
It should be noted that the above step S150 can also be applied to the methods shown in fig. 8 and 9, and will not be described again here.
The devices shown in fig. 1-5 may each be configured as software, hardware, firmware, or any combination thereof that performs a particular function. These means may correspond, for example, to an application-specific integrated circuit, to pure software code, or to a combination of software and hardware elements or modules. Further, one or more functions implemented by these apparatuses may also be collectively performed by components in a physical entity device (e.g., a processor, a client, a server, or the like).
Methods and systems for generating combined features of machine learning samples and corresponding machine learning model training/prediction systems according to exemplary embodiments of the present invention are described above with reference to fig. 1-10. It is to be understood that the above-described method may be implemented by a program recorded on a computer-readable medium; for example, according to an exemplary embodiment of the present invention, there may be provided a computer-readable medium for generating combined features of machine learning samples, wherein a computer program for performing the following method steps is recorded on the computer-readable medium: (A) obtaining a data record, wherein the data record comprises a plurality of attribute information; (B) executing at least one binning operation for each continuous feature generated based on the plurality of attribute information to obtain a binning group feature consisting of at least one binning feature, wherein each binning operation corresponds to one binning feature; and (C) generating combined features of the machine learning samples by feature combination between the binning features and/or other discrete features generated based on the plurality of attribute information.
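Steps (A), (B) and (C) can be sketched end to end as follows (a toy illustration assuming scikit-learn; the attribute names `age` and `city`, the bin counts, and the string cross-encoding are all hypothetical):

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# (A) obtain data records comprising a plurality of attribute information (toy data)
rng = np.random.default_rng(3)
records = {
    "age": rng.uniform(18, 70, 6),             # continuous attribute information
    "city": rng.integers(0, 3, 6),             # discrete attribute information
}

# (B) perform at least one binning operation on each continuous feature;
#     each operation yields one binned feature, and together the binned
#     features form the binning group feature
binnings = [KBinsDiscretizer(n_bins=n, encode="ordinal", strategy="uniform")
            for n in (2, 4)]
age = np.asarray(records["age"]).reshape(-1, 1)
binned_group = np.column_stack([b.fit_transform(age).ravel() for b in binnings])

# (C) generate combined features by feature combination between a binned
#     feature and another discrete feature (a simple string cross)
combined = [f"age_bin4={int(b)}&city={c}"
            for b, c in zip(binned_group[:, 1], records["city"])]
```

Each entry of `combined` is one combined feature value of a machine learning sample, formed from a binned feature and a discrete feature.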
The computer program in the computer-readable medium may be executed in an environment deployed in a computer device such as a client, a host, a proxy device, a server, and the like, and it should be noted that the computer program may also be used to perform additional steps other than the above steps or perform more specific processing when the above steps are performed, and the contents of the additional steps and the further processing are described with reference to fig. 1 to 10, and will not be described again to avoid repetition.
It should be noted that the combined feature generation system and the machine learning model training/prediction system according to the exemplary embodiments of the present invention may rely entirely on the execution of computer programs to realize the corresponding functions; that is, each device corresponds to a step in the functional architecture of the computer program, so that the entire system is invoked through a special software package (e.g., a lib library) to realize the corresponding functions.
Alternatively, each of the means shown in fig. 1 to 5 may be implemented by hardware, software, firmware, middleware, microcode, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the corresponding operations may be stored in a computer-readable medium such as a storage medium, so that a processor may perform the corresponding operations by reading and executing the corresponding program code or code segments.
For example, exemplary embodiments of the present invention may also be implemented as a computing device comprising a storage component having stored therein a set of computer-executable instructions that, when executed by the processor, perform a combined feature generation method, a machine learning model training method, and/or a machine learning model prediction method.
In particular, the computing devices may be deployed in servers or clients, as well as on node devices in a distributed network environment. Further, the computing device may be a PC computer, tablet device, personal digital assistant, smart phone, web application, or other device capable of executing the set of instructions described above.
The computing device need not be a single computing device, but can be any device or collection of circuits capable of executing the above instructions (or instruction sets), individually or in combination. The computing device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In the computing device, the processor may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
Some of the operations described in the combined feature generation method and the machine learning model training/prediction method according to the exemplary embodiments of the present invention may be implemented by software, some of the operations may be implemented by hardware, and further, the operations may be implemented by a combination of hardware and software.
The processor may execute instructions or code stored in one of the memory components, which may also store data. Instructions and data may also be transmitted and received over a network via a network interface device, which may employ any known transmission protocol.
The memory component may be integral to the processor, e.g., having RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, the storage component may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The storage component and the processor may be operatively coupled or may communicate with each other, such as through an I/O port, a network connection, etc., so that the processor can read files stored in the storage component.
Further, the computing device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the computing device may be connected to each other via a bus and/or a network.
Operations involved in the combined feature generation method and the corresponding machine learning model training/prediction method according to exemplary embodiments of the present invention may be described as various interconnected or coupled functional blocks or functional diagrams. However, these functional blocks or functional diagrams may equally be integrated into a single logic device or operated according to imprecise boundaries.
For example, as described above, a computing device for generating combined features of machine learning samples according to exemplary embodiments of the present invention may include a storage component and a processor, wherein the storage component has stored therein a set of computer-executable instructions that, when executed by the processor, perform the steps of: (A) obtaining a data record, wherein the data record comprises a plurality of attribute information; (B) executing at least one binning operation for each continuous feature generated based on the plurality of attribute information to obtain a binning group feature consisting of at least one binning feature, wherein each binning operation corresponds to one binning feature; and (C) generating combined features of the machine-learned samples by combining features between the binned features and/or other discrete features produced based on the plurality of attribute information.
While exemplary embodiments of the invention have been described above, it should be understood that the above description is illustrative only and not exhaustive, and that the invention is not limited to the exemplary embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. Therefore, the protection scope of the present invention should be subject to the scope of the claims.

Claims (10)

1. A method of generating combined features of machine learning samples, comprising:
(A) obtaining a data record, wherein the data record comprises a plurality of attribute information;
(B) executing at least one binning operation for each continuous feature generated based on the plurality of attribute information to obtain a binning group feature consisting of at least one binning feature, wherein each binning operation corresponds to one binning feature; and
(C) generating combined features of the machine-learned samples by feature combining between the binned features and/or other discrete features generated based on the plurality of attribute information.
2. The method of claim 1, wherein prior to step (B), further comprising: (D) the at least one binning operation is selected from a predetermined number of binning operations such that the importance of the binning characteristics corresponding to the selected binning operation is not lower than the importance of the binning characteristics corresponding to unselected binning operations.
3. The method of claim 2, wherein in step (D), for each of the binning features corresponding to the predetermined number of binning operations, a single-feature machine learning model is constructed, an importance of each binning feature is determined based on an effect of each single-feature machine learning model, and the at least one binning operation is selected based on the importance of each binning feature,
and the single-feature machine learning model corresponds to each binning feature.
4. The method of claim 2, wherein in step (D), for each of the bin features corresponding to the predetermined number of bin operations, a composite machine learning model is constructed, an importance of each bin feature is determined based on an effect of each composite machine learning model, and the at least one bin operation is selected based on the importance of each bin feature,
the composite machine learning model comprises a basic sub-model and an additional sub-model based on a boosting framework, wherein the basic sub-model corresponds to the basic feature subset, and the additional sub-model corresponds to each of the binning features.
5. The method of claim 4, wherein the combined features of the machine-learned samples are generated in an iterative manner according to a search strategy for the combined features.
6. The method of claim 5, wherein step (D) is performed for each iteration round to update the at least one binning operation, and the combined features generated in each iteration round are added as new discrete features to the base feature subset.
7. The method according to claim 1, wherein each of the continuous features is formed by continuous-value attribute information itself among the plurality of attribute information, or is formed by continuously transforming discrete-value attribute information among the plurality of attribute information.
8. A system for generating combined features of machine-learned samples, comprising:
data record obtaining means for obtaining a data record, wherein the data record includes a plurality of attribute information;
a binning feature generating device, configured to perform at least one binning operation on each continuous feature generated based on the plurality of attribute information to obtain a binning feature composed of at least one binning feature, where each binning operation corresponds to one binning feature; and
feature combination means for generating a combined feature of the machine-learned sample by feature combination between the binned features and/or other discrete features generated based on the plurality of attribute information.
9. A computer-readable medium generating combined features of machine-learned samples, wherein a computer program for performing the method of any one of claims 1 to 7 is recorded on the computer-readable medium.
10. A computing device for generating combined features of machine-learned samples, comprising a storage component and a processor, wherein the storage component has stored therein a set of computer-executable instructions that, when executed by the processor, perform the method of any of claims 1 to 7.
CN202110446590.0A 2017-07-20 2017-07-20 Method and system for generating combined features of machine learning samples Pending CN112990486A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110446590.0A CN112990486A (en) 2017-07-20 2017-07-20 Method and system for generating combined features of machine learning samples

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710595326.7A CN107392319A (en) 2017-07-20 2017-07-20 Generate the method and system of the assemblage characteristic of machine learning sample
CN202110446590.0A CN112990486A (en) 2017-07-20 2017-07-20 Method and system for generating combined features of machine learning samples

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201710595326.7A Division CN107392319A (en) 2017-07-20 2017-07-20 Generate the method and system of the assemblage characteristic of machine learning sample

Publications (1)

Publication Number Publication Date
CN112990486A true CN112990486A (en) 2021-06-18

Family

ID=60337203

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202110446590.0A Pending CN112990486A (en) 2017-07-20 2017-07-20 Method and system for generating combined features of machine learning samples
CN201710595326.7A Pending CN107392319A (en) 2017-07-20 2017-07-20 Generate the method and system of the assemblage characteristic of machine learning sample

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201710595326.7A Pending CN107392319A (en) 2017-07-20 2017-07-20 Generate the method and system of the assemblage characteristic of machine learning sample

Country Status (2)

Country Link
CN (2) CN112990486A (en)
WO (1) WO2019015631A1 (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112990486A (en) * 2017-07-20 2021-06-18 第四范式(北京)技术有限公司 Method and system for generating combined features of machine learning samples
CN109840726B (en) * 2017-11-28 2021-05-14 华为技术有限公司 Article sorting method and device and computer readable storage medium
CN108090516A (en) * 2017-12-27 2018-05-29 第四范式(北京)技术有限公司 Automatically generate the method and system of the feature of machine learning sample
CN108090032B (en) * 2018-01-03 2021-03-23 第四范式(北京)技术有限公司 Visual interpretation method and device of logistic regression model
CN108510003A (en) * 2018-03-30 2018-09-07 深圳广联赛讯有限公司 Car networking big data air control assemblage characteristic extracting method, device and storage medium
CN109213833A (en) * 2018-09-10 2019-01-15 成都四方伟业软件股份有限公司 Two disaggregated model training methods, data classification method and corresponding intrument
CN110968887B (en) * 2018-09-28 2022-04-05 第四范式(北京)技术有限公司 Method and system for executing machine learning under data privacy protection
CN112101562B (en) * 2019-06-18 2024-01-30 第四范式(北京)技术有限公司 Implementation method and system of machine learning modeling process
CN110956272B (en) * 2019-11-01 2023-08-08 第四范式(北京)技术有限公司 Method and system for realizing data processing
CN111506575B (en) * 2020-03-26 2023-10-24 第四范式(北京)技术有限公司 Training method, device and system for network point traffic prediction model
US11301351B2 (en) 2020-03-27 2022-04-12 International Business Machines Corporation Machine learning based data monitoring
US11514369B2 (en) * 2020-06-16 2022-11-29 DataRobot, Inc. Systems and methods for machine learning model interpretation
CN112001452B (en) * 2020-08-27 2021-08-27 深圳前海微众银行股份有限公司 Feature selection method, device, equipment and readable storage medium
CN112163704B (en) * 2020-09-29 2021-05-14 筑客网络技术(上海)有限公司 High-quality supplier prediction method for building material tender platform
CN112380215B (en) * 2020-11-17 2023-07-28 北京融七牛信息技术有限公司 Automatic feature generation method based on cross aggregation
TW202226054A (en) 2020-12-17 2022-07-01 緯創資通股份有限公司 Object detection device and object detection method
CN115130619A (en) * 2022-08-04 2022-09-30 中建电子商务有限责任公司 Risk control method based on clustering selection integration

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2481296A1 (en) * 2002-04-19 2003-10-30 Computer Associates Think, Inc. Method and apparatus for discovering evolutionary changes within a system
CN106095942B (en) * 2016-06-12 2018-07-27 腾讯科技(深圳)有限公司 Strong variable extracting method and device
CN106407999A (en) * 2016-08-25 2017-02-15 北京物思创想科技有限公司 Rule combined machine learning method and system
CN112990486A (en) * 2017-07-20 2021-06-18 第四范式(北京)技术有限公司 Method and system for generating combined features of machine learning samples

Also Published As

Publication number Publication date
CN107392319A (en) 2017-11-24
WO2019015631A1 (en) 2019-01-24

Similar Documents

Publication Publication Date Title
CN112990486A (en) Method and system for generating combined features of machine learning samples
CN111797928A (en) Method and system for generating combined features of machine learning samples
JP6457693B1 (en) Systems and techniques for predictive data analysis
US10417528B2 (en) Analytic system for machine learning prediction model selection
US20210390461A1 (en) Graph model build and scoring engine
US10360517B2 (en) Distributed hyperparameter tuning system for machine learning
CN107871166B (en) Feature processing method and feature processing system for machine learning
CN113435602A (en) Method and system for determining feature importance of machine learning sample
EP4214652A1 (en) Computer-based systems, computing components and computing objects configured to implement dynamic outlier bias reduction in machine learning models
CN111783893A (en) Method and system for generating combined features of machine learning samples
CN114298323A (en) Method and system for generating combined features of machine learning samples
CN113570064A (en) Method and system for performing predictions using a composite machine learning model
Lima et al. Domain knowledge integration in data mining using decision tables: case studies in churn prediction
CN107273979B (en) Method and system for performing machine learning prediction based on service level
CN111797927A (en) Method and system for determining important features of machine learning samples
CN116757297A (en) Method and system for selecting features of machine learning samples
US11093833B1 (en) Multi-objective distributed hyperparameter tuning system
US20210075875A1 (en) Utilizing a recommendation system approach to determine electronic communication send times
CN113610240A (en) Method and system for performing predictions using nested machine learning models
CN116882520A (en) Prediction method and system for predetermined prediction problem
CN113822440A (en) Method and system for determining feature importance of machine learning samples
CN111369344B (en) Method and device for dynamically generating early warning rules
CN110866625A (en) Promotion index information generation method and device
Cadei et al. Machine Learning Advanced Algorithm to Enhance Production Optimization: An ANN Proxy Modelling Approach
CN113656692B (en) Product recommendation method, device, equipment and medium based on knowledge migration algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination