CN111611239A - Method, device, equipment and storage medium for realizing automatic machine learning

Info

Publication number
CN111611239A
CN111611239A (application CN202010306838.9A)
Authority
CN
China
Prior art keywords
training
data
machine learning
feature
operator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010306838.9A
Other languages
Chinese (zh)
Inventor
岳凌
郭夏玮
涂威威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
4Paradigm Beijing Technology Co Ltd
Original Assignee
4Paradigm Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 4Paradigm Beijing Technology Co Ltd filed Critical 4Paradigm Beijing Technology Co Ltd
Priority to CN202010306838.9A priority Critical patent/CN111611239A/en
Publication of CN111611239A publication Critical patent/CN111611239A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2255Hash tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/258Data format conversion from or to a database
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method, an apparatus, a device and a storage medium for implementing automatic machine learning. The method comprises: providing a first editing interface according to a first operation of editing a model training operator; acquiring training operator content input through the first editing interface, wherein the training operator content comprises an operation command for performing data preprocessing on the input training data, an operation command for performing feature engineering on the preprocessed training data, and an operation command for training a machine learning model group according to the result of the feature engineering, and the machine learning model group comprises at least one machine learning model; packaging the training operator content to obtain a model training operator; and performing automatic machine learning model training by using the model training operator.

Description

Method, device, equipment and storage medium for realizing automatic machine learning
Technical Field
The present invention relates to the field of artificial intelligence, and more particularly, to a method of implementing automatic machine learning, an apparatus including at least one computing device and at least one storage device, and a computer-readable storage medium.
Background
With the rapid development and application of machine learning technology, machine learning has achieved major breakthroughs in many fields.
At present, when machine learning is used in actual business, professional personnel are required to perform specialized operations such as programming and data configuration in advance in order to train a machine learning model. That is, training an existing machine learning model imposes a certain skill threshold on the person performing the training. This in turn tends to increase the labor cost of training machine learning models and to reduce training efficiency.
Therefore, how to provide a method for training a machine learning model without programming, data configuration and the like is a technical problem that urgently needs to be solved.
Disclosure of Invention
It is an object of the present invention to provide a new technical solution for implementing automatic machine learning.
According to a first aspect of the present invention, there is provided a method of implementing automatic machine learning, comprising:
providing a first editing interface according to a first operation of editing a model training operator;
acquiring training operator content input through the first editing interface; wherein the training operator content comprises: an operation command for performing data preprocessing on the input training data, an operation command for performing feature engineering on the preprocessed training data, and an operation command for training a machine learning model group according to the result of the feature engineering; the machine learning model group comprises at least one machine learning model;
packaging the training operator content to obtain the model training operator;
and carrying out automatic machine learning model training by using the model training operator.
Optionally, the method further includes:
providing a second editing interface according to a second operation of editing a model prediction operator;
acquiring predictor content input through the second editing interface; wherein the predictor content comprises: an operation command for processing the prediction data according to the data format conversion rule, the missing value filling rule and the initial time field processing rule used in the data preprocessing, an operation command for aligning the attribute information in the processed prediction data with the attribute information in the preprocessed training data, an operation command for performing feature generation on the aligned prediction data according to the result of the feature engineering, and an operation command for predicting the feature generation result according to the trained machine learning model group;
packaging the content of the prediction operator to obtain the model prediction operator;
and performing automatic machine learning model prediction by using the model prediction operator.
Optionally, the performing automatic machine learning model training by using the model training operator includes: performing a data preprocessing operation on the input training data; the data preprocessing operation includes at least one of:
a first item: performing data format conversion on the training data;
a second item: downsampling the training data;
a third item: marking the training data as labeled data and unlabeled data;
a fourth item: unifying the format of the label values in the training data;
a fifth item: automatically identifying and marking the type of each piece of attribute information contained in the training data;
a sixth item: performing missing value filling on the training data;
a seventh item: unifying the format of the initial time field in the training data, adding new time fields based on the unified result, and deleting the initial time field;
an eighth item: automatically identifying non-numerical data in the training data and hashing the non-numerical data.
Optionally, the performing automatic machine learning model training by using the model training operator further includes: carrying out characteristic engineering operation on the training data after data preprocessing; the operations of the feature engineering include:
down-sampling the training data after the data preprocessing to a first preset number;
performing first feature selection on the down-sampled training data to obtain basic features;
combining the basic features to generate new combined features;
and generating a training sample according to the basic characteristic and the combined new characteristic.
Optionally, the performing a first feature selection on the downsampled training data to obtain a basic feature includes:
extracting all attribute information included in the down-sampled training data, wherein the attribute information is used for forming features;
acquiring a characteristic importance value corresponding to each attribute information;
and obtaining the basic characteristics according to the characteristic importance value.
Optionally, the obtaining the basic feature according to the feature importance value includes:
sequencing all the feature importance values in a descending order to obtain a sequencing result;
and according to the sorting result, acquiring the attribute information corresponding to the top second preset number of feature importance values as the basic features.
Optionally, the combining the basic features to generate a combined new feature includes:
selecting a set feature generation rule according to the type of each attribute information contained in the training data;
and combining the basic features according to the selected feature generation rule to obtain a combined new feature.
Optionally, the generating a training sample according to the basic feature and the combined new feature includes:
performing a second feature selection on the basic features and the combined features;
and generating a training sample according to the features obtained by the second feature selection.
Optionally, the second feature selection on the basic feature and the combined feature includes:
acquiring a feature importance value of each basic feature and each combined feature;
sequencing all the feature importance values in a descending order to obtain a sequencing result;
and according to the sorting result, acquiring the attribute information corresponding to the top third preset number of feature importance values as the features required by the training sample.
Optionally, the performing automatic machine learning model prediction by using the model predictor includes: performing feature generation operation on the aligned prediction data according to the result of the feature engineering; the feature generation operation includes:
screening out a characteristic set from the result of the characteristic engineering; the feature set comprises basic features and combined new features;
identifying a feature generation rule corresponding to the combined new feature according to the combined new feature;
deleting attribute information which does not belong to the basic features from the attribute information in the processed prediction data to obtain the basic features of the prediction data;
generating new combined characteristics of the prediction data according to the attribute information in the prediction data and the characteristic generation rule;
and generating a prediction sample according to the basic characteristics of the prediction data and the combined new characteristics of the prediction data.
Optionally, the method further includes:
and when the number of the samples of the prediction samples is larger than a fourth preset number, updating the parameters of the corresponding combined new features in the prediction samples according to the parameters corresponding to the combined new features which are the same as the combined new features of the prediction data in the training samples.
Optionally, the performing automatic machine learning model prediction by using the model predictor includes: an operation of predicting the feature generation result according to the trained machine learning model group, the operation of predicting comprising:
inputting the feature generation result to a trained machine learning model group to obtain a prediction result of each machine learning model in the machine learning model group;
and taking the comprehensive value of the prediction result of each machine learning model as the prediction result corresponding to the prediction data.
Optionally, the taking the integrated value of the prediction result of each machine learning model as the prediction result corresponding to the prediction data includes:
and taking the average value of the prediction results of each machine learning model as the prediction result corresponding to the prediction data.
Optionally, the method further includes:
providing a plurality of sets of hyper-parameters of each of the machine learning models in the set of machine learning models;
selecting a corresponding optimal solution from each group of the hyper-parameters according to a training sample corresponding to the training data;
and setting the corresponding optimal solution as the corresponding hyper-parameter of the machine learning model.
Optionally, the machine learning model group includes:
a gradient boosting decision tree model, a random forest model, a factorization machine model, a domain-sensitive factorization machine model, and a linear regression model.
According to a second aspect of the present invention, there is provided an apparatus for implementing automatic machine learning, comprising:
the first providing module is used for providing a first editing interface according to the first operation of the editing model training operator;
the first acquisition module is used for acquiring training operator content input through the first editing interface; wherein the training operator content comprises: an operation command for performing data preprocessing on the input training data, an operation command for performing feature engineering on the preprocessed training data, and an operation command for training a machine learning model group according to the result of the feature engineering; the machine learning model group comprises at least one machine learning model;
the first packaging module is used for packaging the training operator content to obtain the model training operator;
and the training module is used for carrying out automatic machine learning model training by utilizing the model training operator.
Optionally, the apparatus further comprises:
the second providing module is used for providing a second editing interface according to the second operation of the editing model prediction operator;
the second acquisition module is used for acquiring predictor content input through the second editing interface; wherein the predictor content comprises: an operation command for processing the prediction data according to the data format conversion rule, the missing value filling rule and the initial time field processing rule used in the data preprocessing, an operation command for aligning the attribute information in the processed prediction data with the attribute information in the preprocessed training data, an operation command for performing feature generation on the aligned prediction data according to the result of the feature engineering, and an operation command for predicting the feature generation result according to the trained machine learning model group;
the second packaging module is used for packaging the content of the predictor to obtain the model predictor;
and the prediction module is used for performing automatic machine learning model prediction by using the model prediction operator.
Optionally, the training module is specifically configured to: perform a data preprocessing operation on the input training data; the data preprocessing operation includes at least one of:
a first item: performing data format conversion on the training data;
a second item: downsampling the training data;
a third item: marking the training data as labeled data and unlabeled data;
a fourth item: unifying the format of the label values in the training data;
a fifth item: automatically identifying and marking the type of each piece of attribute information contained in the training data;
a sixth item: performing missing value filling on the training data;
a seventh item: unifying the format of the initial time field in the training data, adding new time fields based on the unified result, and deleting the initial time field;
an eighth item: automatically identifying non-numerical data in the training data and hashing the non-numerical data.
Optionally, the training module is specifically configured to: carrying out characteristic engineering operation on the training data after data preprocessing; the operations of the feature engineering include:
down-sampling the training data after the data preprocessing to a first preset number;
performing first feature selection on the down-sampled training data to obtain basic features;
combining the basic features to generate new combined features;
and generating a training sample according to the basic characteristic and the combined new characteristic.
Optionally, the training module is specifically configured to:
extracting all attribute information included in the down-sampled training data, wherein the attribute information is used for forming features;
acquiring a characteristic importance value corresponding to each attribute information;
and obtaining the basic characteristics according to the characteristic importance value.
Optionally, the training module is specifically configured to:
sequencing all the feature importance values in a descending order to obtain a sequencing result;
and according to the sorting result, acquiring the attribute information corresponding to the top second preset number of feature importance values as the basic features.
Optionally, the training module is specifically configured to:
selecting a set feature generation rule according to the type of each attribute information contained in the training data;
and combining the basic features according to the selected feature generation rule to obtain a combined new feature.
Optionally, the training module is specifically configured to:
performing a second feature selection on the basic features and the combined features;
and generating a training sample according to the features obtained by the second feature selection.
Optionally, the training module is specifically configured to:
acquiring a feature importance value of each basic feature and each combined feature;
sequencing all the feature importance values in a descending order to obtain a sequencing result;
and according to the sorting result, acquiring the attribute information corresponding to the top third preset number of feature importance values as the features required by the training sample.
Optionally, the prediction module is specifically configured to: performing feature generation operation on the aligned prediction data according to the result of the feature engineering; the feature generation operation includes:
screening out a characteristic set from the result of the characteristic engineering; the feature set comprises basic features and combined new features;
identifying a feature generation rule corresponding to the combined new feature according to the combined new feature;
deleting attribute information which does not belong to the basic features from the attribute information in the processed prediction data to obtain the basic features of the prediction data;
generating new combined characteristics of the prediction data according to the attribute information in the prediction data and the characteristic generation rule;
and generating a prediction sample according to the basic characteristics of the prediction data and the combined new characteristics of the prediction data.
Optionally, the prediction module is further configured to:
and when the number of the samples of the prediction samples is larger than a fourth preset number, updating the parameters of the corresponding combined new features in the prediction samples according to the parameters corresponding to the combined new features which are the same as the combined new features of the prediction data in the training samples.
Optionally, the prediction module is further configured to: an operation of predicting the feature generation result according to the trained machine learning model group, the operation of predicting comprising:
inputting the feature generation result to a trained machine learning model group to obtain a prediction result of each machine learning model in the machine learning model group;
and taking the comprehensive value of the prediction result of each machine learning model as the prediction result corresponding to the prediction data.
Optionally, the prediction module is further configured to:
and taking the average value of the prediction results of each machine learning model as the prediction result corresponding to the prediction data.
Optionally, the apparatus further comprises:
a third providing module, configured to provide multiple sets of hyper-parameters of each machine learning model in the set of machine learning models;
the selection module is used for selecting a corresponding optimal solution from each group of the hyper-parameters according to the training sample corresponding to the training data;
and the setting module is used for setting the corresponding optimal solution as the corresponding hyper-parameter of the machine learning model.
Optionally, the machine learning model group includes:
a gradient boosting decision tree model, a random forest model, a factorization machine model, a domain-sensitive factorization machine model, and a linear regression model.
According to a third aspect of the present invention, there is provided an apparatus comprising at least one computing device and at least one storage device, wherein the at least one storage device is configured to store instructions for controlling the at least one computing device to perform the method according to any one of the first aspects.
According to a fourth aspect of the present invention, there is provided a computer-readable storage medium, characterized in that a computer program is stored thereon, which, when being executed by a processor, carries out the method according to any one of the first aspect.
In the embodiment, a first editing interface is provided according to a first operation of editing a model training operator; training operator content input through the first editing interface is acquired, wherein the training operator content comprises an operation command for performing data preprocessing on the input training data, an operation command for performing feature engineering on the preprocessed training data, and an operation command for training a machine learning model group according to the result of the feature engineering, and the machine learning model group comprises at least one machine learning model; the training operator content is packaged to obtain a model training operator; and automatic machine learning model training is performed by using the model training operator. Therefore, by packaging the model training operator, after a user inputs training data to the model training operator, the model training operator can automatically process the training data and finally obtain a trained machine learning model group. That is, the method provides a way of realizing automatic machine learning, so that a user can train machine learning models without programming, data configuration and other specialized capabilities.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a block diagram of an example of a hardware configuration of an electronic device that may be used to implement embodiments of the present disclosure;
FIG. 2 is a flow chart of a method for implementing automatic machine learning according to an embodiment of the present invention;
FIG. 3 is a flow chart of another method for implementing automatic machine learning according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an apparatus for implementing automatic machine learning according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of another apparatus for implementing automatic machine learning according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of an apparatus according to an embodiment of the present invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
< hardware configuration >
Fig. 1 is a block diagram of a hardware configuration of an electronic device implementing a method for implementing automatic machine learning according to an embodiment of the present invention.
The electronic device 1000 may generally be a laptop, desktop, tablet, etc.
The electronic device 1000 may include a processor 1100, a memory 1200, an interface device 1300, a communication device 1400, a display device 1500, an input device 1600, a speaker 1700, a microphone 1800, and so forth. The processor 1100 may be a central processing unit CPU, a microprocessor MCU, or the like. The memory 1200 includes, for example, a ROM (read only memory), a RAM (random access memory), a nonvolatile memory such as a hard disk, and the like. The interface device 1300 includes, for example, a USB interface, a headphone interface, and the like. Communication device 1400 is capable of wired or wireless communication, for example. The display device 1500 is, for example, a liquid crystal display panel, a touch panel, or the like. The input device 1600 may include, for example, a touch screen, a keyboard, and the like. A user can input/output voice information through the speaker 1700 and the microphone 1800.
Although a plurality of devices are shown in fig. 1 for each of the electronic devices 1000, the present invention may relate to only some of the devices, for example, the electronic device 1000 may relate to only the memory 1200 and the processor 1100.
In an embodiment of the present invention, the memory 1200 of the electronic device 1000 is used for storing instructions for controlling the processor 1100 to execute the method for implementing automatic machine learning provided by the embodiment of the present invention.
In the above description, the skilled person will be able to design instructions in accordance with the disclosed solution. How the instructions control the operation of the processor is well known in the art and will not be described in detail herein.
< method examples >
In the present embodiment, a method of implementing automatic machine learning is provided, which may be performed by the electronic device 1000 as shown in fig. 1.
As shown in fig. 2, the method for implementing automatic machine learning according to this embodiment includes the following steps S2100 to S2400:
S2100, providing a first editing interface according to a first operation of editing a model training operator.
In one embodiment, the first operation may be a click operation on an icon, a link, or a file package that implements editing of the model training operator.
In this embodiment, the first operation is to instruct the electronic device 1000 to provide a first editing interface. When the electronic device 1000 receives the first operation, the first operation is responded, and a first editing interface is displayed.
In this embodiment, the first editing interface includes an editing entry, and the editing entry may be an input box, a drop-down list, a voice input interface, or the like. And the first editing interface is used for inputting the training operator content by the user.
S2200, acquiring training operator content input through the first editing interface.
Wherein the training operator content includes: an operation command for performing data preprocessing on the input training data, an operation command for performing feature engineering on the preprocessed training data, and an operation command for training the machine learning model group according to the result of the feature engineering; the machine learning model group includes at least one machine learning model.
In this embodiment, the developer edits the training operator content in advance, and then provides identification information of the training operator content edited in advance, such as a link, a directory, and the like of the training operator content.
In one embodiment, the user can input the identification information through the editing entry of the first editing interface, so that the training operator content can be input through the first editing interface.
In one embodiment, the machine learning models included in the set of machine learning models may be: a gradient boosting decision tree model, a random forest model, a factorization machine model, a domain-sensitive factorization machine model, and a linear regression model.
Among these, the factorization machine model is also commonly referred to as the Factorization Machines (FM) model, and the domain-sensitive factorization machine model is also commonly referred to as the Field-aware Factorization Machines (FFM) model.
In addition, the training data may be raw data in actual business, that is, the format of the training data is not limited in this embodiment. Thus, the format requirement on the training data is reduced, and the intelligence of the method for realizing the automatic machine learning provided by the embodiment is improved.
And S2300, packaging the training operator content to obtain the model training operator.
In this embodiment, the training operator content is packaged, so that the operation command for performing data preprocessing on the input training data, the operation command for performing feature engineering on the preprocessed training data, and the operation command for training the machine learning model group according to the result of the feature engineering, which are included in the training operator content, are sequentially connected in series. Meanwhile, an input entry is set so that the user can input training data, and an output exit is set to provide the trained machine learning model group to the user.
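For illustration only, the following Python sketch shows how such serial packaging could look in principle; the class and function names (TrainingOperator, preprocess, feature_engineering, train_models) are hypothetical and are not taken from the patent.

    from typing import Any, Callable, List

    import pandas as pd


    class TrainingOperator:
        """Hypothetical sketch: chain the preprocessing, feature engineering and
        model-group training commands into one packaged operator with a single
        input entry (raw training data) and output exit (trained model group)."""

        def __init__(self,
                     preprocess: Callable[[pd.DataFrame], pd.DataFrame],
                     feature_engineering: Callable[[pd.DataFrame], pd.DataFrame],
                     train_models: Callable[[pd.DataFrame], List[Any]]):
            self.steps = [preprocess, feature_engineering]
            self.train_models = train_models

        def __call__(self, training_data: pd.DataFrame) -> List[Any]:
            data = training_data
            for step in self.steps:          # operation commands run in series
                data = step(data)
            return self.train_models(data)   # output exit: trained model group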
And S2400, performing automatic machine learning model training by using a model training operator.
In this embodiment, based on the model training operator obtained in S2300, after the user inputs training data to the model training operator, the model training operator processes the training data, so as to obtain a trained machine learning model group.
In the embodiment, a first editing interface is provided according to a first operation of editing a model training operator; training operator content input through the first editing interface is acquired, wherein the training operator content comprises an operation command for performing data preprocessing on the input training data, an operation command for performing feature engineering on the preprocessed training data, and an operation command for training a machine learning model group according to the result of the feature engineering, and the machine learning model group comprises at least one machine learning model; the training operator content is packaged to obtain a model training operator; and automatic machine learning model training is performed by using the model training operator. Therefore, by packaging the model training operator, after a user inputs training data to the model training operator, the model training operator can automatically process the training data and finally obtain a trained machine learning model group. That is, the method provides a way of realizing automatic machine learning, so that a user can train machine learning models without programming, data configuration and other specialized capabilities.
In one embodiment, based on the above S2200, the above S2400 includes an operation of preprocessing the input training data. The data preprocessing operation includes at least one of:
first, data format conversion is performed on training data.
In this embodiment, the actually input service data may have various data types and formats; therefore, for example, the different data types may be uniformly converted into the widely used pandas DataFrame format.
In one example, in the case that the service is a transaction service, the actually inputted service data may be information of a product that a merchant has historically transacted.
In addition, the actually input service data includes input data stored in the electronic device 1000 in a positive integer type, a floating point type, a character string type, and the like.
Second, the training data is downsampled.
In this embodiment, the total amount of input training data may be downsampled so as to leave only the number of samples preset in the electronic device 1000. The number of samples left is automatically configured by the electronic device 1000 according to the resources of the development environment, such as memory capacity and CPU.
In addition, if the machine learning in the present embodiment corresponds to a classification task, downsampling needs to be performed according to the ratio of positive and negative samples in the training data. Specifically, if the ratio of positive and negative samples contained in the training data is 1:2, the ratio of positive and negative samples needs to remain 1:2 after the downsampling. If the machine learning in this embodiment corresponds to a non-classification task, random downsampling may be performed.
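As an informal illustration of ratio-preserving downsampling, the following sketch assumes the training data is a pandas DataFrame with a binary label column; the function name, the column name and the sampling logic are illustrative assumptions rather than the patented implementation.

    import pandas as pd


    def downsample(df: pd.DataFrame, n_samples: int, label_col: str = "label",
                   classification: bool = True, seed: int = 0) -> pd.DataFrame:
        """Downsample to roughly n_samples rows; for a classification task keep
        the positive/negative ratio, otherwise sample uniformly at random."""
        if len(df) <= n_samples:
            return df
        if classification:
            frac = n_samples / len(df)
            # sample every label group with the same fraction to preserve the ratio
            return (df.groupby(label_col, group_keys=False)
                      .apply(lambda g: g.sample(frac=frac, random_state=seed)))
        return df.sample(n=n_samples, random_state=seed)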
Third, the training data is labeled as labeled data and unlabeled data.
In this item, the labeled training data and the unlabeled training data are marked, so that the electronic device 1000 can quickly identify, in a subsequent process, whether a piece of training data is labeled data or unlabeled data.
In addition, based on the third item, feature generation can be performed on both the labeled and the unlabeled training data, so that the unlabeled training data can enrich, through feature engineering, the expressive power of the features in the labeled data. Finally, the machine learning model group is trained with the labeled training data fused with information from the unlabeled training data.
And fourthly, unifying the format of the label values in the training data.
In the fourth item, the value of a label in the training data may be "0" or "1", or "positive" or "negative", etc. In this item, the values of the labels in all the training data are unified as "0" and "1", or unified as "positive" and "negative".
In one example, based on the fourth item, the obtained training data can be as shown in table 1 below:
user id Commodity Price label
1 BYD 100000 0
2 Audi (Audi) 400000 1
3 Huashi 10000 0
4 Huashi 8000 0
5 Apple mobile phone 5000 1
6 Apple mobile phone 6000 0
7 Toyota 300000 1
8 Apple mobile phone 6000 1
9 Apple mobile phone 6000 1
10 Huashi 11000 0
11 BYD 110000 0
12 Huashi 10000 0
13 Audi (Audi) 300000 0
TABLE 1
Each row in Table 1 represents a piece of labeled training data.
And fifthly, automatically identifying and marking the type of each piece of attribute information contained in the training data.
In the fifth item, the type of each attribute information included in the training data may be converted into a service type required in the subsequent feature engineering, and the service type is divided according to the physical meaning of the data feature and labeled on the data in advance. The traffic type may be, for example, a time type, a discrete value type, a continuous value type, an array type, and a dictionary type. Generally, if a user does not define a business type by himself, the electronic device 1000 may convert a floating point type into a continuous value type, a non-floating point type into a discrete value type, and so on.
In an example, taking the above table 1 as an example, each attribute information included in the training data may be: user id, commodity, price, label.
And sixthly, missing value filling is carried out on the training data.
In one example, for each piece of attribute information A, one piece of mapping attribute information A' is added. The value of the mapping attribute information A' is determined as follows: for a specific value in the attribute information A, if the value has no valid data, the value is considered a missing value and the corresponding value of the mapping attribute information A' is filled with 1; if the value has valid data, the corresponding value of the mapping attribute information A' is filled with 0.
In one example, if the training data is as shown in table 2 below, the sixth processing is performed to obtain table 3 below.
User id | Commodity          | Price  | label
--------+--------------------+--------+------
1       | BYD                | 100000 | 0
2       | Audi               | 400000 | 1
3       |                    | 10000  | 0
4       | Huashi             | 8000   | 0
5       | Apple mobile phone | 5000   | 1
6       |                    | 6000   | 0
7       | Toyota             | 300000 | 1
8       | Apple mobile phone | 6000   | 1
9       |                    | 6000   | 1
10      | Huashi             | 11000  | 0
11      |                    | 110000 | 0
12      | Huashi             | 10000  | 0
13      | Audi               | 300000 | 0

TABLE 2
TABLE 3 (provided as an image in the original publication)
In Tables 2 and 3, the attribute information A corresponds to the Commodity column, and the mapping attribute information A' corresponds to the Commodity_isnan column.
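A minimal pandas sketch of this missing-value indicator scheme is given below; the helper name and the assumption that the label column is excluded are illustrative, not taken from the patent.

    import pandas as pd


    def add_missing_indicators(df: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
        """For every attribute column A, add a mapping column A_isnan that is 1
        where the value of A is missing and 0 otherwise (cf. Tables 2 and 3)."""
        out = df.copy()
        for col in df.columns:
            if col == label_col:              # the label is not an attribute column
                continue
            out[f"{col}_isnan"] = df[col].isna().astype(int)
        return out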
And seventhly, unifying the format of the initial time field in the training data, adding a new time field based on the unified result, and deleting the initial time field.
In this item, for a column in the training data whose attribute information is time, time values in different formats in the column are converted into a unified data format Date. The Date column is parsed to obtain year, month, day, week and hour information, which is added to the training data as new hash-value and continuous-value columns, respectively, serving as new attribute information. Meanwhile, the timestamp of the Date column is taken as a new continuous-value feature and used as new attribute information in the training data, and the initial time field is deleted.
In one example, if the training data is as shown in table 4 below, the seventh processing is performed to obtain table 5 below.
TABLE 4 (provided as an image in the original publication)

TABLE 5 (provided as an image in the original publication)
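The following sketch shows one possible pandas realization of this time-field expansion; the derived column names and the epoch-second timestamp are assumptions made for illustration.

    import pandas as pd


    def expand_time_field(df: pd.DataFrame, col: str) -> pd.DataFrame:
        """Unify a time column into a single Date format, derive year/month/day/
        weekday/hour columns plus a timestamp column, then drop the initial field."""
        out = df.copy()
        date = pd.to_datetime(out[col], errors="coerce")      # unified Date column
        out[f"{col}_year"] = date.dt.year
        out[f"{col}_month"] = date.dt.month
        out[f"{col}_day"] = date.dt.day
        out[f"{col}_weekday"] = date.dt.weekday
        out[f"{col}_hour"] = date.dt.hour
        # timestamp as a new continuous-value feature (seconds since the epoch)
        out[f"{col}_timestamp"] = (date - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")
        return out.drop(columns=[col])                        # delete the initial time field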
And the eighth item is used for automatically identifying non-numerical data in the training data and carrying out hash processing on the non-numerical data.
In this item, it may be determined whether a column is neither an integer column nor a floating-point column; if such a column exists, it is mapped to an integer string using a hash algorithm, and the model can learn the information in the original data column from the newly generated integer string.
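A rough sketch of such hashing in pandas follows; the bucket count and the use of Python's built-in hash are illustrative assumptions (in practice a stable hash, e.g. from hashlib, would be preferable, since the built-in hash is randomized per process).

    import pandas as pd
    from pandas.api.types import is_numeric_dtype


    def hash_non_numeric(df: pd.DataFrame, n_buckets: int = 2 ** 20) -> pd.DataFrame:
        """Map every column that is neither integer nor floating point to an
        integer code via hashing, so downstream models can still use it."""
        out = df.copy()
        for col in df.columns:
            if not is_numeric_dtype(df[col]):
                out[col] = df[col].astype(str).map(lambda v: hash(v) % n_buckets)
        return out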
In one embodiment, based on the above S2200, the above S2400 includes performing a feature engineering operation on the training data after the data preprocessing. The operation of the feature engineering comprises the following steps S2410-S2413:
s2410, down-sampling the training data after the data preprocessing to a first preset number.
In this embodiment, the first preset number may be a preset percentage of the total amount of training data after data preprocessing, and the percentage may be, for example, fifty percent. The first preset number may also be a preset count, such as 100,000, or a numerical value set according to a specific application scenario or a simulation test. It should be noted that, in this embodiment, the specific way of choosing the first preset number is not limited.
In this embodiment, the training data after the preprocessing is down-sampled, for example, randomly sampled, so that the data amount of the training data can be reduced, and the training time for training the machine learning model group can be reduced.
S2411, performing first feature selection on the down-sampled training data to obtain basic features.
In an embodiment, the first feature selection is performed on the downsampled training data to extract relatively important features from the downsampled training data as the basic features.
In one embodiment, the specific implementation of S2411 may be S2411-1, S2411-2, S2411-3 as follows:
S2411-1, extracting all attribute information included in the downsampled training data, wherein the attribute information is used for forming features.
S2411-2, acquiring a characteristic importance value corresponding to each attribute information.
In this embodiment, the feature importance value may be any one of the Hellinger distance, the random forest feature split gain, the gradient boosting decision tree feature split gain, and the like. In general, for a classification task, the Hellinger distance of each piece of attribute information can be calculated as the feature importance value of the corresponding attribute information. For a regression task, the random forest feature split gain of each piece of attribute information can be calculated as the feature importance value of the corresponding attribute information.
S2411-3, obtaining basic characteristics according to the characteristic importance value.
In one embodiment, the specific implementation of S2411-3 may be the following S2411-31 and S2411-32:
s2411-31, sorting all the feature importance values in a descending order to obtain a sorting result.
S2411-32, according to the sorting result, acquiring the attribute information corresponding to the top second preset number of feature importance values as the basic features.
In one embodiment, the second preset number may be a value set according to a specific application scenario or a simulation experiment. For example, in the present embodiment, the second preset number may be a preset percentage of the total amount of the training data after the down-sampling, and the percentage may be fifty percent. The second predetermined number may be a predetermined number. The second preset number may also be a numerical value set according to a specific application scenario or a simulation test.
In the case that the second preset number is a value set according to a specific application scenario or a simulation test, for different application scenarios, the value corresponding to the application scenario may be set, and the values corresponding to the different application scenarios may be the same or different. For example, the same value may be set for all application scenarios.
It should be noted that, in this embodiment, a specific value manner of the second preset number is not limited.
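For illustration, the sketch below ranks attributes by a random-forest split gain from scikit-learn (one of the importance values mentioned above) and keeps the top ones; it assumes the attribute columns are already numeric after preprocessing, and the function name and parameters are illustrative.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier


    def select_basic_features(df: pd.DataFrame, label_col: str, top_k: int) -> list:
        """Rank attributes by a feature importance value (random-forest split
        gain here), sort in descending order and keep the top_k as basic features."""
        X = df.drop(columns=[label_col])
        y = df[label_col]
        model = RandomForestClassifier(n_estimators=100, random_state=0)
        model.fit(X, y)
        importance = pd.Series(model.feature_importances_, index=X.columns)
        ranked = importance.sort_values(ascending=False)    # descending order
        return ranked.head(top_k).index.tolist()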
S2412, combining the basic features to generate new combined features.
In one embodiment, the above-mentioned S2412 may be implemented by the following S2412-1 and S2412-2:
s2412-1, selecting the set feature generation rule according to the type of each attribute information contained in the training data.
In this embodiment, the specific implementation of S2412-1 may be: selecting a set feature generation rule from a plurality of candidate feature generation rules according to the type of each piece of attribute information contained in the training data.
In one example, the above-mentioned multiple candidate feature generation rules may be the following feature generation rules:
count: the number of occurrences of the value in the original column is taken as the new feature corresponding to this original column.
Nunique: there are two discrete values A, B, make group to A and then find out how many kinds of value of B appear.
NumAdd: the values of two consecutive value columns are added as a new feature.
Numubtrack: the values of two consecutive value columns are subtracted as a new feature.
NumMultip: the values of two consecutive value columns are multiplied as a new feature.
NumDivision: the values of two consecutive value columns are divided as a new feature.
CatNumMean: there is a column of discrete values a and a column of continuous values B, and after a column group, the average of B columns is calculated.
CatNumStd: there is a column of discrete values A and a column of continuous values B, and after the A column group, the standard deviation of the B column is obtained.
CatNumMax: there is a column of discrete values A and a column of continuous values B, and the maximum value of column B is calculated after the group of column A.
CatNumMin: there is a column of discrete values A and a column of continuous values B, and the minimum value of B column is obtained after A column group.
TimeSubtract: the difference is calculated for the two columns of time rows.
NumOutlier: and normalizing each row of samples according to the column, and then averaging the column to reflect the outlier of the row of samples.
CatTimeDiff: the time difference value of the last occurrence of the value of the A column on the B column is calculated, and the difference value of two continuous occurrences of the same value after time sequence arrangement can be considered. Two features are generated, a past difference and a future difference, respectively.
Note that "one column" in the above refers to one type of attribute information.
Taking the above feature generation rule Count as an example, the Count is a feature generation rule that can perform feature generation for each type of attribute information. Taking the above feature generation rule Nunique as an example, Nunique is a feature generation rule that can perform feature generation on attribute information of a discrete value type.
In one embodiment, taking Count as an example, when the training data is as shown in Table 1 above, the training data shown in Table 6 below is obtained after performing the Count feature generation on the Commodity attribute.
User id | Commodity          | Price  | label | Commodity_Count
--------+--------------------+--------+-------+----------------
1       | BYD                | 100000 | 0     | 2
2       | Audi               | 400000 | 1     | 2
3       | Huashi             | 10000  | 0     | 4
4       | Huashi             | 8000   | 0     | 4
5       | Apple mobile phone | 5000   | 1     | 4
6       | Apple mobile phone | 6000   | 0     | 4
7       | Toyota             | 300000 | 1     | 1
8       | Apple mobile phone | 6000   | 1     | 4
9       | Apple mobile phone | 6000   | 1     | 4
10      | Huashi             | 11000  | 0     | 4
11      | BYD                | 110000 | 0     | 2
12      | Huashi             | 10000  | 0     | 4
13      | Audi               | 300000 | 0     | 2

TABLE 6
It should be noted that, in this embodiment, all of the feature generation rules matched with each piece of attribute information may be used as the selected feature generation rules, or only a part of the feature generation rules matched with each piece of attribute information may be selected as the selected feature generation rules.
S2412-2, combining the basic characteristics according to the selected characteristic generation rule to obtain new combined characteristics.
In this embodiment, the matched basic features are combined according to the selected feature generation rule to obtain a combined new feature.
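As an informal illustration of a few of the rules above, the following pandas sketch implements Count, Nunique and CatNumMean with groupby/transform; the function names are assumptions, and the commented usage line would reproduce the Commodity_Count column of Table 6.

    import pandas as pd


    def count_feature(df: pd.DataFrame, col: str) -> pd.Series:
        """Count: number of occurrences of each value in column col."""
        return df.groupby(col)[col].transform("count")


    def nunique_feature(df: pd.DataFrame, a: str, b: str) -> pd.Series:
        """Nunique: group by discrete column a and count distinct values of b."""
        return df.groupby(a)[b].transform("nunique")


    def cat_num_mean(df: pd.DataFrame, a: str, b: str) -> pd.Series:
        """CatNumMean: group by discrete column a, mean of continuous column b."""
        return df.groupby(a)[b].transform("mean")


    # e.g. df["Commodity_Count"] = count_feature(df, "Commodity")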
And S2413, generating a training sample according to the basic features and the combined new features.
In an embodiment, the specific implementation of S2413 may be: each piece of training data obtained by the downsampling in S2410, containing the basic features obtained in S2411 and the combined new features obtained in S2412, may be used as a training sample.
In another embodiment, the specific implementation of S2413 may further be: and (4) re-screening the basic features obtained based on the S2411 and the combined new features obtained based on the S2412, and taking the training data of the basic features and the combined new features reserved after screening as training samples. Based on this, the specific implementation of S2413 may be S2413-1 and S2413-2 as follows:
and S2413-1, carrying out second feature selection on the basic features and the combined features.
In this embodiment, the above-mentioned S2413-1 can be obtained by the following steps S2413-11, S2413-12 and S2413-13:
s2413-11, acquiring a feature importance value of each basic feature and each combined feature.
In this embodiment, the feature importance value may be any one of the Hellinger distance, the random forest feature split gain, the gradient boosting decision tree feature split gain, and the like. In general, for a classification task, the Hellinger distance of each feature can be calculated as the feature importance value of the corresponding feature. For a regression task, the random forest feature split gain of each feature can be calculated as the feature importance value of the corresponding feature.
S2413-12, sequencing all the feature importance values in a descending order to obtain a sequencing result.
S2413-13, according to the sorting result, acquiring the attribute information corresponding to the top third preset number of feature importance values as the features required by the training sample.
In one embodiment, the third preset number may be a value set according to a specific application scenario or a simulation experiment. For example, in the present embodiment, the third predetermined number may be a predetermined percentage of the total amount of the training data after the down-sampling, and the percentage may be fifty percent. The third predetermined number may be a predetermined number. The third preset number may also be a numerical value set according to a specific application scenario or a simulation test.
In the case that the third preset number is a value set according to a specific application scenario or a simulation test, for different application scenarios, the value corresponding to the application scenario may be set, and the values corresponding to the different application scenarios may be the same or different. For example, the same value may be set for all application scenarios.
In one embodiment, the specific implementation of S2413-13 may be: a threshold parameter r is set, a feature importance value set is formed from the obtained feature importance values, and the median m of the set is obtained. If a feature importance value in the set is greater than r × m, the feature corresponding to that feature importance value is retained, where r is a decimal greater than 0 and less than or equal to 1.
It should be noted that, in this embodiment, the specific way of choosing the third preset number is not limited.
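A minimal sketch of this median-threshold retention rule is given below; the function signature is an illustrative assumption.

    import numpy as np


    def second_feature_selection(importance: dict, r: float = 0.5) -> list:
        """Keep a feature when its importance value exceeds r * m, where m is
        the median of all feature importance values and 0 < r <= 1."""
        m = np.median(list(importance.values()))
        return [name for name, value in importance.items() if value > r * m]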
S2413-2, generating a training sample according to the features obtained by the second feature selection.
In this embodiment, by the second feature selection, the unimportant features in the training data can be deleted, which greatly reduces the data processing amount of the electronic device 1000.
On the basis of any one of the above embodiments, the method for implementing machine learning provided by the present embodiment further includes the following steps S2420-S2422:
s2420, providing multiple sets of hyper-parameters of each machine learning model in the machine learning model set.
S2421, selecting a corresponding optimal solution from the multiple sets of hyper-parameters of each machine learning model according to the training samples corresponding to the training data.
And S2422, setting the corresponding optimal solution as the hyper-parameter of the corresponding machine learning model.
In this embodiment, each of the multiple sets of hyper-parameters of each machine learning model in the machine learning model group has been verified by experiments to be a set of hyper-parameters with which the corresponding machine learning model obtains good prediction results.
In this embodiment, by setting a plurality of sets of hyper-parameters, the method for automatic machine learning provided in this embodiment can automatically select an optimal hyper-parameter.
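For illustration, the sketch below picks the best of several candidate hyper-parameter sets by score on a held-out validation split; the use of scikit-learn, AUC as the metric and an 80/20 split are assumptions rather than details from the patent.

    from typing import Any, Dict, List

    from sklearn.base import clone
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split


    def pick_best_hyperparameters(model, candidates: List[Dict[str, Any]],
                                  X, y, seed: int = 0) -> Dict[str, Any]:
        """Try every provided hyper-parameter set on a held-out split and
        return the one with the best validation score."""
        X_tr, X_val, y_tr, y_val = train_test_split(
            X, y, test_size=0.2, random_state=seed, stratify=y)
        best_params, best_score = None, float("-inf")
        for params in candidates:
            candidate = clone(model).set_params(**params)
            candidate.fit(X_tr, y_tr)
            score = roc_auc_score(y_val, candidate.predict_proba(X_val)[:, 1])
            if score > best_score:
                best_params, best_score = params, score
        return best_params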
On the basis of any of the above embodiments, as shown in fig. 3, the method for implementing automatic machine learning according to this embodiment further includes the following steps S2500-S2800:
and S2500, providing a second editing interface according to the second operation of the editing model prediction operator.
In one embodiment, the second operation may be a click operation on an icon, a link, or a file package that implements editing of the model prediction operator.
In this embodiment, the second operation is to instruct the electronic device 1000 to provide a second editing interface. When the electronic device 1000 receives the second operation, the second operation is responded to, and a second editing interface is displayed.
In this embodiment, the second editing interface includes an editing entry, and the editing entry may be an input box, a drop-down list, a voice input interface, or the like. And the editing entry of the second editing interface is used for inputting the content of the predictor by the user.
And S2600, obtaining the content of the predictor input through the second editing interface.
The predictor content includes: an operation command for processing the prediction data according to the data format conversion rule, the missing value filling rule and the initial time field processing rule used in the data preprocessing, an operation command for aligning the attribute information in the processed prediction data with the attribute information in the preprocessed training data, an operation command for performing feature generation on the aligned prediction data according to the result of the feature engineering, and an operation command for predicting the feature generation result according to the trained machine learning model group.
In this embodiment, the developer edits the predictor content in advance, and then provides identification information of the predictor content edited in advance, such as a link, a directory, and the like of the predictor content.
In one embodiment, the user can input the identification information through the editing entry of the second editing interface, so that the input of the content of the predictor through the second editing interface can be realized.
In this embodiment, the operation command in S2600 for processing the prediction data according to the data format conversion rule, the missing value filling rule and the initial time field processing rule in the data preprocessing refers to an operation command that performs data format conversion on the prediction data according to the data format conversion rule adopted in the data preprocessing of S2200, performs missing value filling on the format-converted prediction data according to the missing value filling rule adopted in the data preprocessing of S2200, and performs initial time field processing on the filled prediction data according to the initial time field processing rule adopted in the data preprocessing of S2200.
In this embodiment, the operation command in S2600 for aligning the attribute information in the processed prediction data with the attribute information in the training data after data preprocessing refers to a command that sorts the attribute information in the prediction data according to the order of the attribute information in the training data, so that the order of the attribute information in the sorted prediction data is the same as that in the training data.
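A minimal pandas sketch of these two steps, reusing rules recorded at training time and reordering columns to match the training data; the parameter names and the dict/list structures are illustrative assumptions.

```python
import pandas as pd

def preprocess_and_align(pred_df, training_columns, fill_values, time_fields):
    """Apply the training-time preprocessing rules to prediction data, then
    reorder its attribute information to match the training data.

    training_columns: column order of the preprocessed training data
    fill_values: per-column fill values recorded during training-time
                 missing value filling (illustrative structure)
    time_fields: names of the initial time fields to be unified
    """
    df = pred_df.copy()
    for col in time_fields:
        # Unify the initial time field format, as done in data preprocessing.
        df[col] = pd.to_datetime(df[col], errors="coerce")
    df = df.fillna(fill_values)                  # reuse the training-time fill rule
    return df.reindex(columns=training_columns)  # align attribute order with training data
```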
In this embodiment, the operation command in S2600 for performing feature generation on the aligned prediction data according to the result of the feature engineering refers to an operation command that performs feature engineering on the prediction data so that the features contained in the feature engineering result of the prediction data are the same as the features contained in the feature engineering result of the training data.
In this embodiment, the operation command for predicting the feature generation result from the trained machine learning model group in S2600 refers to an operation command for inputting the feature generation result to the machine learning model group trained in S2200.
S2700, packaging the content of the prediction operator to obtain the model prediction operator.
In this embodiment, the prediction operator content is packaged so that the operation commands it contains are connected in series in the following order: the operation command for processing the prediction data according to the data format conversion rule, the missing value filling rule and the initial time field processing rule in the data preprocessing; the operation command for aligning the attribute information in the processed prediction data with the attribute information in the training data after data preprocessing; the operation command for performing feature generation on the aligned prediction data according to the result of the feature engineering; and the operation command for predicting the feature generation result according to the trained machine learning model group. Meanwhile, an input entry is set for the user to input the prediction data, and an output exit is set to provide the prediction result to the user.
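One way such packaging could look in code is sketched below, assuming the four operation commands are supplied as Python callables; the class and parameter names are illustrative, not the patented implementation.

```python
class ModelPredictOperator:
    """Chains the four prediction-side operation commands in series behind a
    single input entry and output exit."""

    def __init__(self, preprocess, align, generate_features, predict):
        # The step functions come from the packaged prediction operator content.
        self.steps = [preprocess, align, generate_features, predict]

    def __call__(self, prediction_data):
        result = prediction_data
        for step in self.steps:   # operation commands connected in series
            result = step(result)
        return result             # output exit: the prediction result

# Usage sketch:
#   operator = ModelPredictOperator(prep_fn, align_fn, feature_fn, predict_fn)
#   prediction = operator(raw_prediction_data)
```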
S2800, automatic machine learning model prediction is carried out by using a model prediction operator.
In this embodiment, based on the model prediction operator obtained in S2700, after the user inputs the prediction data to the model prediction operator, the operator processes the prediction data without requiring any further operation from the user, and the prediction result is obtained.
With reference to S2100-S2400, in the method for implementing automatic machine learning provided in this embodiment, a user only needs to input training data to the model training operator and input prediction data to the model prediction operator to obtain the prediction result corresponding to the prediction data. This constitutes a complete automatic machine learning workflow.
Based on the above S2700, in one embodiment, the above S2800 includes an operation of performing feature generation on the aligned prediction data according to a result of the feature engineering. The operation includes the following steps S2810 to S2814:
S2810, screening a feature set from the result of the feature engineering; the feature set comprises basic features and combined new features.
In this embodiment, the result of the feature engineering in S2810 refers to the feature result obtained from the training data in S2400. S2810 specifically includes: extracting all the features contained in the corresponding training data from the result of the feature engineering.
It can be understood that the training data corresponding to the result of the feature engineering is the training sample.
S2811, according to the combined new features, identifying feature generation rules corresponding to the combined new features.
In this embodiment, each combined new feature corresponds to one feature generation rule. The feature generation rules corresponding to all the combined new features obtained in S2810 are identified.
S2812, deleting the attribute information that does not belong to the basic features from the attribute information in the aligned prediction data, to obtain the basic features of the prediction data.
In this embodiment, step S2812 specifically deletes, from the attribute information in the aligned prediction data, the attribute information that does not belong to the basic features contained in the training data of the feature generation result obtained in S2400, so as to obtain the basic features of the prediction data.
S2813, generating the combined new features of the prediction data according to the feature generation rules, using the attribute information in the prediction data.
In this embodiment, the combined new features of the prediction data are generated from the attribute information in the prediction data according to the feature generation rules identified in S2811.
S2814, generating a prediction sample according to the basic features of the prediction data and the combined new features of the prediction data.
In the embodiment, the prediction data is processed according to the training sample to obtain the prediction sample. This avoids unnecessary operations on the prediction data, thereby improving the efficiency of generating the prediction result. Meanwhile, the consistency of the characteristics of the prediction sample and the training sample is ensured, so that the accuracy of the prediction result is improved.
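A minimal sketch of S2810-S2814 on the prediction side, assuming the basic feature names and the combined-feature generation rules were recorded during training; the DataFrame-based structure and the example rule are illustrative assumptions.

```python
import pandas as pd

def generate_prediction_sample(aligned_df, basic_features, combination_rules):
    """Build a prediction sample whose features mirror the training sample.

    basic_features: feature names selected during training-time feature engineering
    combination_rules: mapping of combined-new-feature name -> function that
                       derives it from a row's attribute information
    """
    # Drop attribute information that does not belong to the basic features.
    sample = aligned_df[basic_features].copy()
    # Regenerate each combined new feature with its recorded generation rule.
    for name, rule in combination_rules.items():
        sample[name] = aligned_df.apply(rule, axis=1)
    return sample

# Example rule: a combined new feature multiplying two basic attributes.
example_rules = {"age_x_income": lambda row: row["age"] * row["income"]}
```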
In an embodiment, based on the foregoing S2810-S2814, the method for implementing automatic machine learning according to this embodiment further includes the following S2900:
S2900, when the number of prediction samples is larger than a fourth preset number, updating the parameters of the corresponding combined new features in the prediction samples according to the parameters corresponding to the same combined new features in the training samples.
In this embodiment, the fourth preset number may be determined empirically; its specific value is not limited in this embodiment. In addition, when the number of prediction samples is greater than the fourth preset number, the corresponding prediction scenario is usually a non-real-time one.
In this embodiment, when the number of prediction samples is large, the corresponding prediction scenario is usually non-real-time, that is, there is no strict requirement on prediction time. Based on this, the specific value corresponding to a combined new feature in the prediction data may be updated so that the value of the combined new feature is closer to the value distribution of the feature to be predicted, thereby obtaining combined new features with a higher information content. The update may be: adding the specific value corresponding to the same combined new feature in the training sample to the specific value corresponding to the combined new feature in the prediction data.
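One possible reading of this update is sketched below; the threshold value, the use of the training-sample mean as the superposed value, and all names are assumptions made for illustration only.

```python
def update_combined_features(pred_sample, train_sample, combined_features,
                             fourth_preset_number=10000):
    """For large, non-real-time prediction batches, superpose the training-sample
    value of each combined new feature onto the prediction-sample value.

    fourth_preset_number: placeholder threshold; 10000 is an arbitrary example.
    """
    if len(pred_sample) <= fourth_preset_number:
        return pred_sample
    updated = pred_sample.copy()
    for feature in combined_features:
        # Simple addition of the training-sample mean is only one interpretation
        # of "superposing" the training-sample value.
        updated[feature] = updated[feature] + train_sample[feature].mean()
    return updated
```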
Based on the above S2700, in one embodiment, the above S2800 includes an operation of predicting the feature generation result according to the trained machine learning model group. The operation includes the following steps S2820 and S2821:
S2820, inputting the feature generation result to the trained machine learning model group to obtain a prediction result of each machine learning model in the machine learning model group.
S2821, taking the comprehensive value of the prediction result of each machine learning model as the prediction result corresponding to the prediction data.
In an embodiment, the specific implementation of S2821 may be: taking the maximum value, the minimum value or the median of the prediction results of the machine learning models as the prediction result corresponding to the prediction data.
Alternatively, the specific implementation of S2821 may be: taking the average value of the prediction results of the machine learning models as the prediction result corresponding to the prediction data.
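A minimal sketch of S2820-S2821, assuming the trained models expose a scikit-learn-style predict method; the aggregation keywords mirror the comprehensive values mentioned above.

```python
import numpy as np

def ensemble_predict(models, feature_generation_result, combine="mean"):
    """Predict with every model in the trained group and combine the results.

    combine: "mean", "max", "min" or "median", i.e. the comprehensive value
    used as the prediction result corresponding to the prediction data.
    """
    predictions = np.column_stack(
        [model.predict(feature_generation_result) for model in models]
    )
    reducers = {"mean": np.mean, "max": np.max, "min": np.min, "median": np.median}
    return reducers[combine](predictions, axis=1)
```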
< apparatus embodiment >
As shown in fig. 4, the present embodiment provides an apparatus 40 for implementing automatic machine learning, including: a first providing module 41, a first obtaining module 42, a first packaging module 43, and a training module 44. Wherein:
a first providing module 41, configured to provide a first editing interface according to a first operation of editing the model training operator;
a first obtaining module 42, configured to obtain training operator content input through the first editing interface; the training operator content comprises: the method comprises the steps of carrying out an operation command of data preprocessing on input training data, carrying out an operation command of feature engineering on the training data after the data preprocessing, and carrying out an operation command of machine learning model group training according to a result of the feature engineering; the machine learning model group comprises at least one machine learning model;
a first encapsulation module 43, configured to encapsulate the training operator content to obtain the model training operator;
a training module 44, configured to perform automatic machine learning model training by using the model training operator.
In one embodiment, as shown in fig. 5, the apparatus 40 further comprises: a second providing module 45, a second obtaining module 46, a second encapsulating module 47, a predicting module 48, wherein:
a second providing module 45, configured to provide a second editing interface according to a second operation of the editing model predictor;
a second obtaining module 46, configured to obtain the content of the predictor input through the second editing interface; the predictor content includes: processing the predicted data according to a data format conversion rule, a missing value filling rule and an initial time field processing rule in the data preprocessing, aligning attribute information in the processed predicted data with attribute information in training data after data preprocessing, performing feature generation on the aligned predicted data according to a result of the feature engineering, and predicting a feature generation result according to a trained machine learning model group;
a second encapsulation module 47, configured to encapsulate the content of the predictor, so as to obtain the model predictor;
and a prediction module 48, configured to perform automatic machine learning model prediction using the model predictor.
In one embodiment, the training module 44 is specifically configured to: carry out a data preprocessing operation on the input training data; the data preprocessing operation includes at least one of the following items (the seventh and eighth items are sketched in code after this list):
the first item is used for carrying out data format conversion on the training data;
a second term that downsamples the training data;
a third item, labeling the training data as labeled data and unlabeled data;
fourthly, unifying the format of the label values in the training data;
a fifth item for automatically identifying and marking the type of each attribute information contained in the training data;
a sixth item that performs missing value padding on the training data;
the seventh item, unify the format of the initial time field in the training data, add new time field based on the unified result, and delete the initial time field;
and the eighth item is used for automatically identifying non-numerical data in the training data and carrying out hash processing on the non-numerical data.
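A minimal pandas sketch of the seventh and eighth items, assuming a single time field and a simple stable hash; the column names, the derived time fields and the hash width are illustrative assumptions.

```python
import hashlib
import pandas as pd

def preprocess_time_and_text(df, time_field="event_time"):
    """Unify the initial time field, derive new time fields, delete the initial
    field, and hash non-numerical attribute values."""
    out = df.copy()
    ts = pd.to_datetime(out[time_field], errors="coerce")  # unify the time format
    out["hour"] = ts.dt.hour                               # new time fields based on
    out["weekday"] = ts.dt.weekday                         # the unified result
    out = out.drop(columns=[time_field])                   # delete the initial time field
    for col in out.select_dtypes(include="object").columns:
        # Hash processing of non-numerical data into a fixed integer range.
        out[col] = out[col].map(
            lambda v: int(hashlib.md5(str(v).encode()).hexdigest(), 16) % (2 ** 20)
        )
    return out
```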
In one embodiment, the training module 44 is specifically configured to: carry out feature engineering operations on the training data after data preprocessing; the operations of the feature engineering include:
down-sampling the training data after the data preprocessing to a first preset number;
performing first feature selection on the down-sampled training data to obtain basic features;
combining the basic features to generate new combined features;
and generating a training sample according to the basic characteristic and the combined new characteristic.
In one embodiment, the training module 44 is specifically configured to:
extracting all attribute information included in the down-sampled training data, wherein the attribute information is used for forming features;
acquiring a characteristic importance value corresponding to each attribute information;
and obtaining the basic characteristics according to the characteristic importance value.
In one embodiment, the training module 44 is specifically configured to:
sorting all the feature importance values in descending order to obtain a sorting result;
and acquiring, according to the sorting result, the attribute information corresponding to the top second preset number of feature importance values as the basic features.
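A minimal sketch of this top-k selection; the function name and the dict-based importance structure are illustrative assumptions, with k standing in for the second preset number.

```python
def select_basic_features(importance, k):
    """Sort feature importance values in descending order and keep the
    attribute information of the top k values as the basic features."""
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Example: keep the two most important attributes as basic features.
basic = select_basic_features({"age": 0.4, "city": 0.1, "income": 0.3}, k=2)
```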
In one embodiment, the training module 44 is specifically configured to:
selecting a set feature generation rule according to the type of each attribute information contained in the training data;
and combining the basic features according to the selected feature generation rule to obtain a combined new feature.
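A sketch of rule-based feature combination keyed on attribute types; the two rules shown (numeric product, categorical cross) are common examples and are assumptions, since the disclosure does not fix the concrete rules.

```python
def combine_basic_features(df, attribute_types):
    """Combine basic features according to generation rules selected by the
    type of each attribute (df is a pandas DataFrame of basic features)."""
    numeric = [c for c, t in attribute_types.items() if t == "numeric"]
    categorical = [c for c, t in attribute_types.items() if t == "categorical"]
    out = df.copy()
    for i, a in enumerate(numeric):
        for b in numeric[i + 1:]:
            out[f"{a}_x_{b}"] = df[a] * df[b]  # numeric pair -> product feature
    for i, a in enumerate(categorical):
        for b in categorical[i + 1:]:
            # categorical pair -> cross (concatenated) feature
            out[f"{a}_and_{b}"] = df[a].astype(str) + "_" + df[b].astype(str)
    return out
```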
In one embodiment, the training module 44 is specifically configured to:
performing a second feature selection on the basic features and the combined new features;
and generating a training sample according to the features selected by the second feature selection.
In one embodiment, the training module 44 is specifically configured to:
acquiring a feature importance value of each basic feature and each combined new feature;
sorting all the feature importance values in descending order to obtain a sorting result;
and acquiring, according to the sorting result, the attribute information corresponding to the top third preset number of feature importance values as the features required by the training sample.
In one embodiment, prediction module 48 is specifically configured to: performing feature generation operation on the aligned prediction data according to the result of the feature engineering; the feature generation operation includes:
screening out a characteristic set from the result of the characteristic engineering; the feature set comprises basic features and combined new features;
identifying a feature generation rule corresponding to the combined new feature according to the combined new feature;
deleting attribute information which does not belong to the basic features from the attribute information in the processed prediction data to obtain the basic features of the prediction data;
generating new combined characteristics of the prediction data according to the attribute information in the prediction data and the characteristic generation rule;
and generating a prediction sample according to the basic characteristics of the prediction data and the combined new characteristics of the prediction data.
In one embodiment, the prediction module 48 is further configured to:
and when the number of the samples of the prediction samples is larger than a fourth preset number, updating the parameters of the corresponding combined new features in the prediction samples according to the parameters corresponding to the combined new features which are the same as the combined new features of the prediction data in the training samples.
In one embodiment, the prediction module 48 is further configured to: an operation of predicting the feature generation result according to the trained machine learning model group, the operation of predicting comprising:
inputting the feature generation result to a trained machine learning model group to obtain a prediction result of each machine learning model in the machine learning model group;
and taking the comprehensive value of the prediction result of each machine learning model as the prediction result corresponding to the prediction data.
In one embodiment, the prediction module 48 is further configured to:
and taking the average value of the prediction results of each machine learning model as the prediction result corresponding to the prediction data.
In one embodiment, the apparatus 40 further comprises:
a third providing module, configured to provide multiple sets of hyper-parameters of each machine learning model in the set of machine learning models;
a selection module, configured to select a corresponding optimal solution from the multiple sets of hyper-parameters of each machine learning model according to the training samples corresponding to the training data;
and the setting module is used for setting the corresponding optimal solution as the corresponding hyper-parameter of the machine learning model.
In one embodiment, the set of machine learning models includes:
a gradient boosting decision tree model, a random forest model, a factorization machine model, a field-aware factorization machine model, and a linear regression model.
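A minimal sketch of assembling such a model group with scikit-learn estimators; the factorization machine and field-aware factorization machine members are omitted because they are not in scikit-learn and would come from a third-party library, so the list below is only a partial, assumed example.

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression

def build_model_group():
    """Assemble a machine learning model group of the kind listed above."""
    return [
        GradientBoostingRegressor(),  # gradient boosting decision tree model
        RandomForestRegressor(),      # random forest model
        LinearRegression(),           # linear regression model
        # A factorization machine / field-aware factorization machine model
        # would be appended here from an external implementation.
    ]
```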
< device embodiment >
As shown in fig. 6, the present embodiment provides an apparatus 60 comprising at least one computing device 61 and at least one storage device 62, wherein the at least one storage device 62 is configured to store instructions for controlling the at least one computing device 61 to perform the method according to any one of the above method embodiments.
Fig. 6 shows a computing device 61 and a storage device 62.
< storage Medium embodiment >
The present embodiment provides a computer-readable storage medium, wherein a computer program is stored thereon, which computer program, when being executed by a processor, realizes the method according to any one of the above-mentioned method embodiments.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), with state information of computer-readable program instructions, which can execute the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are equivalent.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen in order to best explain the principles of the embodiments, the practical application, or technical improvements to the techniques in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (10)

1. A method of implementing automatic machine learning, comprising:
providing a first editing interface according to the first operation of the editing model training operator;
acquiring training operator content input through the first editing interface; the training operator content comprises: the method comprises the steps of carrying out an operation command of data preprocessing on input training data, carrying out an operation command of feature engineering on the training data after the data preprocessing, and carrying out an operation command of machine learning model group training according to a result of the feature engineering; the machine learning model group comprises at least one machine learning model;
packaging the training operator content to obtain the model training operator;
and carrying out automatic machine learning model training by using the model training operator.
2. The method of claim 1, further comprising:
providing a second editing interface according to the second operation of the editing model prediction operator;
acquiring the content of the predictor input through the second editing interface; the predictor content includes: processing the predicted data according to a data format conversion rule, a missing value filling rule and an initial time field processing rule in the data preprocessing, aligning attribute information in the processed predicted data with attribute information in training data after data preprocessing, performing feature generation on the aligned predicted data according to a result of the feature engineering, and predicting a feature generation result according to a trained machine learning model group;
packaging the content of the prediction operator to obtain the model prediction operator;
and performing automatic machine learning model prediction by using the model prediction operator.
3. The method of claim 1, wherein the automated machine learning model training with the model training operator comprises: carrying out data preprocessing operation on input training data; the data pre-processing operation includes at least one of:
the first item is used for carrying out data format conversion on the training data;
a second term that downsamples the training data;
a third item, labeling the training data as labeled data and unlabeled data;
fourthly, unifying the format of the label values in the training data;
a fifth item for automatically identifying and marking the type of each attribute information contained in the training data;
a sixth item that performs missing value padding on the training data;
the seventh item, unify the format of the initial time field in the training data, add new time field based on the unified result, and delete the initial time field;
and the eighth item is used for automatically identifying non-numerical data in the training data and carrying out hash processing on the non-numerical data.
4. The method of claim 1, wherein the automated machine learning model training with the model training operator further comprises: carrying out characteristic engineering operation on the training data after data preprocessing; the operations of the feature engineering include:
down-sampling the training data after the data preprocessing to a first preset number;
performing first feature selection on the down-sampled training data to obtain basic features;
combining the basic features to generate new combined features;
and generating a training sample according to the basic characteristic and the combined new characteristic.
5. The method of claim 4, wherein the performing a first feature selection on the downsampled training data to obtain a base feature comprises:
extracting all attribute information included in the down-sampled training data, wherein the attribute information is used for forming features;
acquiring a characteristic importance value corresponding to each attribute information;
and obtaining the basic characteristics according to the characteristic importance value.
6. The method of claim 5, wherein obtaining the base feature according to the feature importance value comprises:
sorting all the feature importance values in descending order to obtain a sorting result;
and acquiring attribute information corresponding to the feature importance values of the second preset number as basic features according to the sorting result.
7. The method of claim 4, wherein the combining the base features to generate combined new features comprises:
selecting a set feature generation rule according to the type of each attribute information contained in the training data;
and combining the basic features according to the selected feature generation rule to obtain a combined new feature.
8. An apparatus for implementing automatic machine learning, comprising:
the first providing module is used for providing a first editing interface according to the first operation of the editing model training operator;
the first acquisition module is used for acquiring training operator content input through the first editing interface; the training operator content comprises: the method comprises the steps of carrying out an operation command of data preprocessing on input training data, carrying out an operation command of feature engineering on the training data after the data preprocessing, and carrying out an operation command of machine learning model group training according to a result of the feature engineering; the machine learning model group comprises at least one machine learning model;
the first packaging module is used for packaging the training operator content to obtain the model training operator;
and the training module is used for carrying out automatic machine learning model training by utilizing the model training operator.
9. An apparatus comprising at least one computing device and at least one storage device, wherein the at least one storage device is to store instructions for controlling the at least one computing device to perform the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-7.