CN111008707A - Automatic modeling method and device and electronic equipment - Google Patents

Automatic modeling method and device and electronic equipment

Info

Publication number
CN111008707A
CN111008707A CN201911251943.0A
Authority
CN
China
Prior art keywords
data
training
machine learning
initial
learning model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911251943.0A
Other languages
Chinese (zh)
Inventor
张世健
李瀚�
王敏
乔胜传
孙越
郝玥
赵庆
周凯
桂权力
戴文渊
陈雨强
胡时伟
黄缨宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
4Paradigm Beijing Technology Co Ltd
Original Assignee
4Paradigm Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 4Paradigm Beijing Technology Co Ltd filed Critical 4Paradigm Beijing Technology Co Ltd
Priority to CN201911251943.0A priority Critical patent/CN111008707A/en
Publication of CN111008707A publication Critical patent/CN111008707A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an automatic modeling method, an automatic modeling device and electronic equipment, wherein the method comprises the following steps: acquiring and storing initial training data serving as the basis for initial model training; generating an initial training sample based on the stored initial training data, and training an initial machine learning model by using the initial training sample; obtaining a model training scheme based on the trained initial machine learning model; based on the obtained model training scheme, obtaining an updated training sample by using updated training data, and obtaining an updated machine learning model by using the updated training sample; and performing a prediction service by using a selected machine learning model.

Description

Automatic modeling method and device and electronic equipment
Technical Field
The invention relates to the field of artificial intelligence, in particular to an automatic modeling method, an automatic modeling device and electronic equipment.
Background
At present, machine learning models are mainly constructed by professional modeling personnel, either by manually writing code or by using a graphical interface. However, both approaches require high labor and time costs to obtain a satisfactory model, which hinders large-scale exploration and application of models. An automated modeling method is therefore needed.
Disclosure of Invention
An object of the embodiments of the present invention is to provide a new technical solution for automated modeling.
According to a first aspect of the present invention, there is provided an automated modeling method comprising:
acquiring and storing initial training data serving as an initial model training basis;
generating an initial training sample based on the stored initial training data, and training an initial machine learning model by using the initial training sample;
obtaining a model training scheme based on the trained initial machine learning model;
based on the obtained model training scheme, obtaining an updated training sample by using updated training data, and obtaining an updated machine learning model by using the updated training sample; and
and performing prediction service by using the selected machine learning model.
Optionally, the initial training data comprises at least one of behavioural data and feedback data, and the updated training data comprises at least one of updated behavioural data and updated feedback data.
Optionally, the acquiring and saving initial training data used as a basis for initial model training includes:
providing at least one data import path for importing behavior data or feedback data, respectively;
after the data import path is selected, uploading the behavior data table or the feedback data table;
providing a configuration interface for configuring information of the uploaded data table;
importing behavior data in a behavior data table or feedback data in a feedback data table according to configuration information input through the configuration interface;
the behavioral data or feedback data is saved.
Optionally, after uploading the behavior data table or the feedback data table, the method further includes:
data field types in the behavior data table or the feedback data table are automatically identified.
Optionally, the configuration information input through the information configuration interface relates to at least one of configuration of basic data information of the uploaded data table and data storage configuration.
Optionally, the data storage configuration of the uploaded data table relates to at least one of a configuration of a data structure of the uploaded data table, a configuration of a primary key field of the uploaded data table, and a configuration of exception data handling of the uploaded data table.
Wherein, the abnormal data processing of the uploaded data table comprises at least one of missing value filling and useless data cleaning.
Optionally, the method further comprises:
respectively providing a configuration interface for splicing the uploaded plurality of behavior data tables or the plurality of feedback data tables;
responding to the starting operation of training an initial machine learning model, and respectively acquiring at least one of a splicing main key and a relation type which are input through the configuration interface and used for splicing the uploaded behavior data tables or the uploaded feedback data tables;
and respectively splicing a plurality of behavior data tables or a plurality of feedback data tables according to the splicing main key and/or the relation type.
Optionally, the method further comprises:
respectively counting behavior data or feedback data belonging to the same data field type to obtain a statistical result;
and respectively displaying the statistical results according to the data field types.
Optionally, the generating initial training samples based on the saved initial training data and training an initial machine learning model by using the initial training samples includes:
providing a configuration interface for performing configurations related to model training;
generating an initial training sample from the stored behavior data and feedback data according to configuration information input through the configuration interface;
and training an initial machine learning model based on the initial training sample by utilizing at least one preset model training algorithm.
Optionally, the configuration information entered through the configuration interface relates to at least one of: the configuration of behavior data selection rules and the configuration of feedback data selection rules, the configuration of data splitting rules for splitting behavior data into training data and verification data, the configuration of automatic feature generation rules for extracting features of the training data and the verification data according to data field types, the configuration of model training stopping strategies, the configuration of rules for selecting the features and the configuration of automatic parameter adjustment for performing parameter adjustment according to a preset parameter adjustment mode,
the behavior data selection rule or the feedback data selection rule comprises the step of selecting a behavior data slice or a feedback data slice which is used as the basis of initial model training according to business time.
Optionally, the selecting, according to the service time, a behavior data slice or a feedback data slice as a basis for initial model training includes:
responding to a trigger operation for selecting a behavior data slice or a feedback data slice according to service time, and acquiring a time field and a slicing granularity which are used as data slicing bases;
providing a data selection interface for selecting the time field and the data slice at the slice granularity;
and obtaining, according to the selection made through the data selection interface, the behavior data slice or the feedback data slice serving as the basis of initial model training.
Optionally, the data splitting rule for splitting the behavior data into training data and verification data includes at least one of:
directly splitting the behavior data into training data and verification data according to a preset splitting ratio; or
randomly scrambling the behavior data;
splitting the scrambled behavior data into training data and verification data according to a preset splitting ratio; or
randomly selecting one behavior data record as initial training data, and randomly selecting another behavior data record as initial verification data;
selecting a certain amount of behavior data as training data by taking the initial training data as a starting point; selecting a certain amount of behavior data as verification data by taking the initial verification data as a starting point; or
acquiring a data field type serving as a grouping basis;
splitting the behavior data according to the data field type to obtain multiple groups of behavior data;
splitting each group of behavior data into candidate training data and candidate verification data according to a preset splitting ratio;
splicing the candidate training data corresponding to the multiple groups of behavior data to obtain final training data; splicing the candidate verification data corresponding to the multiple groups of behavior data to obtain final verification data; or
acquiring a script defining a data splitting rule;
and automatically splitting the behavior data into training data and verification data according to the script.
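For illustration only (the patent describes these splitting rules abstractly), the rules above can be sketched in Python under assumed data shapes; rows are tuples, the grouping basis is a field index, and the function names are hypothetical:

```python
import random
from collections import defaultdict

def ratio_split(rows, ratio=0.8):
    """Directly split rows into training and verification data by a preset ratio."""
    cut = int(len(rows) * ratio)
    return rows[:cut], rows[cut:]

def shuffle_split(rows, ratio=0.8, seed=0):
    """Randomly scramble the rows first, then split by the preset ratio."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    return ratio_split(rows, ratio)

def grouped_split(rows, key_index, ratio=0.8):
    """Split each group (keyed by a data field) separately, then splice the
    per-group parts back together so every group appears in both sets."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    train, valid = [], []
    for group_rows in groups.values():
        t, v = ratio_split(group_rows, ratio)
        train.extend(t)
        valid.extend(v)
    return train, valid
```

The script-defined rule in the last alternative would simply be a user-supplied callable with the same `(rows) -> (train, valid)` contract.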
Optionally, the automatic feature generation for feature extraction of training data and validation data by data field type comprises at least one of:
carrying out a specific numerical operation on the data field type; or
operating on a plurality of time fields included in the data field type; or
performing dimension-increasing processing on a plurality of classified fields included in the data field type; or
and directly performing automatic feature combination on the data field types.
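As an illustrative sketch only (the patent does not specify the concrete operators), each of the four feature-generation alternatives can be pictured as a small Python function; all names and the choice of operations are assumptions:

```python
from datetime import datetime

def numeric_features(value):
    """Specific numerical operations on a numeric field (example: square and halve)."""
    return {"squared": value * value, "half": value / 2}

def time_features(t1: datetime, t2: datetime):
    """Operate on a plurality of time fields, e.g. take their difference in seconds."""
    return {"delta_seconds": (t2 - t1).total_seconds()}

def one_hot(category, vocabulary):
    """Dimension-increasing (one-hot) processing of a classified field."""
    return [1 if category == v else 0 for v in vocabulary]

def cross(f1, f2):
    """Direct automatic feature combination of two field values."""
    return f"{f1}_x_{f2}"
```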
Optionally, the model training stopping strategy comprises at least one of:
stopping training when the effect of the machine learning model trained by using at least one model training algorithm reaches a set effect; or
stopping training when the training time for training the machine learning model by using at least one model training algorithm reaches a set time threshold; or
and stopping training when the number of rounds of training the machine learning model by using at least one model training algorithm reaches a set number of training rounds.
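The three stopping strategies above can be combined in one checker; this is a minimal illustrative sketch, not the patented implementation, and the class name and fields are assumptions:

```python
import time

class TrainingStopper:
    """Stop training when any configured condition is met: a target metric
    (set effect), a wall-clock budget, or a maximum number of rounds."""
    def __init__(self, target_metric=None, max_seconds=None, max_rounds=None):
        self.target_metric = target_metric
        self.max_seconds = max_seconds
        self.max_rounds = max_rounds
        self.start = time.monotonic()
        self.rounds = 0

    def should_stop(self, current_metric):
        self.rounds += 1
        if self.target_metric is not None and current_metric >= self.target_metric:
            return True   # model effect reached the set effect
        if self.max_seconds is not None and time.monotonic() - self.start >= self.max_seconds:
            return True   # training time reached the set threshold
        if self.max_rounds is not None and self.rounds >= self.max_rounds:
            return True   # round count reached the set number of rounds
        return False
```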
Optionally, the model training scheme comprises an algorithm for automatically training out the initial machine learning model and its hyper-parameters.
Optionally, the obtaining updated training samples by using the updated training data and obtaining an updated machine learning model by using the updated training samples includes:
providing a configuration interface for performing self-learning configuration with respect to the model;
and generating updated training samples according to the configuration information input through the configuration interface and the updated behavior data and the updated feedback data according to the model training scheme, and training an updated machine learning model based on the updated training samples.
Optionally, the configuration information entered through the configuration interface relates to at least one of: the method comprises the steps of configuration of a model updating period for updating a machine learning model, configuration of a model result storage position, configuration of model result naming and configuration of a task starting check rule.
Optionally, the performing a prediction service by using the selected machine learning model includes:
providing a configuration interface that performs configuration with respect to providing a prediction service using a machine learning model;
and performing online prediction service and/or batch prediction service by using the selected machine learning model according to the configuration information input through the configuration interface.
Optionally, the configuration information input through the configuration interface relates to a configuration of a model selection rule for selecting an online machine learning model from among the machine learning models and/or a configuration of application resources.
Optionally, the model selection rule for selecting the online machine learning model from among the machine learning models relates to at least one of:
arbitrarily selecting one machine learning model from the model result storage location as the online machine learning model; or
automatically selecting a machine learning model as the online machine learning model according to a model selection strategy defined by a script; or
selecting the most recently generated machine learning model as the online machine learning model; or
and selecting the best machine learning model as the online machine learning model.
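The four selection rules can be sketched as follows; this is illustrative only, and the record layout (`name`, `created_at`, `metric`) is an assumption, not the patent's storage format:

```python
def select_model(models, rule="best", script=None):
    """Select the online model from a model-result store.
    `models` is a list of dicts with 'name', 'created_at' and 'metric' keys."""
    if rule == "any":
        return models[0]                      # arbitrarily pick one
    if rule == "latest":
        return max(models, key=lambda m: m["created_at"])  # newest model
    if rule == "best":
        return max(models, key=lambda m: m["metric"])      # best-performing model
    if rule == "script" and script is not None:
        return script(models)                 # user-defined selection strategy
    raise ValueError(f"unknown rule: {rule}")
```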
Optionally, the performing a batch prediction service by using the selected machine learning model includes:
and responding to a set triggering event, and performing batch prediction service by using the selected machine learning model.
Optionally, the set triggering event includes at least one of when a new machine learning model appears in the model result storage location, when the set model configuration time expires, and after receiving a machine learning model selected by a user.
Optionally, the performing online prediction service by using the selected machine learning model according to the configuration information input through the configuration interface includes:
providing an interface for simulating a request for a forecast service to a user;
generating a predicted service request including predicted data according to a user input to the interface;
and responding to the generated prediction service request, obtaining a prediction result aiming at the prediction data by using the selected machine learning model, and displaying the prediction result to the user.
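The request/response cycle above can be pictured as a small handler; this is a hedged sketch assuming a JSON request body with a `data` field, which the patent does not specify:

```python
import json

def handle_prediction_request(request_json, model):
    """Respond to a simulated prediction-service request: parse the
    prediction data, run the selected model, and return the result."""
    request = json.loads(request_json)
    prediction = model(request["data"])
    return json.dumps({"prediction": prediction})
```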
Optionally, the configuration information entered via the configuration interface further relates to a switch state for automatic reflow of the prediction data.
According to a second aspect of the present invention, there is also provided an automated modeling apparatus, comprising:
the data acquisition module is used for acquiring and storing initial training data serving as an initial model training basis;
the initial machine learning model training module is used for generating an initial training sample based on the stored initial training data and training an initial machine learning model by using the initial training sample;
the model training scheme acquisition module is used for acquiring a model training scheme based on the trained initial machine learning model;
the updated machine learning model training module is used for obtaining an updated training sample by using updated training data based on the obtained model training scheme and obtaining an updated machine learning model by using the updated training sample; and
and the prediction service providing module is used for performing prediction service by using the selected machine learning model.
According to a third aspect of the present invention, there is also provided an electronic apparatus, comprising:
the automated modeling apparatus of the second aspect of the present invention; or
a processor and a memory for storing instructions for controlling the processor to perform the method according to the first aspect of the invention.
According to a fourth aspect of the present invention, there is also provided a computer readable storage medium, wherein a computer program is stored thereon, which when executed by a processor, performs the method according to the first aspect of the present invention.
According to the method, the device, and the electronic equipment provided by the embodiments of the invention, a model training scheme can be obtained on the basis of training the initial machine learning model, so that an updated machine learning model can be obtained according to the model training scheme. This enables regular automatic updating of the machine learning model and automatic updating of the online machine learning model, and realizes full-flow cyclic operation across data collection, model production, model application, and other processes through automated machine learning modeling, thereby greatly reducing the cost of machine learning modeling.
Drawings
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
FIG. 1 is a schematic block diagram showing a hardware configuration of an electronic device that may be used to implement an embodiment of the invention;
FIG. 2 illustrates a flow diagram of an automated modeling method of an embodiment of the present invention;
FIGS. 3-7 illustrate example interface displays of an automated modeling method according to one embodiment;
FIG. 8 illustrates a functional block diagram of an automated modeling apparatus of an embodiment of the present invention;
FIG. 9 shows a block diagram of one example of an electronic device of an embodiment of the invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
Various embodiments and examples according to embodiments of the present invention are described below with reference to the accompanying drawings.
< hardware configuration >
Fig. 1 is a block diagram showing a hardware configuration of an electronic apparatus 1000 that can implement an embodiment of the present invention.
The electronic device 1000 may be a laptop, desktop, cell phone, tablet, etc. As shown in fig. 1, the electronic device 1000 may include a processor 1100, a memory 1200, an interface device 1300, a communication device 1400, a display device 1500, an input device 1600, a speaker 1700, a microphone 1800, and the like. The processor 1100 may be a central processing unit CPU, a microprocessor MCU, or the like. The memory 1200 includes, for example, a ROM (read only memory), a RAM (random access memory), a nonvolatile memory such as a hard disk, and the like. The interface device 1300 includes, for example, a USB interface, a headphone interface, and the like. The communication device 1400 is capable of wired or wireless communication, for example, and may specifically include Wifi communication, bluetooth communication, 2G/3G/4G/5G communication, and the like. The display device 1500 is, for example, a liquid crystal display panel, a touch panel, or the like. The input device 1600 may include, for example, a touch screen, a keyboard, a somatosensory input, and the like. A user can input/output voice information through the speaker 1700 and the microphone 1800.
The electronic device shown in fig. 1 is merely illustrative and is in no way meant to limit the invention, its application, or uses. In an embodiment of the invention, the memory 1200 of the electronic device 1000 is configured to store instructions for controlling the processor 1100 to operate to perform any one of the automated modeling methods provided by the embodiments of the invention. It will be appreciated by those skilled in the art that although a plurality of means are shown for the electronic device 1000 in fig. 1, the present invention may relate to only some of the means therein, e.g. the electronic device 1000 relates to only the processor 1100 and the storage means 1200. The skilled person can design the instructions according to the disclosed solution. How the instructions control the operation of the processor is well known in the art and will not be described in detail herein.
< method examples >
In the present embodiment, an automated modeling method is provided, which may be implemented by an electronic device, which may be the electronic device 1000 shown in fig. 1.
As shown in fig. 2, the automated modeling method of the present embodiment may include the following steps S2100 to S2500:
in step S2100, initial training data serving as a basis for initial model training is acquired and saved.
The initial training data includes at least one of behavioral data and feedback data.
The behavior data constitutes the feature part of the training data. It can be imported in several modes, such as but not limited to single import, timed import, and streaming import; the user may also import files through different paths, such as but not limited to local import, database import, FTP import, HDFS import, Hive import, and Kafka import. The initially imported data defines the schema of the entire data set; when new data is subsequently imported, schema verification is performed so that only data conforming to the same schema is accepted. When imported behavior data lands, it is converted into the specific format of the corresponding data set and stored as a data slice under that data set; the same mechanism applies to feedback data.
In an embodiment of the present invention, the step of obtaining and storing initial training data as a basis for initial model training in step S2100 may further include the following steps S2110 to S2150:
step S2110, providing at least one data import path for importing behavior data or importing feedback data, respectively.
Exemplarily, in fig. 3, behavior data or feedback data can be imported through four import manners, i.e., importing locally stored data into the electronic device 1000, importing data in a database into the electronic device 1000, importing data through HDFS timing, and importing data through Kafka streaming.
It is to be understood that, although these importing manners are shown in fig. 3 for the behavior data and the feedback data, only one or more of them may be enabled according to the specific application scenario; the embodiment is not limited herein.
In step S2120, after the data import path is selected, the behavior data table or the feedback data table is uploaded.
In step S2120, after the behavior data table or the feedback data table is uploaded, the data field types in the table may be automatically identified. The automatic identification of data field types is implemented on a sampling basis; for example, 500 records are sampled from the data to identify the field types. In addition, the embodiment of the invention also supports manual adjustment and change of the data field types by the user.
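The sampling-based identification can be sketched as follows; this is illustrative only (rows as tuples, Python type names standing in for the patent's field types, falling back to a string type when the sample disagrees):

```python
import random

def infer_field_types(rows, field_names, sample_size=500, seed=0):
    """Infer each field's type from a random sample of rows (e.g. 500),
    falling back to 'str' when values in the sample disagree."""
    sample = rows if len(rows) <= sample_size else random.Random(seed).sample(rows, sample_size)
    types = {}
    for i, name in enumerate(field_names):
        seen = {type(row[i]).__name__ for row in sample}
        types[name] = seen.pop() if len(seen) == 1 else "str"
    return types
```

A user adjustment would then simply overwrite an entry in the returned mapping.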
Step S2130, providing a configuration interface for configuring information of the uploaded data table.
The configuration information input through the configuration interface relates to at least one of a configuration of data basic information of the uploaded data table and a data storage configuration.
The above configuration of the basic data information of the uploaded data table relates to the configuration of the data encoding format of the uploaded data table, the configuration of whether the imported data includes a field name, the data preview operation, and the like.
Illustratively, in fig. 3, the data encoding format is UTF-8, and the file header is in an on state as a switch of the field name. In addition, data preview can be performed on the imported data.
The above data storage configuration of the uploaded data table relates to at least one of a configuration of a data structure of the uploaded data table, a configuration of a primary key field of the uploaded data table, and a configuration of exception data handling of the uploaded data table.
In this embodiment, abnormal data includes data with abnormal types (for example, a string (string) value appearing in a field marked as integer (int) type) and data with an inconsistent column count (for example, a row containing 22 columns in a table whose standard column count is 20). An exception data handling manner is set for such data to ensure data quality during model training. The handling manner may include at least one of missing value filling and useless data cleaning. Missing values may be filled in multiple ways, such as but not limited to filling with a null value (Null), the mean, the mode, or the median. Useless data is, for example, an automatically recognized variable that has no practical meaning for model training and can therefore be automatically recognized and discarded. In addition, the handling manner may also be discarding abnormal rows or failing the import, where discarding abnormal rows means dropping problematic data rows, and failing the import means aborting the import when problematic data is encountered.
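A minimal sketch of this exception handling, under assumed shapes (rows as tuples, integer fields given by index) and with a hypothetical function name; it is illustrative only, not the patented processing pipeline:

```python
def clean_rows(rows, expected_columns, int_fields, fill_value=0):
    """Handle exception data: drop rows with an abnormal column count,
    replace non-integer values found in integer fields, and fill missing values."""
    cleaned = []
    for row in rows:
        if len(row) != expected_columns:
            continue                      # discard rows with abnormal column count
        row = list(row)
        for i in int_fields:
            try:
                row[i] = int(row[i]) if row[i] is not None else fill_value
            except (TypeError, ValueError):
                row[i] = fill_value       # string found in an int field
        cleaned.append(tuple(row))
    return cleaned
```

Swapping `fill_value` for the column mean, mode, or median would give the other filling strategies mentioned above.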
Furthermore, the above data storage configuration of the uploaded data table also involves the advanced parameter configuration of the import task. The advanced parameter settings may affect whether the data import succeeds and how efficiently it runs; for example, a default recommended setting may be provided, and custom advanced parameters are also supported.
Illustratively, in FIG. 3, the data structure settings relate to the configuration of the field name, field type, flag type, and whether a field is used, for each field in the data table. The primary key field is set to automl_id. For rows with abnormal types and for rows with an abnormal column count, discarding the abnormal row is selected. The advanced parameters are set to use the recommended configuration.
Step S2140, according to the configuration information input through the configuration interface, importing the behavior data in the behavior data table or the feedback data in the feedback data table.
In an example, in the case of a single behavior data table and a single feedback data table, after a user inputs configuration information through a configuration interface, a data import operation is performed in the configuration interface, and the electronic device 1000 imports behavior data in the behavior data table or feedback data in the feedback data table in response to the data import operation performed in the configuration interface.
In an example, in the case of multiple behavior data tables and multiple feedback data tables, after a user inputs configuration information through a configuration interface, a configuration interface for splicing the uploaded multiple behavior data tables or multiple feedback data tables is continuously provided, so that the user defines an association relationship between the multiple behavior data tables or multiple feedback data tables.
The configuration interface for splicing the uploaded behavior data tables or the uploaded feedback data tables and the configuration interface for configuring the information of the uploaded data tables may be provided on the same interface, or may be provided on different interfaces, which is not limited herein.
In this example, for example, at least one of a concatenation main key and a relationship type for concatenating the uploaded behavior data tables or the uploaded feedback data tables, which are input through the configuration interface, may be obtained in response to a start operation for training the initial machine learning model, so as to concatenate the behavior data tables or the feedback data tables according to the concatenation main key and/or the relationship type.
Specifically, a user may set a behavior data table or a feedback data table as a main table and define the relationships between the data tables relative to that main table; the relationship definition between data tables is implemented using an Entity-Relationship (E-R) diagram. Through the E-R diagram definition, the electronic device 1000 can splice a plurality of behavior data tables or a plurality of feedback data tables into one wide table through an automatic table splicing scheme for feature engineering and model training. What needs to be defined in the E-R diagram is the concatenation primary key and/or the relationship type between the behavior data tables or the feedback data tables. The relationship type is, for example but not limited to, 1:1, 1:N, N:1, or N:M.
In addition, the data table splicing not only supports the connection between the main table and the auxiliary table, but also supports the direct connection between the auxiliary table and the auxiliary table.
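The splicing described above can be sketched as a key-based join. The following is a minimal illustration (the table and field names, such as `user_id`, are hypothetical, not from the patent), assuming pandas as the table library:

```python
import pandas as pd

# Hypothetical main behavior table and secondary table, joined on the
# splicing primary key "user_id" (an N:1 relationship type).
main = pd.DataFrame({"user_id": [1, 1, 2], "amount": [10.0, 20.0, 5.0]})
aux = pd.DataFrame({"user_id": [1, 2], "gender": ["M", "F"]})

# A left join keeps every row of the main table and yields one wide table
# that feature engineering and model training can consume.
wide = main.merge(aux, on="user_id", how="left")
```

The same pattern applies to secondary-to-secondary joins: any two tables sharing a splicing primary key can be merged pairwise before the final wide table is assembled.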
In addition, before the data table splicing, an aggregation operation may be performed on the behavior data tables; the aggregation operation is, for example and without limitation, computing the sum, average, maximum, minimum, or variance of the data in the multiple behavior data tables or multiple feedback data tables. As another example, a sliding-window operation may be performed on the time-type fields in the multiple behavior data tables or multiple feedback data tables.
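A minimal sketch of both pre-splicing operations, assuming a hypothetical transaction-flow table with the `deal_time` field mentioned later in this document:

```python
import pandas as pd

# Hypothetical transaction-flow table; field names are illustrative.
flows = pd.DataFrame({
    "user_id": [1, 1, 2],
    "deal_time": pd.to_datetime(["2019-08-01", "2019-08-02", "2019-08-01"]),
    "amount": [10.0, 20.0, 5.0],
})

# Aggregation before splicing: sum / average / max / min / variance per user.
agg = flows.groupby("user_id")["amount"].agg(["sum", "mean", "max", "min", "var"])

# Sliding-window operation on the time-type field: a 2-day rolling sum.
rolled = (flows.sort_values("deal_time")
               .set_index("deal_time")
               .groupby("user_id")["amount"]
               .rolling("2D").sum())
```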
Step S2150, the behavior data or feedback data is saved.
In step S2150, two different storage manners are provided for two different cases, namely behavior data or feedback data imported for the first time and behavior data or feedback data imported subsequently. Storing the imported behavior data or feedback data in step S2150 may further include:
case 1: and performing structure extraction aiming at the behavior data or feedback data imported for the first time, and storing the behavior data or the feedback data as a first data slice under a behavior data group or a feedback data group.
Case 2: and performing structure check aiming at the subsequently imported behavior data or feedback data, and storing the behavior data or feedback data passing the check as a subsequent data slice under the behavior data group or the feedback data group.
It is understood that the embodiment of the present invention manages data in units of data groups: each data group contains data slices, and each data slice contains a batch of real data. In a special case, a data slice may consist of a single data record, and accordingly a data group may be regarded as equivalent to a data table.
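The two storage cases above (structure extraction on first import, structure check on subsequent imports) can be sketched as follows; the class and field names are illustrative, not from the patent:

```python
from dataclasses import dataclass, field


@dataclass
class DataGroup:
    """A data group holds ordered slices that all share one structure."""
    schema: tuple = None                     # extracted from the first slice
    slices: list = field(default_factory=list)

    def import_slice(self, records):
        keys = tuple(sorted(records[0]))
        if self.schema is None:
            self.schema = keys               # case 1: structure extraction
        elif keys != self.schema:
            raise ValueError("structure check failed")  # case 2: check
        self.slices.append(records)


g = DataGroup()
g.import_slice([{"user_id": 1, "amount": 10.0}])   # first import
g.import_slice([{"user_id": 2, "amount": 5.0}])    # subsequent import
```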
After acquiring and storing the behavior data and the feedback data which are used as the basis of the initial model training, entering:
step S2200 is that an initial training sample is generated based on the stored initial training data, and an initial machine learning model is trained by using the initial training sample.
The initial training samples may be generated by extracting features from the collected behavior data according to an automatic machine learning technique and then combining the corresponding feedback data; for example, each field of the behavior data may be automatically declared as a discrete or continuous feature according to the field's data type and/or the algorithm used for training the model. Preferably, the initial training samples may be generated by obtaining various features through feature extraction and feature combination on the collected behavior data and then combining the corresponding feedback data.
The initial machine learning model is a machine learning model trained based on initial training samples according to an automatic machine learning technique in a model investigation stage, and can be used for providing online prediction service in the initial stage.
In an embodiment of the present invention, the step S2200 of generating an initial training sample based on the saved initial training data, and training an initial machine learning model by using the initial training sample may further include the following steps S2210 to S2230:
step S2210, a configuration interface is provided for performing configuration with respect to model training.
Step S2220, according to the configuration information input through the configuration interface, generating an initial training sample from the stored behavior data and the feedback data.
The configuration information entered through the configuration interface relates to at least one of: a behavior data selection rule and a feedback data selection rule; a data splitting rule for splitting the behavior data into training data and verification data; an automatic feature generation rule for extracting features from the training data and verification data according to data field types; a model training stopping strategy; a rule for selecting features; and automatic parameter tuning performed according to a preset tuning mode.
The behavior data selection rule or feedback data selection rule includes selecting, according to business time, the behavior data slices or feedback data slices on which the initial model training is based. It further includes selecting all slices under the behavior data group or feedback data group, or calling behavior data slices or feedback data slices by a quantity range.
The selecting of the behavior data slice or the feedback data slice as the basis of the initial model training according to the service time may further include the following steps S2221 to S2223:
step S2221, in response to a trigger operation for selecting a behavior data slice or a feedback data slice according to the service time, obtains a time field and a slice granularity that are the basis of data slicing.
The granularity of a tile includes, but is not limited to, hours or days.
In step S2221, when the user imports behavior data or feedback data, data slicing may be performed according to a time field and a slicing granularity. For example, the time field may be set to the transaction time field deal_time and the slicing granularity to hour. When the user selects behavior data slices or feedback data slices by business time in the configuration interface for model training, the electronic device 1000, in response to that trigger operation, obtains the transaction time field deal_time serving as the basis of data slicing together with the hour granularity. When data is later consumed, it can be selected quickly according to the slices, which greatly improves efficiency and reduces performance loss.
Step S2222, a data selection interface is provided for selecting data slices under the above time field and slicing granularity.
In step S2222, continuing the example of step S2221, after the transaction time field deal_time serving as the basis of data slicing is obtained and the slicing granularity is set to hour, the data whose transaction time falls within the latest month may be selected, i.e., 24 × 30 = 720 data slices.
And step S2223, according to the behavior data slice or the feedback data slice selected through the data selection interface, the behavior data slice or the feedback data slice serving as the basis of the initial model training is obtained.
The above data splitting rule for splitting behavior data into training data and verification data includes at least one of:
1) and directly splitting the behavior data into training data and verification data according to a preset splitting ratio.
For example, the splitting ratio of training data to verification data can be set to 4:1, but other values are also possible and are not limited herein.
2) Randomly shuffling the behavior data, and splitting the shuffled behavior data into training data and verification data according to a preset splitting ratio.
For example, after randomly shuffling the behavior data, the shuffled behavior data is split into training data and verification data according to a preset splitting ratio of 4:1.
3) Randomly selecting one piece of behavior data as the initial training datum and another piece as the initial verification datum; then, starting from the initial training datum, selecting a certain amount of behavior data as training data; and starting from the initial verification datum, selecting a certain amount of behavior data as verification data.
4) Acquiring a data field type serving as a grouping basis; splitting the behavior data according to the data field type to obtain a plurality of groups of behavior data; according to a preset splitting ratio, splitting each group of behavior data into candidate training data and candidate verification data; splicing the candidate training data corresponding to the multiple groups of behavior data to obtain final training data; and splicing the candidate verification data corresponding to the multiple groups of behavior data to obtain final verification data.
For example, if the data field type is categorical, say gender (male and female), and the electronic device 1000 automatically recognizes that the male-to-female proportion is highly imbalanced, the behavior data may be divided into two groups by gender, each group split into candidate training and verification data according to the splitting ratio, and finally the corresponding training data spliced together and the corresponding verification data spliced together.
5) Acquiring a script defining a data splitting rule; and according to the script, automatically splitting the behavior data into training data and verification data.
The script may be, for example, an SQL statement.
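Rule 4) above (split per group, then splice) is the least obvious of the five; a minimal sketch, with hypothetical field names:

```python
import random


def split_grouped(rows, key, ratio=0.8, seed=0):
    """Rule 4 sketch: group rows by a field, split each group by the
    preset ratio, then concatenate the per-group train/validation parts."""
    groups = {}
    for r in rows:
        groups.setdefault(r[key], []).append(r)
    rng = random.Random(seed)
    train, valid = [], []
    for group_rows in groups.values():
        rng.shuffle(group_rows)
        cut = int(len(group_rows) * ratio)
        train += group_rows[:cut]
        valid += group_rows[cut:]
    return train, valid


# Hypothetical data: 5 male and 5 female rows, split 4:1 within each group.
data = [{"gender": g, "x": i} for i, g in enumerate("MMMMMFFFFF")]
tr, va = split_grouped(data, "gender", ratio=0.8)
```

This preserves each group's proportion in both the training and verification sets, which is the point of grouping before splitting.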
The above automatic feature generation for feature extraction of training data and validation data by data field type comprises at least one of:
1) a specific numerical operation is performed on the data field type.
The numerical operation may be, for example, taking logarithms, bucketing (binning), or linear transformation, or a combination of different operations, which is not limited herein.
2) Operations are performed on a plurality of time fields included in the data field type.
In one example, the plurality of time fields included in the data field type may be subtracted.
In this example, taking the difference of two time fields yields a time-interval feature; for example, in a marketing scenario, subtracting the time of the user's last order transaction from the marketing time gives the interval between the recommendation of goods and the user's last order.
In one example, the aggregation operation may be performed on a plurality of time fields included in the data field type.
In this example, a time field yields new aggregate features through a time window; for example, for a transaction flow table, an aggregation may be performed by transaction time to generate new features such as the number of transactions in the last month or the sum of transaction amounts in the last month.
It is understood that other operations may be performed on the plurality of time fields included in the data field type, which is not limited herein.
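The time-field subtraction above can be sketched as follows; the field names (`market_time`, `last_deal_time`) are hypothetical:

```python
import pandas as pd

# Hypothetical marketing records: when the goods were marketed vs. when
# the user last placed an order.
tx = pd.DataFrame({
    "market_time": pd.to_datetime(["2019-08-10", "2019-08-20"]),
    "last_deal_time": pd.to_datetime(["2019-08-01", "2019-08-15"]),
})

# Subtracting two time fields yields an interval feature (days since the
# user's last order at marketing time).
tx["gap_days"] = (tx["market_time"] - tx["last_deal_time"]).dt.days
```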
3) The plurality of classification fields included in the data field type are subjected to upscaling.
In one example, the classification fields may be dimension-raised by concatenation (concat). For example, field A is gender with two values (male, female), and field B is education with four values (e.g., junior high and below, senior high, undergraduate, graduate and above); a concat operation directly splices the string values of the two variables, producing categories such as "male + senior high" and "female + undergraduate". In this way, the original 6 feature dimensions of the two variables are combined into 8 features, and as the number of permutations grows, the dimension increase becomes more pronounced.
In one example, the classification field may be dimension-raised by one-hot encoding, as follows: a categorical variable can take many values; for example, native place may cover some thirty provinces, and one-hot encoding extracts each value as its own feature, converting the variable into thirty features with meanings such as "is the native place Beijing", "is the native place Anhui", "is the native place Shanghai", thereby raising the feature dimension.
4) And directly performing automatic feature combination on the data field types.
For example, different fields may be combined in various ways to form new features.
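The two dimension-raising operations above (concat splicing and one-hot encoding) can be sketched together; the field values are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"gender": ["M", "F", "M"],
                   "degree": ["high", "bachelor", "master"]})

# concat-style dimension raising: splice the string values of two fields
# into a single combined categorical feature.
df["gender_degree"] = df["gender"] + "_" + df["degree"]

# one-hot encoding: each distinct value of a field becomes its own 0/1
# feature, raising the feature dimension.
onehot = pd.get_dummies(df["degree"], prefix="degree")
```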
The above model training stopping strategy comprises at least one of:
1) and stopping training when the effect of the machine learning model trained by using at least one model training algorithm reaches a set effect.
In this embodiment, a target model AUC may be set according to the application scenario; for example, for a binary classification problem, a model AUC of 0.8 may be set as the stop condition, and training automatically stops if and only if the achieved AUC reaches or exceeds 0.8. For example, training may stop when the AUC of a machine learning model trained by at least one of the LR, GBDT, HE-TreeNet, or NN algorithms reaches or exceeds 0.8.
2) And stopping training when the training time for training the machine learning model by using at least one model training algorithm reaches a set time threshold.
The set time threshold is the maximum duration of training, which may be set according to the specific application scenario and simulation experiments; for example, if it is set to 24 hours, training automatically stops if and only if 24 hours are exceeded. For example, training may stop when the training time of a machine learning model trained by at least one of the LR, GBDT, HE-TreeNet, or NN algorithms exceeds 24 hours.
3) And stopping training when the number of training rounds reaches the set number of training rounds when the machine learning model is trained by at least one model training algorithm.
The set number of training rounds is the maximum number of rounds; the machine learning model performs one round of training with one set of hyper-parameters, and when multiple algorithms are selected for one round of model training, the maximum number of rounds refers to the total rounds across all algorithms.
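The three stopping strategies above can be combined into one training loop; this is an illustrative sketch, not the patent's implementation, and the threshold values are the examples used in the text:

```python
import time


def train_until(eval_round, target_auc=0.8, max_seconds=24 * 3600,
                max_rounds=100):
    """Stop training when the target AUC is reached, the time budget is
    exhausted, or the round budget is exhausted -- whichever comes first."""
    start, auc = time.time(), 0.0
    for rnd in range(1, max_rounds + 1):
        auc = eval_round(rnd)        # one round with one hyper-parameter set
        if auc >= target_auc:
            return "auc_reached", rnd, auc
        if time.time() - start >= max_seconds:
            return "timeout", rnd, auc
    return "max_rounds", max_rounds, auc


# Toy evaluation: AUC improves by 0.1 per round, so round 3 reaches 0.8.
reason, rnd, auc = train_until(lambda r: 0.5 + 0.1 * r)
```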
The above configuration of rules for selecting features may include setting the maximum number of feature selections, such as selecting only the features of top 20; it may also include setting a metric for the feature such as, but not limited to, setting a minimum value for the feature's importance, and setting a minimum value for the feature's gain for the tree model.
It should be noted that feature selection generally picks the features with larger influence on the business prediction target to enter the next stage. Feature selection is based on measuring the importance of a single variable to the business prediction target, in two ways. For a linear model, the AUC of each single feature against the business prediction target is examined; the larger the AUC, the more important the feature. For a tree model, the gains obtained by splitting on each single feature across the decision trees are summed as the feature's importance measure; the larger the gain, the more important the feature. In addition, user-defined strictness of feature selection is supported: the stricter the standard, the fewer features finally enter model training, which reduces running time and resource consumption but may slightly hurt the effect.
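The selection rules above (a maximum feature count plus a minimum importance threshold) reduce to ranking and filtering; a minimal sketch, with hypothetical scores:

```python
def select_features(importance, top_k=20, min_score=0.0):
    """Rank features by importance (per-feature AUC for linear models,
    summed split gain for tree models) and keep at most top_k features
    whose score is at least min_score."""
    ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, score in ranked[:top_k] if score >= min_score]


# Hypothetical importance scores for three features.
scores = {"f1": 0.9, "f2": 0.4, "f3": 0.7}
selected = select_features(scores, top_k=2)
```

Raising `min_score` or lowering `top_k` makes the selection stricter, matching the trade-off described above.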
In addition, users may customize the models used for training. For binary classification problems, these include but are not limited to the LR algorithm, GBDT algorithm, DSN algorithm, random forest binary classification algorithm, naive Bayes binary classification algorithm, SVM algorithm, and linear fractal classifier algorithm. For regression problems, they include but are not limited to the GBRT algorithm, linear regression algorithm, and linear fractal regression algorithm. For multi-classification problems, they include but are not limited to the logistic regression multi-classification algorithm. All of the above algorithms support automatic exploration of hyper-parameters. In actual operation, the automatic algorithm supports an early-stop strategy: in particular, when multiple algorithms are trained simultaneously, the algorithm that the training data suits better can be judged in advance according to a certain strategy, so that exploration of unsuitable algorithms is suspended and time and resources are spent on the more suitable ones. The early-stop strategy also applies to the hyper-parameter search process within a single algorithm.
For example, in fig. 4, a configuration entry is provided for the behavior data selection rule and the feedback data selection rule, where the user may select "all slices of the data set", "call slices by number range", or "call slices by business time range". Fig. 4 also provides: configuration of the data splitting rule for splitting behavior data into training data and verification data, where the user may select "split by ratio", "split by rule", or "sort first, then split"; configuration of the model training stopping strategy, where the user may select "manual stop", "AUC reached", "training duration reached", or "number of training rounds reached"; configuration of the rule for selecting features, where the user may select "standard feature scheme" or "enhanced feature scheme"; configuration of specific preset algorithms, where the user may preset the "GBDT algorithm" and "DSN algorithm"; and configuration of the training set proportion, which the user may set to "0.8" or the like. After this information is configured and "start training" is clicked, one automatic model training run starts; the user does not need to intervene and only waits for the run to finish. After training completes, an optimal model and other related information are generated. The optimal model is the model with the best effect, automatically selected according to the model evaluation index corresponding to the problem type, such as binary classification, regression, or multi-classification. Other related information is, for example but not limited to, at least one of:
1) feature dimension information. And displaying the characteristic dimension of the corresponding structure for each algorithm.
2) Feature importance information. Feature importance information of each feature is displayed, and the original features and the automatically generated features of the system are distinguished by different colors on a user interface.
3) Feature script information. The feature scripts automatically generated by the system can be displayed in an algorithm mode on the user interface. The feature script is used to generate a sample table used for model training. The display characteristic script is beneficial to the user to better understand the output of the automatic algorithm and is convenient for explaining and sorting the report.
4) All result information of model training. The system supports multiple forms of showing all training results. Firstly, the result trend of each training round of each algorithm is displayed as a line graph; secondly, the trend of each hyper-parameter against model effect for each algorithm is viewed on multiple coordinate axes; thirdly, the model effect indexes and hyper-parameter values of each algorithm are displayed as a list.
And step S2230, training an initial machine learning model based on the initial training sample by using at least one preset model training algorithm.
Illustratively, the directed acyclic graph (DAG graph) shown in the middle portion of FIG. 5 shows 8 nodes: the node comprises a 'feedback data' node, a 'behavior data' node, a 'data splitting' node, a 'feature engineering' node, a 'LR algorithm' node, a 'GBDT algorithm' node, a 'HE-TreeNet algorithm' node and a 'NN algorithm' node. It should be noted that fig. 5 shows 4 specific preset algorithms, but this is only an exemplary illustration, and the present disclosure does not limit the number of preset algorithms and the specific algorithms.
Referring to fig. 5, training data after the concatenation of behavioral data and feedback data may be split into training and validation sets via respective configurations at "data splitting" nodes in the DAG graph. Thereafter, via respective configurations at "feature engineering" nodes in the DAG graph, automatic feature generation may be performed on the training and validation sets, respectively, to extract at least one feature to generate a training sample. At least one round of training is respectively carried out on the four preset algorithms by using training samples at four nodes (namely an LR algorithm node, a GBDT algorithm node, an HE-TreeNet algorithm node and an NN algorithm node) corresponding to the lowest layer in the DAG graph, and then a plurality of corresponding machine learning models are trained.
After generating an initial training sample based on the stored behavior data and feedback data and training out an initial machine learning model by using the initial training sample, entering:
and step S2300, obtaining a model training scheme based on the trained initial machine learning model.
The model training scheme includes an algorithm for automatically training out an initial machine learning model and its hyper-parameters.
The algorithm includes, but is not limited to, any one of the above LR algorithm, GBDT algorithm, HE-TreeNet algorithm, NN algorithm, and the like.
The hyper-parameters may include model hyper-parameters and training hyper-parameters.
The above model hyper-parameters are hyper-parameters for defining a model, such as but not limited to activation functions (e.g., identity function, sigmoid function, and truncated ramp function), number of hidden layer nodes, number of convolutional layer channels, and number of fully-connected layer nodes.
The above training hyper-parameters are hyper-parameters for defining the model training process, such as, but not limited to, learning rate, batch size, and number of iterations.
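A model training scheme as described above (algorithm plus its model and training hyper-parameters) can be pictured as a small configuration record; the concrete keys and values below are hypothetical, not from the patent:

```python
# Hypothetical shape of a recorded model-training scheme: the algorithm
# that produced the initial machine learning model, plus its model
# hyper-parameters and training hyper-parameters.
scheme = {
    "algorithm": "GBDT",
    "model_hyperparams": {"num_trees": 200, "max_depth": 6},
    "training_hyperparams": {"learning_rate": 0.05,
                             "batch_size": 1024,
                             "iterations": 50},
}
```

Recording the scheme this way is what allows step S2400 to re-run the same feature generation and training procedure on updated data without re-exploring the search space.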
After obtaining a model training scheme based on the trained initial machine learning model, entering:
and step S2400, based on the obtained model training scheme, obtaining an updated training sample by using the updated training data, and obtaining an updated machine learning model by using the updated training sample.
The updated training data includes at least one of updated behavioral data and updated feedback data.
The updated behavioral data and the updated feedback data are the latest input data.
The updated training sample is a data sample with a true outcome; in other words, it may be generated by extracting features from the collected updated behavior data according to the feature-generation procedure defined in the model training scheme and then combining the corresponding feedback data (as the label). Preferably, the updated training sample may be generated by obtaining various features through feature extraction and feature combination on the collected updated behavior data and then combining the corresponding feedback data.
In this embodiment, updating the machine learning model with the updated training samples may be regarded as self-learning of the machine learning model. The self-learning task supports both single immediate start and timed operation, and timed operation includes a single-run mode and a cyclic mode: a single run requires specifying the run time, while a cyclic run supports specifying a run interval and can be implemented by writing a crontab expression.
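For reference, a crontab expression consists of five fields — minute, hour, day-of-month, month, day-of-week; the sketch below shows a hypothetical entry that would run a self-learning task every day at 02:00 (the expression itself is illustrative, not from the patent):

```python
# Hypothetical crontab expression for a cyclic self-learning run:
# "minute hour day-of-month month day-of-week" -> daily at 02:00.
cron_expr = "0 2 * * *"
minute, hour, dom, month, dow = cron_expr.split()
```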
In an embodiment of the present invention, the obtaining of updated training samples using the updated training data and the obtaining of updated machine learning model using the updated training samples in step S2400 may further include steps S2410 to S2420 as follows:
step S2410, providing a configuration interface for performing configuration with respect to model self-learning.
And step S2420, generating updated training samples according to the updated behavior data and the updated feedback data according to the model training scheme according to the configuration information input through the configuration interface, and training an updated machine learning model based on the updated training samples.
The configuration information entered through the configuration interface relates to at least one of: the method comprises the steps of configuration of a model updating period for updating a machine learning model, configuration of a model result storage position, configuration of model result naming, task timeout duration and configuration of task starting check rules.
For model result storage locations, for example, machine learning models can be stored in a model center built into the electronic device 1000, which can also enable a user to view model-related interpretations and reports.
For the naming of the model result, setting a name prefix is supported, and subsequently generated models can be named by automatically appending a character string to the prefix; for example, with the prefix set to automl, a subsequently generated model result may be named automl20190807123.
The task start check rules include, and select by default, the check rules built into the electronic device 1000, which include but are not limited to the correctness of the schema (e.g., whether the schema meets the specification, such as field integrity), the data structure, the number of data slices, the connection of the data source, and the status of the cluster modules. Each check corresponds to a detection task; for example, the cluster-module check examines the cluster status, and if the cluster status is abnormal, the check fails. Start conditions for self-learning or batch estimation tasks can also be set by user-defined rules, such as requiring a certain number of data slices before running. A user-defined rule can be implemented by writing a Python script, in which the user defines under which conditions the self-learning task may run and under which it may not.
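A user-defined start-check script of the kind described above might look like the following; the function name, context fields, and thresholds are all hypothetical:

```python
def can_start(ctx):
    """Hypothetical user-defined start-check rule: allow the self-learning
    task to run only when enough new data slices have arrived and the
    cluster status check passes."""
    return ctx["new_slices"] >= 24 and ctx["cluster_status"] == "ok"


ready = can_start({"new_slices": 30, "cluster_status": "ok"})
blocked = can_start({"new_slices": 3, "cluster_status": "ok"})
```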
The task timeout duration may be, for example, the longest running duration of the task, and if the task is not stopped when the timeout duration is reached, the task is automatically suspended, so that a certain task can be prevented from occupying resources for a long time.
In addition, the electronic device 1000 supports centralized management of the self-learning tasks, and may perform statistics of information such as running time, running times, number of models, and abnormal information statistics of the tasks. For a single task, the method supports the operation details and the log information of the task. Meanwhile, breakpoint continuous running is supported for the task which fails to run.
After obtaining an updated training sample by using the updated behavior data and the updated feedback data based on the obtained model training scheme and obtaining an updated machine learning model by using the updated training sample, entering:
and step S2500, performing prediction service by using the selected machine learning model.
In an embodiment of the present invention, the predicting service using the selected machine learning model in step S2500 may further include steps S2510 to S2520:
in step S2510, a configuration interface is provided for configuring the prediction service provided by the machine learning model.
Step S2520, performing an online prediction service and/or a batch prediction service using the selected machine learning model according to the configuration information input through the configuration interface.
Illustratively, in FIG. 6, an "online forecast" button corresponding to an online forecast service and a "batch forecast" button corresponding to a batch forecast service are provided, respectively.
Configuration of model selection rules for selecting an online machine learning model from among the machine learning models and/or configuration of application resources is related to according to configuration information input through a configuration interface.
The above model selection rule for selecting an online machine learning model from among machine learning models relates to at least one of:
1) one machine learning model is arbitrarily selected from the model result storage location as the online machine learning model.
Such as, but not limited to, selecting an initial machine learning model or an updated machine learning model as the online machine learning model.
2) And automatically selecting a machine learning model as an online machine learning model according to a model selection strategy defined by the script.
The model selection policy may be, for example, a policy written in Python, C, or Java, or in another language, which is not limited herein.
3) The newly generated machine learning model is selected as the online machine learning model.
4) And selecting the best machine learning model as the online machine learning model.
For example, the machine learning model with the highest AUC may be selected as the online machine learning model.
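Rule 4) with the AUC criterion above reduces to an arg-max over candidate models; a minimal sketch, with hypothetical candidates:

```python
def pick_online_model(models):
    """Rule 4 sketch: choose the candidate with the highest AUC as the
    online machine learning model."""
    return max(models, key=lambda m: m["auc"])


# Hypothetical candidates from the model result storage location.
candidates = [{"name": "initial", "auc": 0.78},
              {"name": "automl20190807123", "auc": 0.84}]
best = pick_online_model(candidates)
```

Rule 3) (newest model wins) is the same pattern with the key swapped from AUC to creation time.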
The configuration of application resources refers to how system resources are configured when the selected online machine learning model is applied; for example, the resources allocated to the online estimation service may be configured by rules combined with data volume, or set dynamically according to the requested traffic, where the resources include but are not limited to the number of service instances, CPU, graphics processing unit (GPU), memory usage, and FPGA.
For batch pre-estimation, the configuration information input through the configuration interface further relates to at least one of the following items: the method comprises the steps of task starting verification rule configuration, estimation result storage position configuration, estimation result naming and task timeout duration. In addition, the electronic device 1000 supports not only centralized management of batch estimation tasks, for example, statistics of information such as running time, running times, number of models, and the like, but also supports abnormal information statistics of tasks, for example, data abnormality, cluster abnormality, estimation task abnormality, estimation result abnormality, and statistics of abnormal number in each category. For a single task, the method also supports the operation details and the log information of the task. Meanwhile, breakpoint continuous running is supported for the task which fails to run.
The task-start check rules include, and select by default, the check rules built into the electronic device 1000, which include but are not limited to the correctness of the schema (for example, whether the schema meets the specification, such as field integrity), the data structure, the number of data slices, the connection to the data source, and the status of the cluster modules. The start conditions of a self-learning or batch prediction task may also be set through custom rules, such as how many data slices must be present before the task runs. A custom rule can be implemented as a Python script in which the user defines under which conditions the self-learning task may run and under which conditions it may not.
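A user-defined check script of the kind described above might look like the following sketch. The check names and the `context` fields are assumptions; the platform in the text would evaluate something like `can_start(context)` before launching a self-learning or batch prediction task.

```python
# Hypothetical user-defined task-start check. All field names in `context`
# are illustrative assumptions mirroring the built-in checks listed in the
# text (schema correctness, cluster status, number of data slices).

def can_start(context):
    """Return True only when every built-in and custom check passes."""
    checks = [
        context.get("schema_valid", False),     # schema meets the specification
        context.get("cluster_healthy", False),  # cluster modules are up
        context.get("data_slices", 0) >= context.get("min_slices", 1),
    ]
    return all(checks)

ctx = {"schema_valid": True, "cluster_healthy": True,
       "data_slices": 7, "min_slices": 3}
```

Failing any single check (for example, too few data slices) blocks the task from starting, which matches the "under which conditions the task may run" semantics of the custom rule.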
The task timeout duration is, for example, the longest allowed running duration of the task. If the task has not stopped when the timeout duration is reached, it is automatically suspended, which prevents any single task from occupying resources for a long time.
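The timeout rule reduces to a simple comparison, sketched below; the parameter names are illustrative assumptions.

```python
# Minimal sketch of the timeout rule: a task that exceeds its maximum
# running duration is suspended so it cannot hold resources indefinitely.
import time

def should_suspend(started_at, timeout_seconds, now=None):
    """True once the task has run longer than its configured timeout."""
    now = time.time() if now is None else now
    return (now - started_at) > timeout_seconds
```

A scheduler would poll this periodically and suspend any task for which it returns `True`.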
For the naming of the prediction result, setting a name prefix is supported; when a prediction result is subsequently generated, it can be named by automatically appending a character string to the prefix.
For the prediction-result storage location, the result may, for example, be stored in a data set built into the electronic device 1000; the storage location of the result can also be specified and viewed.
In an example, the step S2520 of performing a batch prediction service using the selected machine learning model may further include: performing the batch prediction service with the selected machine learning model in response to a set trigger event.
The set trigger event includes at least one of: a new machine learning model appearing in the model result storage location, the set model configuration time expiring, and a user-selected machine learning model being received.
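The trigger condition above is a disjunction over the three events, as in this sketch; the event names are assumptions mirroring the text.

```python
# Sketch of the batch-prediction trigger: the service starts when any one
# of the three configured events fires. Event names are illustrative.

TRIGGERS = {"new_model_in_storage",
            "model_config_time_expired",
            "user_selected_model"}

def should_run_batch_prediction(events):
    """Return True if at least one configured trigger event occurred."""
    return bool(TRIGGERS & set(events))
```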
In one example, the online prediction service using the selected machine learning model according to the configuration information input through the configuration interface in step S2520 may further include the following steps S2521 to S2523:
in step S2521, an interface for simulating the request prediction service is provided to the user.
Step S2522, a prediction service request including prediction data is generated according to the user's input to the interface.
Step S2523, in response to the generated prediction service request, obtains a prediction result for the prediction data using the selected machine learning model, and presents the prediction result to the user.
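Steps S2521 to S2523 can be sketched as a request builder and a handler. The request shape and the stand-in predictor below are assumptions; the real system would call the selected machine learning model.

```python
# Hypothetical sketch of steps S2521-S2523: build a prediction-service
# request from the user's form input, then answer it with a prediction.

def build_request(form_values):
    """Step S2522: wrap user-entered field values into a service request."""
    return {"type": "predict", "data": dict(form_values)}

def handle_request(request, predictor):
    """Step S2523: run the model on the request data, return the result."""
    return {"prediction": predictor(request["data"])}

# Stand-in "model": the score is just the number of non-empty fields.
toy_predictor = lambda data: sum(1 for v in data.values() if v != "")

req = build_request({"age": "35", "city": "Beijing", "income": ""})
res = handle_request(req, toy_predictor)
```

The interface of step S2521 corresponds to whatever UI collects `form_values`; everything downstream is independent of how that form is rendered.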
In the online prediction service, as shown in fig. 7, inputting a single piece of data for a prediction test is supported: the user manually fills in the value of each field in each data table and then clicks "test" to simulate an online request; the model is called to make a prediction, and the predicted result is shown to the user.
In this embodiment, the configuration information input through the configuration interface further relates to the on/off state of automatic reflow of prediction data. When this state is on, the prediction data included in the prediction service request is stored in the corresponding behavior data group, so that automatic reflow of behavior data is achieved, providing the data source needed for the continuous cycling of the automated machine learning process.
According to the method provided by the embodiment of the invention, on the one hand, a model training scheme can be obtained on the basis of training an initial machine learning model, so that an updated machine learning model can be obtained according to the scheme; periodic automatic updating of the machine learning model, and automatic updating of the online machine learning model, can therefore be realized. Moreover, the method adapts to different structured-data scenarios, supporting, for example, binary classification, multi-class classification, regression, and clustering problems.
On the other hand, through automated modeling of the machine learning model, a full-flow cycle is realized across data collection, model production, model application, and other processes, and features can be generated and selected automatically, greatly reducing the cost and time of machine learning modeling. Different feature-processing strategies are provided, so that good results can be obtained in both non-time-series and time-series scenarios.
In a third aspect, the user can produce a machine learning model through the automation capability by completing only a few simple operations such as data import, data definition, and model selection, which greatly lowers the threshold for deploying machine learning applications.
In one embodiment, after acquiring the behavior data and the feedback data as the basis for the initial model training, the automated modeling method may further include the following steps S2210 to S2220:
Step S2210: compute statistics separately over the behavior data or feedback data belonging to the same data field type to obtain statistical results.
In step S2210, after the behavior data or feedback data serving as the basis of initial model training is obtained, statistics may be computed over it automatically. For example, for the behavior data or feedback data belonging to the same data field type, the mean, null count, null ratio, variance, standard deviation, minimum, maximum, lower quartile, median, upper quartile, number of unique values, mode, and frequency of the mode may be computed.
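A subset of these per-field statistics can be computed with the standard library, as in the sketch below; the column layout (a list of values with `None` for nulls) is an assumption.

```python
# Sketch of step S2210: summary statistics for one field (column) of
# behavior or feedback data. Covers a subset of the statistics listed in
# the text; `None` entries stand for null values.
import statistics

def summarize(values):
    """Compute a few of the listed statistics for a single data field."""
    present = [v for v in values if v is not None]
    nulls = len(values) - len(present)
    return {
        "null_count": nulls,
        "null_ratio": nulls / len(values),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "unique_count": len(set(present)),
        "mode": statistics.mode(present),
    }

stats = summarize([3, 1, None, 3, 2])
```

Quartiles, variance, and mode frequency would follow the same pattern (`statistics.quantiles`, `statistics.pvariance`, and a counter over `present`).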
And step S2220, respectively displaying the statistical results according to the data field types.
In step S2220, the statistical results may be visually displayed in different forms according to the data field type; for example, they may be displayed as a pie chart, a bar chart, a scatter plot, and the like, depending on the type of the data field.
< apparatus embodiment >
In this embodiment, an automated modeling apparatus 8000 is provided, as shown in fig. 8, including a data obtaining module 8100, an initial machine learning model training module 8200, a model training scheme obtaining module 8300, an updated machine learning model training module 8400, and a prediction service providing module 8500.
The data acquisition module 8100 is configured to acquire and store initial training data serving as a basis for initial model training.
The initial machine learning model training module 8200 is configured to generate an initial training sample based on the stored initial training data, and train an initial machine learning model using the initial training sample.
The model training scheme obtaining module 8300 is configured to obtain a model training scheme based on the trained initial machine learning model.
The updated machine learning model training module 8400 is configured to obtain an updated training sample by using the updated training data based on the obtained model training scheme, and obtain an updated machine learning model by using the updated training sample.
The prediction service providing module 8500 is configured to perform prediction service using the selected machine learning model.
In one embodiment, the initial training data comprises at least one of behavior data and feedback data, and the updated training data comprises at least one of updated behavior data and updated feedback data.
In an embodiment, the data obtaining module 8100 is further configured to provide at least one data import path for importing behavior data or importing feedback data, respectively; after the data import path is selected, uploading the behavior data table or the feedback data table; providing a configuration interface for configuring information of the uploaded data table; importing behavior data in a behavior data table or feedback data in a feedback data table according to configuration information input through the configuration interface; the behavioral data or feedback data is saved.
In one embodiment, the data obtaining module 8100 is further configured to automatically identify a data field type in the behavior data table or the feedback data table.
In one embodiment, the configuration information entered through the information configuration interface relates to at least one of a configuration of data base information of the uploaded data table and a data storage configuration.
In one embodiment, the data storage configuration of the uploaded data table relates to at least one of a configuration of a data structure of the uploaded data table, a configuration of a primary key field of the uploaded data table, and a configuration of exception data handling of the uploaded data table.
Wherein, the abnormal data processing of the uploaded data table comprises at least one of missing value filling and useless data cleaning.
In an embodiment, the data obtaining module 8100 is further configured to provide configuration interfaces for splicing the uploaded behavior data tables or the uploaded feedback data tables, respectively; and, in response to a start operation of training the initial machine learning model, respectively acquire at least one of a splicing main key and a relation type, input through the configuration interface, for splicing the uploaded behavior data tables or the uploaded feedback data tables.
The multiple behavior data tables or the multiple feedback data tables are then respectively spliced according to the splicing main key and/or the relation type.
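Splicing two uploaded tables on a main key is essentially a join. The sketch below uses a plain dictionary join over lists of row dicts; the table contents and key name are illustrative assumptions, and in practice this could equally be a SQL or DataFrame merge.

```python
# Hypothetical sketch: inner-join two uploaded data tables on the
# configured splicing main key. Row contents are illustrative.

def splice(left_rows, right_rows, key):
    """Inner-join two lists of row dicts on the given main key."""
    index = {r[key]: r for r in right_rows}
    joined = []
    for row in left_rows:
        match = index.get(row[key])
        if match is not None:
            merged = dict(row)   # keep the left table's fields
            merged.update(match) # add the right table's fields
            joined.append(merged)
    return joined

behavior = [{"user_id": 1, "clicks": 5}, {"user_id": 2, "clicks": 3}]
feedback = [{"user_id": 1, "label": 1}]
result = splice(behavior, feedback, "user_id")
```

The "relation type" mentioned in the text (e.g. one-to-one versus one-to-many) would determine whether the index may hold a single row or a list of rows per key.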
In one embodiment, the data obtaining module 8100 is further configured to separately count behavior data or feedback data belonging to the same data field type, and obtain a statistical result; and respectively displaying the statistical results according to the data field types.
In one embodiment, the initial machine learning model training module 8200 is also used for providing a configuration interface for configuration of model training; generating an initial training sample from the stored behavior data and feedback data according to configuration information input through the configuration interface; and training an initial machine learning model based on the initial training sample by utilizing at least one preset model training algorithm.
In one embodiment, the configuration information entered through the configuration interface relates to at least one of: the method comprises the steps of configuring a behavior data selection rule and a feedback data selection rule, configuring a data splitting rule for splitting the behavior data into training data and verification data, configuring an automatic feature generation rule for extracting features of the training data and the verification data according to data field types, configuring a model training stopping strategy, configuring a rule for selecting the features, and configuring automatic parameter adjustment for performing parameter adjustment according to a preset parameter adjustment mode.
The behavior data selection rule or the feedback data selection rule comprises the step of selecting a behavior data slice or a feedback data slice which is used as the basis of initial model training according to business time.
In one embodiment, the initial machine learning model training module 8200 is further configured to obtain a time field and a slicing granularity which are used as a basis for data slicing in response to a trigger operation for selecting a behavior data slice or a feedback data slice according to a service time; providing a data selection interface for selecting the time field and the data slice at the slice granularity; and according to the behavior data slice or the feedback data slice selected through the data selection interface, the behavior data slice or the feedback data slice serving as the basis of initial model training is obtained.
In one embodiment, the data splitting rule for splitting behavior data into training data and verification data comprises at least one of:
directly splitting the behavior data into training data and verification data according to a preset splitting ratio; or,
randomly shuffling the behavior data;
splitting the shuffled behavior data into training data and verification data according to a preset splitting ratio; or,
randomly selecting one piece of behavior data as initial training data, and randomly selecting another piece of behavior data as initial verification data;
selecting a certain amount of behavior data as training data, starting from the initial training data; selecting a certain amount of behavior data as verification data, starting from the initial verification data; or,
acquiring a data field type serving as a grouping basis;
splitting the behavior data according to the data field type to obtain multiple groups of behavior data;
splitting each group of behavior data into candidate training data and candidate verification data according to a preset splitting ratio;
splicing the candidate training data corresponding to the multiple groups of behavior data to obtain final training data; splicing the candidate verification data corresponding to the multiple groups of behavior data to obtain final verification data; or,
acquiring a script defining a data splitting rule;
and automatically splitting the behavior data into training data and verification data according to the script.
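The first two splitting rules above can be sketched as follows; the split ratio and the fixed shuffle seed are illustrative assumptions.

```python
# Sketch of two of the listed splitting rules: a direct in-order split at
# a preset ratio, and a shuffle-then-split variant.
import random

def ratio_split(rows, train_ratio):
    """Rule 1: split in order at the preset ratio."""
    cut = int(len(rows) * train_ratio)
    return rows[:cut], rows[cut:]

def shuffled_split(rows, train_ratio, seed=0):
    """Rule 2: shuffle randomly first, then split at the preset ratio."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    return ratio_split(shuffled, train_ratio)

train, valid = ratio_split(list(range(10)), 0.8)
```

The grouped variant would apply `ratio_split` within each group and concatenate the per-group pieces; the script-defined variant would delegate the whole decision to user code.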
In one embodiment, the automatic feature generation for feature extraction of training data and validation data by data field type comprises at least one of:
performing a specific numerical operation on the data field type; or,
operating on a plurality of time fields included in the data field type; or,
performing dimension-increasing processing on a plurality of categorical fields included in the data field type; or,
directly performing automatic feature combination on the data field types.
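Two of these rules are sketched below: a numerical operation over numeric fields, and an operation over two time fields. All field names are illustrative assumptions.

```python
# Sketch of automatic feature generation by data field type. The field
# names ("clicks", "impressions", "first_seen", "last_seen") are
# illustrative assumptions, not fields defined by the method itself.
from datetime import datetime

def numeric_feature(row):
    """Specific numerical operation: ratio of two numeric fields."""
    return row["clicks"] / max(row["impressions"], 1)

def time_gap_feature(row):
    """Operation over a plurality of time fields: gap in hours."""
    t1 = datetime.fromisoformat(row["first_seen"])
    t2 = datetime.fromisoformat(row["last_seen"])
    return (t2 - t1).total_seconds() / 3600

row = {"clicks": 3, "impressions": 30,
       "first_seen": "2019-12-09T00:00:00",
       "last_seen": "2019-12-09T06:00:00"}
```

Dimension-increasing processing of categorical fields would correspond to, for example, one-hot encoding, and automatic feature combination to crossing pairs of fields.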
In one embodiment, the model training stopping strategy comprises at least one of:
stopping training when the effect of the machine learning model trained by using at least one model training algorithm reaches a set effect; or,
stopping training when the training time for training the machine learning model by using at least one model training algorithm reaches a set time threshold; or,
stopping training when the number of training rounds for training the machine learning model by using at least one model training algorithm reaches a set number of training rounds.
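Combined so that training halts when any one condition is met, the three stopping strategies reduce to the following sketch; the threshold values are illustrative assumptions.

```python
# Sketch of the three stopping strategies above, combined disjunctively:
# stop on reaching the effect target, the time budget, or the round budget.
# The default thresholds are illustrative assumptions.

def should_stop(metric, elapsed_seconds, rounds,
                target_metric=0.80, max_seconds=3600, max_rounds=100):
    """True once any of the three stopping conditions is satisfied."""
    return (metric >= target_metric
            or elapsed_seconds >= max_seconds
            or rounds >= max_rounds)
```

A training loop would evaluate this after each round and exit as soon as it returns `True`.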
In one embodiment, the model training scheme includes an algorithm for automatically training out the initial machine learning model and its hyper-parameters.
In one embodiment, the updated machine learning model training module 8400 is further configured to provide a configuration interface for configuration with respect to model self-learning; and generate updated training samples from the updated behavior data and the updated feedback data according to the configuration information input through the configuration interface and the model training scheme, and train an updated machine learning model based on the updated training samples.
In one embodiment, the configuration information entered through the configuration interface relates to at least one of: the method comprises the steps of configuration of a model updating period for updating a machine learning model, configuration of a model result storage position, configuration of model result naming and configuration of a task starting check rule.
In one embodiment, the prediction service providing module 8500 is further configured to provide a configuration interface for configuring a prediction service provided by a machine learning model; and performing online prediction service and/or batch prediction service by using the selected machine learning model according to the configuration information input through the configuration interface.
In one embodiment, the configuration information input through the configuration interface relates to a configuration of a model selection rule for selecting an online machine learning model from among the machine learning models and/or a configuration of application resources.
In one embodiment, the model selection rule for selecting the online machine learning model from among the machine learning models relates to at least one of:
arbitrarily selecting one machine learning model from the model result storage location as the online machine learning model; or,
automatically selecting a machine learning model as the online machine learning model according to a model selection policy defined by a script; or,
selecting the newly generated machine learning model as the online machine learning model; or,
selecting the best machine learning model as the online machine learning model.
In one embodiment, the prediction service providing module 8500 is further configured to perform a batch prediction service using the selected machine learning model in response to a set trigger event.
In one embodiment, the set trigger event includes at least one of: a new machine learning model appearing in the model result storage location, the set model configuration time expiring, and a user-selected machine learning model being received.
In one embodiment, the forecast service providing module 8500 is further configured to provide an interface for simulating a request for a forecast service to a user; generating a predicted service request including predicted data according to a user input to the interface; and responding to the generated prediction service request, obtaining a prediction result aiming at the prediction data by using the selected machine learning model, and displaying the prediction result to the user.
In one embodiment, the configuration information entered through the configuration interface also relates to a switch state that predicts automatic backflow of data.
< electronic apparatus >
In this embodiment, an electronic device 9000 is also provided. The electronic device 9000 may be the electronic device 1000 shown in fig. 1.
In one aspect, the electronic device 9000 can comprise the aforementioned automated modeling apparatus 8000 for implementing the automated modeling method of any of the embodiments of the present invention.
In another aspect, as shown in fig. 9, the electronic device 9000 can further comprise a processor 9100 and a memory 9200, the memory 9200 for storing executable instructions; the processor 9100 is configured, under control of the instructions, to cause the electronic device 9000 to perform the automated modeling method according to any embodiment of the present invention.
In this embodiment, the electronic device 9000 may be a mobile phone, a tablet computer, a palm computer, a desktop computer, a notebook computer, a workstation, a game console, or the like.
< computer-readable storage Medium >
In this embodiment, there is also provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements an automated modeling method according to any of the embodiments of the present invention.
The present invention may be an apparatus, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), with state information of computer-readable program instructions, which can execute the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, by software, and by a combination of software and hardware are equivalent.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (10)

1. An automated modeling method, comprising:
acquiring and storing initial training data serving as an initial model training basis;
generating an initial training sample based on the stored initial training data, and training an initial machine learning model by using the initial training sample;
obtaining a model training scheme based on the trained initial machine learning model;
based on the obtained model training scheme, obtaining an updated training sample by using updated training data, and obtaining an updated machine learning model by using the updated training sample; and,
performing prediction service by using the selected machine learning model.
2. The method of claim 1, wherein the initial training data comprises at least one of behavioral data and feedback data, and the updated training data comprises at least one of updated behavioral data and updated feedback data.
3. The method of claim 2, wherein the obtaining and saving initial training data on which initial model training is based comprises:
providing at least one data import path for importing behavior data or importing feedback data, respectively;
after the data import path is selected, uploading the behavior data table or the feedback data table;
providing a configuration interface for configuring information of the uploaded data table;
importing behavior data in a behavior data table or feedback data in a feedback data table according to configuration information input through the configuration interface;
and saving the behavior data or feedback data.
4. The method of claim 3, wherein the method further comprises, after uploading the behavior data table or the feedback data table:
automatically identifying data field types in the behavior data table or the feedback data table.
5. The method of claim 3, wherein the configuration information entered through the information configuration interface relates to at least one of a configuration of data base information and a data storage configuration of the uploaded data table.
6. The method of claim 5, wherein the data storage configuration of the uploaded data table relates to at least one of a configuration of a data structure of the uploaded data table, a configuration of a primary key field of the uploaded data table, and a configuration of exception data handling of the uploaded data table.
Wherein, the abnormal data processing of the uploaded data table comprises at least one of missing value filling and useless data cleaning.
7. The method of claim 3, wherein the method further comprises:
respectively providing a configuration interface for splicing the uploaded plurality of behavior data tables or the plurality of feedback data tables;
responding to the starting operation of training an initial machine learning model, and respectively acquiring at least one of a splicing main key and a relation type which are input through the configuration interface and used for splicing the uploaded behavior data tables or the uploaded feedback data tables;
and respectively splicing a plurality of behavior data tables or a plurality of feedback data tables according to the splicing main key and/or the relation type.
8. An automated modeling apparatus, comprising:
a data acquisition module used for acquiring and storing initial training data serving as a basis for initial model training;
an initial machine learning model training module used for generating an initial training sample based on the stored initial training data and training an initial machine learning model with the initial training sample;
a model training scheme acquisition module used for acquiring a model training scheme based on the trained initial machine learning model;
an updated machine learning model training module used for obtaining an updated training sample from updated training data based on the acquired model training scheme and obtaining an updated machine learning model with the updated training sample; and
a prediction service providing module used for providing a prediction service with a selected machine learning model.
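The five claimed modules form a pipeline: acquire data, train an initial model, derive a training scheme from it, retrain on updated data under that scheme, and serve predictions. The following skeleton is an assumed illustration of that flow (class and method names, and the toy mean-value model, are not from the patent):

```python
class MeanModel:
    """Toy stand-in for a trained machine learning model."""
    def __init__(self, samples, scheme):
        self.mean = sum(samples) / len(samples)
        self.scheme = scheme or {"stat": "mean"}
    def predict(self, x):
        return self.mean

def mean_trainer(samples, scheme):
    return MeanModel(samples, scheme)

class AutoModelingPipeline:
    def __init__(self, trainer):
        self.trainer = trainer  # callable: (samples, scheme) -> model
        self.scheme = None
        self.model = None

    def acquire(self, initial_data):
        # Data acquisition module: acquire and store initial training data.
        self.initial_data = list(initial_data)
        return self

    def train_initial(self):
        # Initial model training module: train on the stored initial data.
        self.model = self.trainer(self.initial_data, scheme=None)
        # Scheme acquisition module: keep the settings found in this run.
        self.scheme = getattr(self.model, "scheme", {})
        return self

    def train_updated(self, updated_data):
        # Updated model training module: retrain under the acquired scheme.
        self.model = self.trainer(list(updated_data), scheme=self.scheme)
        return self

    def predict(self, x):
        # Prediction service providing module.
        return self.model.predict(x)
```

In practice the trainer would be an AutoML search and the scheme would carry selected features and hyperparameters rather than a single statistic.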
9. An electronic device, comprising:
the automated modeling apparatus of claim 8; or,
a processor and a memory, the memory storing instructions for controlling the processor to perform the method of any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when executed by a processor, implements the method according to any one of claims 1 to 7.
CN201911251943.0A 2019-12-09 2019-12-09 Automatic modeling method and device and electronic equipment Pending CN111008707A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911251943.0A CN111008707A (en) 2019-12-09 2019-12-09 Automatic modeling method and device and electronic equipment


Publications (1)

Publication Number Publication Date
CN111008707A true CN111008707A (en) 2020-04-14

Family

ID=70114134

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911251943.0A Pending CN111008707A (en) 2019-12-09 2019-12-09 Automatic modeling method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN111008707A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111611240A (en) * 2020-04-17 2020-09-01 第四范式(北京)技术有限公司 Method, apparatus and device for executing automatic machine learning process
CN111523676A (en) * 2020-04-17 2020-08-11 第四范式(北京)技术有限公司 Method and device for assisting machine learning model to be online
CN111523676B (en) * 2020-04-17 2024-04-12 第四范式(北京)技术有限公司 Method and device for assisting machine learning model to be online
WO2021208685A1 (en) * 2020-04-17 2021-10-21 第四范式(北京)技术有限公司 Method and apparatus for executing automatic machine learning process, and device
CN113673707A (en) * 2020-05-15 2021-11-19 第四范式(北京)技术有限公司 Method and device for learning by applying machine, electronic equipment and storage medium
CN111815066A (en) * 2020-07-21 2020-10-23 上海数鸣人工智能科技有限公司 User click prediction method based on gradient lifting decision tree
CN111815066B (en) * 2020-07-21 2021-03-26 上海数鸣人工智能科技有限公司 User click prediction method based on gradient lifting decision tree
CN112036492A (en) * 2020-09-01 2020-12-04 腾讯科技(深圳)有限公司 Sample set processing method, device, equipment and storage medium
CN112036492B (en) * 2020-09-01 2024-02-02 腾讯科技(深圳)有限公司 Sample set processing method, device, equipment and storage medium
CN112149838A (en) * 2020-09-03 2020-12-29 第四范式(北京)技术有限公司 Method, device, electronic equipment and storage medium for realizing automatic model building
WO2022063274A1 (en) * 2020-09-27 2022-03-31 中兴通讯股份有限公司 Data annotation method and system, and electronic device
CN116933896A (en) * 2023-09-15 2023-10-24 上海燧原智能科技有限公司 Super-parameter determination and semantic conversion method, device, equipment and medium
CN116933896B (en) * 2023-09-15 2023-12-15 上海燧原智能科技有限公司 Super-parameter determination and semantic conversion method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN111008707A (en) Automatic modeling method and device and electronic equipment
CN109791642B (en) Automatic generation of workflow
US11294754B2 (en) System and method for contextual event sequence analysis
CN110570217B (en) Cheating detection method and device
CN109299344A (en) The generation method of order models, the sort method of search result, device and equipment
US20170109667A1 (en) Automaton-Based Identification of Executions of a Business Process
CN110956272A (en) Method and system for realizing data processing
US11385898B2 (en) Task orchestration method for data processing, orchestrator, device and readable storage medium
CN110941467A (en) Data processing method, device and system
CN111611240A (en) Method, apparatus and device for executing automatic machine learning process
CN111405030B (en) Message pushing method and device, electronic equipment and storage medium
CN110363427A (en) Model quality evaluation method and apparatus
CN111460384A (en) Policy evaluation method, device and equipment
US20170109638A1 (en) Ensemble-Based Identification of Executions of a Business Process
US20180336459A1 (en) Unstructured key definitions for optimal performance
CN110880128A (en) Abnormal information mining method, device and system and terminal equipment
CN109542737A (en) Platform alert processing method, device, electronic device and storage medium
CN111723515A (en) Method, device and system for operating operator
CN113569162A (en) Data processing method, device, equipment and storage medium
CN111291082B (en) Data aggregation processing method, device, equipment and storage medium
CN112115313A (en) Regular expression generation method, regular expression data extraction method, regular expression generation device, regular expression data extraction device, regular expression equipment and regular expression data extraction medium
US11924018B2 (en) System for decomposing events and unstructured data
CN107430590B (en) System and method for data comparison
US20220382608A1 (en) Methods and systems for determining stopping point
CN113934894A (en) Data display method based on index tree and terminal equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination