CN110222710B

CN110222710B - Data processing method, device and storage medium

Info

Publication number: CN110222710B
Application number: CN201910361278.4A
Authority: CN
Inventors: 马纯
Original assignee: Beijing Shenyan Intelligent Technology Co ltd
Current assignee: Beijing Shenyan Intelligent Technology Co ltd
Priority date: 2019-04-30
Filing date: 2019-04-30
Publication date: 2022-03-08
Anticipated expiration: 2039-04-30
Also published as: CN110222710A

Abstract

The application discloses a data processing method, a data processing device and a storage medium, and belongs to the field of data processing. The method comprises the following steps. First, a sample data set corresponding to a target modeling type is selected from a plurality of stored sample data sets. At least one feature dimension is then selected from the plurality of feature dimensions, and a target training model is selected from a plurality of training models stored for the target modeling type. Next, the target training model is trained according to the data, in the at least one feature dimension, of each sample included in the selected sample data set. Finally, an extended data set is determined according to the model obtained after training, a target expansion multiple and optimization target information. When a target training model needs to be trained, it can be selected directly from the plurality of training models, without an operator having to write different code for each training model, so that the training process is simplified and the process of determining the extended data set is more efficient.

Description

Data processing method, device and storage medium

Technical Field

The present application relates to the field of data processing, and in particular, to a data processing method, apparatus, and storage medium.

Background

Machine learning techniques are commonly used for data mining: a model to be trained is trained on an acquired sample data set, and other data are then mined with the model obtained after training. For example, in the field of marketing, a model can be trained on an acquired marketing sample data set and then used to mine other data, which makes it easier to formulate a marketing strategy. At present, the whole machine learning process is mainly implemented by operators writing code. Operators therefore need a certain coding background, which makes the whole machine learning process difficult to carry out.

Disclosure of Invention

The embodiments of the present application provide a data processing method, a data processing device and a storage medium, which can solve the problem in the related art that implementing the whole machine learning process requires operators to have a certain coding background and is therefore difficult to carry out. The technical solution is as follows:

in a first aspect, a data processing method is provided, the method including:

selecting a sample data set corresponding to a target modeling type from a plurality of stored sample data sets, wherein the selected sample data set comprises a plurality of samples, and each sample comprises data of a plurality of characteristic dimensions;

selecting at least one feature dimension from the plurality of feature dimensions, selecting a target training model from a plurality of training models stored for the target modeling type;

training the target training model according to the data, in the at least one feature dimension, of each sample included in the selected sample data set;

determining an extended data set according to the model obtained after training, a target expansion multiple and optimization target information, wherein the ratio of the number of samples included in the extended data set to the number of samples included in the selected sample data set is the target expansion multiple, and the optimization target information refers to a matching index between the extended data set and the selected sample data set.

Optionally, before the training of the target training model according to the data, in the at least one feature dimension, of each sample included in the selected sample data set, the method further includes:

displaying a parameter setting interface, wherein the parameter setting interface comprises at least one parameter editing box;

and acquiring at least one parameter set for the target training model from the at least one parameter editing box.

Optionally, after determining the extended data set according to the model obtained after training and the target expansion multiple and the optimization target information, the method further includes:

displaying an evaluation result, wherein the evaluation result is used for evaluating an extended data set determined by the model obtained after training;

and adjusting at least one parameter included in the target training model according to the evaluation result.

Optionally, the method further comprises:

in the training process of the target training model, displaying a training flow chart of the target training model, wherein the training flow chart comprises a plurality of training nodes, the display mode of each training node is a first display mode, a second display mode or a third display mode, the first display mode is used for indicating that the corresponding training node is finished, the second display mode is used for indicating that the corresponding training node is in the process of being trained, and the third display mode is used for indicating that the corresponding training node is not reached.

Optionally, the method further comprises:

deploying a policy for each sample included in the extended data set.

In a second aspect, there is provided a data processing apparatus, the apparatus comprising:

a first selection module, configured to select a sample data set corresponding to a target modeling type from a plurality of stored sample data sets, wherein the selected sample data set comprises a plurality of samples, and each sample comprises data of a plurality of feature dimensions;

a second selection module for selecting at least one feature dimension from the plurality of feature dimensions, selecting a target training model from a plurality of training models stored for the target modeling type;

a training module, configured to train the target training model according to data of each sample included in the selected sample data set in the at least one feature dimension;

and the determining module is used for determining an extended data set according to a model obtained after training and according to a target expansion multiple and optimization target information, wherein the ratio of the number of samples included in the extended data set to the number of samples included in the selected sample data set is the target expansion multiple, and the optimization target information refers to a matching index between the extended data set and the selected sample data set.

Optionally, the apparatus further comprises:

a first display module, configured to display a parameter setting interface, wherein the parameter setting interface comprises at least one parameter editing box;

and an obtaining module, configured to obtain, from the at least one parameter editing box, at least one parameter set for the target training model.

Optionally, the apparatus further comprises:

the second display module is used for displaying an evaluation result, and the evaluation result is used for evaluating the extended data set determined by the model obtained after training;

and the adjusting module is used for adjusting at least one parameter included in the target training model according to the evaluation result.

Optionally, the apparatus further comprises:

a third display module, configured to display a training flow chart of the target training model during the training of the target training model, where the training flow chart includes a plurality of training nodes, the display mode of each training node is a first display mode, a second display mode, or a third display mode, the first display mode indicates that the corresponding training node is completed, the second display mode indicates that training is currently at the corresponding training node, and the third display mode indicates that the corresponding training node has not been reached.

Optionally, the apparatus further comprises:

a deployment module to deploy policies to each sample included in the extended dataset.

In a third aspect, a data processing apparatus is provided, the apparatus comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to perform the steps of any of the methods of the first aspect described above.

In a fourth aspect, a computer-readable storage medium is provided, having instructions stored thereon, which when executed by a processor, implement the steps of any of the methods of the first aspect described above.

In a fifth aspect, there is provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the steps of any of the methods of the first aspect described above.

The technical scheme provided by the embodiment of the application can at least bring the following beneficial effects:

In the embodiments of the present application, a sample data set corresponding to a target modeling type is selected from a plurality of stored sample data sets, and the selected sample data set includes a plurality of samples. Because the samples in the selected sample data set include data of a plurality of feature dimensions, at least one feature dimension can be selected from the plurality of feature dimensions, and a target training model can be selected from a plurality of training models stored for the target modeling type. The target training model is then trained according to the data, in the at least one feature dimension, of each sample included in the selected sample data set, and an extended data set is determined according to the model obtained after training, a target expansion multiple and optimization target information. Because there is a matching index between the extended data set and the selected sample data set, the extended data set is similar to the selected sample data set. Because the plurality of training models for the target modeling type are stored in the computer device in advance, when one of them needs to be trained, it can be selected directly from the plurality of training models as the target training model and then trained on the data described above. Furthermore, the target training model may be trained multiple times, or different training models may be trained as the target training model. The data processing method provided by the embodiments of the present application therefore requires neither that operators write different code for each training model nor that they have a coding background, so the training process of the target training model is simplified, and the process of determining the extended data set is simpler, more convenient and more efficient.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application.

Fig. 2 is a flowchart of a first data processing method according to an embodiment of the present application.

Fig. 3 is a flowchart of a second data processing method according to an embodiment of the present application.

Fig. 4 is a schematic view of a first interface provided in an embodiment of the present application.

Fig. 5 is a schematic view of a second interface provided in an embodiment of the present application.

Fig. 6 is a schematic view of a third interface provided in an embodiment of the present application.

Fig. 7 is a schematic view of a fourth interface provided in an embodiment of the present application.

Fig. 8 is a schematic diagram of a fifth interface provided in an embodiment of the present application.

Fig. 9 is a schematic view of a sixth interface provided in an embodiment of the present application.

Fig. 10 is a schematic diagram of a seventh interface provided in an embodiment of the present application.

Fig. 11 is a schematic diagram of an eighth interface provided in an embodiment of the present application.

Fig. 12 is a schematic diagram of a ninth interface provided in an embodiment of the present application.

Fig. 13 is a schematic diagram of a tenth interface provided in the embodiment of the present application.

Fig. 14 is a block diagram of a data processing apparatus according to an embodiment of the present application.

Fig. 15 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with aspects of the present application.

Before explaining the embodiments of the present application in detail, an implementation environment of the embodiments of the present application is described:

Fig. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application. Referring to Fig. 1, the implementation environment includes a computer device 100, and the computer device 100 includes an output device 110, an input device 120, a business logic module 130, a lifecycle management module 140, and an algorithm scheduling module 150.

The output device 110 may be in communication with the business logic module 130, and the output device 110 may be used to display a plurality of interfaces. The output device 110 may be a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display device, a Cathode Ray Tube (CRT) display device, a projector (projector), or the like.

The input device 120 may be in communication with the business logic module 130, and the input device 120 may receive user input in a variety of ways. For example, the input device 120 may be a mouse, a keyboard, a touch screen device, or a sensing device, among others.

The business logic module 130 may receive user operation information, generate a task for training a target training model, and write the task to the lifecycle management module 140.

Lifecycle management module 140 includes a plurality of Application Programming Interfaces (APIs), and lifecycle management module 140 may store tasks written by business logic module 130 to train the target training model. Further, the codes of the plurality of training models, the parameters of the plurality of training models, and the evaluation result may be stored in the code library in the life cycle management module 140 by the API of the life cycle management module 140, or the codes of the plurality of training models, the parameters of the plurality of training models, and the evaluation result stored in the code library in the life cycle management module 140 may be deleted by the API of the life cycle management module 140. The lifecycle management module 140 may implement a plurality of functions through a Machine Learning flow (MLflow) application.

Algorithm scheduling module 150 may schedule and execute tasks in lifecycle management module 140 in a certain order and time of execution, and may add new tasks to lifecycle management module 140 or delete historical tasks in lifecycle management module 140 through an API of lifecycle management module 140.
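The interaction described above, in which tasks are written to a store, added or deleted through an interface, and executed in a certain order, can be sketched roughly as follows. This is only an illustrative sketch: the names `Task`, `TaskStore`, `run_at` and the method names are hypothetical and are not the API of the lifecycle management module or of MLflow.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(order=True)
class Task:
    run_at: float                                   # hypothetical execution time
    task_id: str = field(compare=False)             # hypothetical identifier
    action: Callable[[], str] = field(compare=False)


class TaskStore:
    """Hypothetical stand-in for the task storage of a lifecycle management module."""

    def __init__(self) -> None:
        self._tasks: Dict[str, Task] = {}

    def add(self, task: Task) -> None:
        # Corresponds to writing or adding a new task.
        self._tasks[task.task_id] = task

    def delete(self, task_id: str) -> None:
        # Corresponds to deleting a historical task.
        self._tasks.pop(task_id, None)

    def run_in_order(self) -> List[str]:
        # Execute the stored tasks sorted by execution time, then clear them.
        results = [task.action() for task in sorted(self._tasks.values())]
        self._tasks.clear()
        return results


store = TaskStore()
store.add(Task(2.0, "train-b", lambda: "trained model B"))
store.add(Task(1.0, "train-a", lambda: "trained model A"))
store.add(Task(3.0, "old-task", lambda: "stale"))
store.delete("old-task")
results = store.run_in_order()
```

Here the task with the earlier `run_at` value runs first, and the deleted task is never executed.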

Computer device 100 may be a general purpose computer device or a special purpose computer device. In a specific implementation, the computer device 100 may be a desktop computer, a laptop computer, a web server, a Personal Digital Assistant (PDA), a mobile phone, a tablet computer, a wireless terminal device, a communication device, or an embedded device, and the embodiment of the present application does not limit the type of the computer device.

Fig. 2 is a flowchart of a data processing method provided in an embodiment of the present application, where the method is applied to a computer device. The computer device may be the computer device 100 shown in Fig. 1. Referring to Fig. 2, the method includes:

Step 201: a sample data set corresponding to the target modeling type is selected from a plurality of stored sample data sets, wherein the selected sample data set comprises a plurality of samples, and each sample comprises data of a plurality of feature dimensions.

Step 202: at least one feature dimension is selected from a plurality of feature dimensions, and a target training model is selected from a plurality of training models stored for a target modeling type.

Step 203: the target training model is trained according to the data, in the at least one feature dimension, of each sample included in the selected sample data set.

Step 204: an extended data set is determined according to the model obtained after training, the target expansion multiple and the optimization target information, wherein the ratio of the number of samples included in the extended data set to the number of samples included in the selected sample data set is the target expansion multiple, and the optimization target information refers to a matching index between the extended data set and the selected sample data set.
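Steps 201 to 204 can be sketched as a small pipeline. The sketch below is a toy illustration under strong assumptions and is not the method of the embodiment: the "training model" is simply the centroid of the seed samples, the matching index is a negative squared distance, candidate samples are assumed to come from a separate pool, and all function and variable names are hypothetical.

```python
from typing import Dict, List, Sequence

Sample = Dict[str, float]


def project(sample: Sample, dims: Sequence[str]) -> List[float]:
    """Keep only the selected feature dimensions (missing values become 0.0)."""
    return [sample.get(d, 0.0) for d in dims]


def train_centroid(seed: List[Sample], dims: Sequence[str]) -> List[float]:
    """Toy 'training': average the seed samples on the selected dimensions."""
    rows = [project(s, dims) for s in seed]
    return [sum(col) / len(rows) for col in zip(*rows)]


def match_score(model: List[float], sample: Sample, dims: Sequence[str]) -> float:
    """Toy matching index: negative squared distance to the centroid."""
    return -sum((m - x) ** 2 for m, x in zip(model, project(sample, dims)))


def expand(model: List[float], seed: List[Sample], pool: List[Sample],
           dims: Sequence[str], multiple: int) -> List[Sample]:
    """Return the `multiple * len(seed)` best-matching candidates from the pool."""
    ranked = sorted(pool, key=lambda s: match_score(model, s, dims), reverse=True)
    return ranked[: multiple * len(seed)]


# Step 201: the selected sample data set (already encoded as numbers).
seed = [{"gender": 0.0, "age_19_30": 1.0}, {"gender": 0.0, "age_19_30": 1.0}]
# Assumed candidate pool from which the extended data set is drawn.
pool = [
    {"id": 1.0, "gender": 0.0, "age_19_30": 1.0},
    {"id": 2.0, "gender": 1.0, "age_19_30": 0.0},
    {"id": 3.0, "gender": 0.0, "age_19_30": 0.0},
    {"id": 4.0, "gender": 1.0, "age_19_30": 1.0},
    {"id": 5.0, "gender": 0.0, "age_19_30": 1.0},
]
dims = ["gender", "age_19_30"]                            # Step 202: selected dimensions
model = train_centroid(seed, dims)                        # Step 203: training
extended = expand(model, seed, pool, dims, multiple=2)    # Step 204: expansion
```

With a target expansion multiple of 2 and 2 seed samples, the extended data set contains 4 samples, and the candidate least similar to the seed samples is left out.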

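Optionally, before training the target training model according to the data, in the at least one feature dimension, of each sample included in the selected sample data set, the method further includes:

displaying a parameter setting interface, wherein the parameter setting interface comprises at least one parameter editing box;

and acquiring at least one parameter set for the target training model from the at least one parameter editing box.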

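Optionally, after determining the extended data set according to the model obtained after training, the target expansion multiple and the optimization target information, the method further includes:

displaying an evaluation result, wherein the evaluation result is used for evaluating the extended data set determined by the model obtained after training;

and adjusting at least one parameter included in the target training model according to the evaluation result.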

Optionally, the method further comprises:

displaying, during the training of the target training model, a training flow chart of the target training model, where the training flow chart comprises a plurality of training nodes, the display mode of each training node is a first display mode, a second display mode or a third display mode, the first display mode indicates that the corresponding training node is completed, the second display mode indicates that training is currently at the corresponding training node, and the third display mode indicates that the corresponding training node has not been reached.
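The three display modes can be modelled as an enumeration that is assigned to each training node from the position of the node currently being executed. This is a minimal sketch; the names `NodeDisplay` and `display_modes` and the node names are hypothetical, since the source does not list the training nodes.

```python
from enum import Enum
from typing import List, Tuple


class NodeDisplay(Enum):
    COMPLETED = "first display mode"      # node is finished
    IN_PROGRESS = "second display mode"   # training is currently at this node
    NOT_REACHED = "third display mode"    # node has not been reached yet


def display_modes(nodes: List[str], current_index: int) -> List[Tuple[str, NodeDisplay]]:
    """Assign a display mode to each node given the index of the current node."""
    modes = []
    for i, name in enumerate(nodes):
        if i < current_index:
            mode = NodeDisplay.COMPLETED
        elif i == current_index:
            mode = NodeDisplay.IN_PROGRESS
        else:
            mode = NodeDisplay.NOT_REACHED
        modes.append((name, mode))
    return modes


# Hypothetical node names, for illustration only.
flow = display_modes(["load data", "fit model", "evaluate"], current_index=1)
```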

Optionally, the method further comprises:

deploying a policy for each sample included in the extended data set.

All the above optional technical solutions can be combined arbitrarily to form an optional embodiment of the present application, and the present application embodiment is not described in detail again.

Fig. 3 is a flowchart of a data processing method provided in an embodiment of the present application, where the method is applied to a computer device. Referring to fig. 3, the method includes:

Step 301: a sample data set corresponding to the target modeling type is selected from a plurality of stored sample data sets, wherein the selected sample data set comprises a plurality of samples, and each sample comprises data of a plurality of feature dimensions.

It should be noted that the target modeling type may be a consumption tendency modeling type, a crowd extension modeling type, a potential customer assessment modeling type, a user churn prediction modeling type, or a crowd clustering modeling type. The selected sample data set usually includes a large number of samples, for example 10000 or 20000 samples. Each sample in the selected sample data set may include a sample identifier that uniquely indicates that sample. For example, the plurality of samples may be a plurality of users, in which case the sample identifier of each sample may be the user account of the corresponding user. In addition, a feature dimension may be age, gender, education level, hobby, region, purchasing tendency, or the like.

In a possible case, for any two different samples among the plurality of samples in the selected sample data set, the feature dimensions they include may be completely the same, partially the same, or completely different. When two different samples share a feature dimension, their data in that feature dimension may be the same or different.

For example, referring to Table 1, sample 1, sample 2, sample 3, sample 4, and sample 5 in Table 1 are 5 of the plurality of samples included in the selected sample data set. As can be seen from Table 1, sample 1 includes the feature dimensions gender and age, and the data of these two feature dimensions are female and 20 years old, respectively; sample 2 includes the feature dimensions gender and age, and the data are female and 20 years old, respectively; sample 3 includes the feature dimensions gender and age, and the data are male and 30 years old, respectively; sample 4 includes the feature dimensions education level and hobby, and the data are undergraduate and travel, respectively; sample 5 includes the feature dimensions gender and education level, and the data are male and undergraduate, respectively.

That is, sample 1, sample 2, and sample 3 include the same 2 feature dimensions, namely gender and age. The data of these 2 feature dimensions are the same for sample 1 and sample 2, and different for sample 1 and sample 3. In addition, the 2 feature dimensions included in sample 1 are entirely different from those included in sample 4. Sample 1 and sample 5 share one feature dimension, gender, but their data in that feature dimension are different. Furthermore, sample 4 and sample 5 share one feature dimension, education level, and their data in that feature dimension are the same.

TABLE 1

	Gender	Age	Education level	Hobby
Sample 1	Female	20
Sample 2	Female	20
Sample 3	Male	30
Sample 4			Undergraduate	Travel
Sample 5	Male		Undergraduate
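The comparison of samples in Table 1 can be reproduced with a short sketch in which each sample stores only the feature dimensions it actually includes. The key names, the value labels and the `compare` helper are illustrative assumptions.

```python
samples = {
    "Sample 1": {"gender": "female", "age": 20},
    "Sample 2": {"gender": "female", "age": 20},
    "Sample 3": {"gender": "male", "age": 30},
    "Sample 4": {"education": "undergraduate", "hobby": "travel"},
    "Sample 5": {"gender": "male", "education": "undergraduate"},
}


def compare(a: str, b: str) -> dict:
    """Map each feature dimension shared by samples a and b to whether their data match."""
    shared = samples[a].keys() & samples[b].keys()
    return {dim: samples[a][dim] == samples[b][dim] for dim in shared}


same_dims_same_data = compare("Sample 1", "Sample 2")
same_dims_diff_data = compare("Sample 1", "Sample 3")
no_shared_dims = compare("Sample 1", "Sample 4")
one_shared_diff = compare("Sample 1", "Sample 5")
one_shared_same = compare("Sample 4", "Sample 5")
```

An empty result means the two samples have no feature dimension in common, as is the case for sample 1 and sample 4.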

It should be noted that the data of the plurality of feature dimensions included in each sample may be obtained from a sample database. Specifically, when each sample includes a sample identifier, the data of the plurality of feature dimensions included in each sample may be acquired from the sample database according to the sample identifier of that sample. The data of the plurality of feature dimensions included in each sample may be represented as binary numbers. Taking the feature dimension of gender as an example, the gender data can be represented by 1 or 0: female can be represented as 0, and male as 1. For feature dimensions with more data types, such as age and education level, a plurality of feature dimension units may first be determined for the feature dimension. If the data of the feature dimension falls within a certain feature dimension unit, the data of that unit is represented as 1 and the data of the other units as 0, which together represent the data of the feature dimension. Taking the feature dimension of age as an example, age can be divided into a plurality of feature dimension units, such as 0-18 years, 19-30 years, 31-40 years, 41-50 years, and the like. If the age is 20 years, the data of the 19-30 years unit can be represented as 1 and the data of the other units as 0, which represents the age data. Of course, the data of the plurality of feature dimensions included in each sample may also be represented in other ways, which is not limited in this application.
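The binary representation described above corresponds to what is commonly called one-hot encoding. A minimal sketch follows, assuming only the four age units named in the text (the text allows for more) and using hypothetical function names:

```python
from typing import List

# Age units named in the text; further units could be appended.
AGE_UNITS = [(0, 18), (19, 30), (31, 40), (41, 50)]


def encode_gender(gender: str) -> int:
    """Female -> 0, male -> 1, following the example in the text."""
    return {"female": 0, "male": 1}[gender]


def encode_age(age: int) -> List[int]:
    """Set the unit containing `age` to 1 and all other units to 0."""
    return [1 if low <= age <= high else 0 for low, high in AGE_UNITS]


# A 20-year-old female: one gender bit followed by four age-unit bits.
encoded = [encode_gender("female")] + encode_age(20)
```

In this sketch an age outside the listed units is encoded as all zeros; a fuller implementation would add units to cover the whole range.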

In one possible implementation, the computer device may display a modeling type selection interface that includes a plurality of modeling types. When a selection operation on any one of the modeling types is detected, that modeling type may be determined as the target modeling type. At this point, the computer device may display a data set selection interface that includes a plurality of sample data sets. When a selection operation on any one of the sample data sets is detected, that sample data set can be determined as the sample data set corresponding to the target modeling type. If the plurality of sample data sets on the data set selection interface do not include the desired sample data set, the desired sample data set can be added to the computer device, so that the added sample data set is displayed on the data set selection interface. Then, when a selection operation on the added sample data set is detected, the added sample data set can be determined as the sample data set corresponding to the target modeling type.

For example, one modeling type generally represents one scenario, so the modeling type selection interface may also be referred to as a scenario selection interface. Referring to Fig. 4 and Fig. 5, Fig. 4 shows a modeling type selection interface that includes a plurality of modeling types: a consumption tendency modeling type, a crowd extension modeling type, a potential customer assessment modeling type, a user churn prediction modeling type, and a crowd clustering modeling type. When a selection operation on the crowd extension modeling type is detected, that is, when a click operation on the "enter" option of that modeling type on the modeling type selection interface is detected, the computer device may display the data set selection interface shown in Fig. 5. The data set selection interface includes 2 sample data sets: "click crowd of a certain brand's audience in the second quarter of 2018" and "click crowd of a certain brand's audience in the first quarter of 2018". Then, when a selection operation on "click crowd of a certain brand's audience in the second quarter of 2018" is detected, that is, when a click operation on the "select" option corresponding to that sample data set is detected, the computer device may determine "click crowd of a certain brand's audience in the second quarter of 2018" as the sample data set corresponding to the crowd extension modeling type.

If the 2 sample data sets on the data set selection interface do not include the desired sample data set, the desired sample data set can be added to the computer device, and information such as the name and path of the added sample data set is determined. Then, when a click operation on the "add sample data set" option on the interface is detected, an add popup window is displayed on the interface. The add popup window includes an edit box for the sample data set. When an edit operation on the edit box is detected, the computer device may determine whether the information obtained after editing, such as the name and path of the sample data set, is consistent with the name and path of the added sample data set; if so, the sample data set added to the computer device can be displayed on the data set selection interface. At this point, when a selection operation on the added sample data set is detected, that is, when a click operation on the "select" option corresponding to the added sample data set is detected, the computer device may determine the added sample data set as the sample data set corresponding to the crowd extension modeling type.

Typically, after the computer device determines a modeling type as the target modeling type, the data set selection interface may be displayed directly. For example, when a click operation on an "enter" option on the modeling type selection interface shown in Fig. 4 is detected, the modeling type to which the "enter" option belongs may be directly determined as the target modeling type, and the data set selection interface shown in Fig. 5 is then displayed directly. Alternatively, after the computer device determines a modeling type as the target modeling type, the data set selection interface may be displayed when a selection operation on the data set selection tab is detected. For example, when a selection operation on the data set selection tab on the left side of the modeling type selection interface shown in Fig. 4 is detected, the data set selection interface shown in Fig. 5 is displayed.

Step 302: at least one feature dimension is selected from a plurality of feature dimensions, and a target training model is selected from a plurality of training models stored for a target modeling type.

In general, after the computer device determines the sample data set corresponding to the target modeling type, a feature dimension selection interface may be displayed. The feature dimension selection interface includes a plurality of feature dimensions, which are the feature dimensions of the sample data set corresponding to the target modeling type. When a selection operation on one or more of these feature dimensions is detected, the selected feature dimensions may be determined as the at least one feature dimension selected from the plurality of feature dimensions.

For example, referring to Fig. 6, the feature dimension selection interface shown in Fig. 6 includes a plurality of feature dimensions: gender, age, education level, and marital status. When a selection operation on gender, age, and education level is detected, that is, when the check boxes in front of gender, age, and education level are selected, gender, age, and education level may be determined as the selected at least one feature dimension.

In general, after the computer device determines the sample data set corresponding to the target modeling type, the feature dimension selection interface may be displayed directly. For example, when a click operation on a "select" option on the data set selection interface shown in fig. 5 is detected, the data set to which the "select" option belongs may be directly determined as the data set corresponding to the target modeling type, and the feature dimension selection interface shown in fig. 6 is then displayed directly. Alternatively, after the computer device determines the sample data set corresponding to the target modeling type, the feature dimension selection interface may be displayed only when a selection operation on the feature dimension selection tab is detected. For example, when a selection operation on the feature dimension selection tab on the left side of the data set selection interface shown in fig. 5 is detected, the feature dimension selection interface shown in fig. 6 is displayed.

It should be noted that different modeling types may correspond to different training models. For example, when the modeling type is a crowd extension modeling type, the plurality of training models stored for the crowd extension modeling type may include a difference index enhancement model, a one-class Support Vector Machine (One Class SVM) model, a two-layer Convolutional Neural Network (CNN (2layers)) model, and the like. When the modeling type is a consumption tendency modeling type, the plurality of training models stored for the consumption tendency modeling type may include a hidden shopping mixture model, a hidden circular mixture model, and the like. When the modeling type is a crowd clustering modeling type, the plurality of training models stored for the crowd clustering modeling type may include a K-means clustering model and the like. Accordingly, one training model may be selected from the plurality of training models stored for the target modeling type as the target training model. Additionally, the plurality of training models may be stored in the life cycle management module 140 in the computer device 100 described in FIG. 1.

In one possible implementation, a computer device may display a model selection interface including a plurality of training models stored for a target modeling type. When a selection operation of any one of the training models is detected, the any one of the training models may be determined as a target training model.

Illustratively, referring to fig. 7, 3 training models are included in the model selection interface shown in fig. 7, and the 3 training models are respectively: a difference index enhancement model, a One Class SVM model, and a CNN (2layers) model. When a selection operation of any one of the 3 training models is detected, the any one training model may be determined as a target training model. That is, when the selection operation of any one of the selection boxes before the 3 training models is detected, any one of the training models may be determined as the target training model.

Typically, after the computer device determines the at least one feature dimension, the model selection interface may be displayed directly. For example, when a selection operation on a selection box in front of at least one feature dimension in fig. 6 is detected, the feature dimensions checked in the selection boxes may be determined as the at least one feature dimension, and the model selection interface shown in fig. 7 is then displayed directly. Alternatively, after the checked feature dimensions are determined as the at least one feature dimension, the model selection interface may be displayed when a selection operation on the model selection tab is detected, or when a selection operation on a "next" option in the feature dimension selection interface is detected.

It should be noted that, in practical applications, when the plurality of training models stored for the target modeling type do not include the required target training model, the operator may further add the target training model by storing the code of the target training model in the computer device, so that the target training model may be selected from the plurality of training models stored for the target modeling type. Or, if there is an unnecessary training model in a plurality of training models stored for the target modeling type, the operator may delete the code of the unnecessary training model in the computer device, thereby deleting the unnecessary training model. That is, in the embodiment of the present application, a training model for a target modeling type may be added or deleted according to a use requirement.

Illustratively, taking the implementation environment shown in fig. 1 as an example, when a required target training model is not included in the plurality of training models stored for the target modeling type, an operator may store the code of the target training model into the code library of the life cycle management module 140 through the API of the life cycle management module 140 in the computer device 100 to implement addition of the target training model. Alternatively, if there is an unnecessary training model among a plurality of training models stored for a target modeling type, an operator may delete the code of the unnecessary training model stored in the code library of the life cycle management module 140 through the API of the life cycle management module 140 in the computer apparatus 100, thereby implementing deletion of the unnecessary training model.
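The addition and deletion of training models described above can be pictured as operations on a registry keyed by modeling type. The following is a minimal sketch only: the class `TrainingModelRegistry`, its method names, and the placeholder model factories are hypothetical and are not the API of the life cycle management module 140, which the application does not specify.

```python
class TrainingModelRegistry:
    """Hypothetical code library that groups training models by modeling type."""

    def __init__(self):
        # modeling type -> {model name -> factory that builds the model}
        self._models = {}

    def add(self, modeling_type, name, factory):
        """Store the code (here: a factory callable) of a training model."""
        self._models.setdefault(modeling_type, {})[name] = factory

    def delete(self, modeling_type, name):
        """Remove an unnecessary training model; raises KeyError if absent."""
        del self._models[modeling_type][name]

    def available(self, modeling_type):
        """Names that a model selection interface could list for this type."""
        return sorted(self._models.get(modeling_type, {}))

    def select(self, modeling_type, name):
        """Build the chosen model so it can serve as the target training model."""
        return self._models[modeling_type][name]()


registry = TrainingModelRegistry()
# Placeholder factories; a real entry would construct an actual model.
registry.add("crowd extension", "One Class SVM", lambda: {"name": "One Class SVM"})
registry.add("crowd extension", "CNN (2layers)", lambda: {"name": "CNN (2layers)"})
registry.delete("crowd extension", "CNN (2layers)")
target_training_model = registry.select("crowd extension", "One Class SVM")
```

Keeping the models behind one lookup keyed by modeling type is what would let an operator choose a model from a list instead of writing code for each model.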

In some embodiments, each of the plurality of training models may include at least one parameter, so before performing step 303, at least one parameter set for the target training model may also be determined through the following steps (1) - (2).

(1): Display a parameter setting interface, where the parameter setting interface includes at least one parameter edit box.

It should be noted that the parameters of the target training model may include a maximum number of iterations, a convergence criterion, a regularization coefficient, a minimum convergence error, and the like.

In one possible implementation, the parameter setting interface may be a separate interface or a part of the model selection interface. Illustratively, referring to fig. 8, the parameter setting interface in fig. 8 is a part of the model selection interface, that is, on the interface shown in fig. 8, not only a plurality of training models stored for a target modeling type but also at least one parameter edit box is displayed.

(2): Acquire the at least one parameter set for the target training model from the at least one parameter edit box.

In one possible case, a value is already set in advance in the at least one parameter edit box on the parameter setting interface, and in this case the parameters need not be set in the parameter edit boxes. In other words, each of the at least one parameter takes its default value. For example, when the parameters include the maximum number of iterations, the convergence criterion, the regularization coefficient, the minimum convergence error, and the like, the maximum number of iterations may be set to 500, the convergence criterion to 1, the regularization coefficient to 0, and the minimum convergence error to 0.005 in advance. Of course, in this case, the default value in any parameter edit box may also be modified.

In another possible case, no numerical value is set in at least one parameter edit box on the parameter setting interface, and at this time, when an edit operation on the at least one parameter edit box is detected, the at least one parameter obtained after the edit may be used as the at least one parameter set for the target training model.
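The two cases in steps (1)-(2) can be sketched as merging edited values over defaults. This is an illustrative sketch only: the key names and the function `resolve_parameters` are hypothetical, while the default values are the ones given in the example above.

```python
# Default values taken from the example in the text; key names are assumptions.
DEFAULT_PARAMETERS = {
    "max_iterations": 500,
    "convergence_criterion": 1,
    "regularization_coefficient": 0,
    "min_convergence_error": 0.005,
}


def resolve_parameters(edited_values, defaults=None):
    """Return the parameters set for the target training model.

    `edited_values` maps a parameter name to the value read from its edit box,
    or to None when the operator left the box untouched.
    """
    resolved = dict(DEFAULT_PARAMETERS if defaults is None else defaults)
    for name, value in edited_values.items():
        if value is not None:
            resolved[name] = value  # an edited box overrides the preset value
    missing = [name for name, value in resolved.items() if value is None]
    if missing:
        # Covers the case where no value was preset and none was entered.
        raise ValueError(f"no value set for: {missing}")
    return resolved


parameters = resolve_parameters({"max_iterations": 200, "regularization_coefficient": None})
```

Here only the maximum number of iterations was edited, so the other three parameters keep their preset values.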

Step 303: train the target training model according to the data, in the at least one feature dimension, of each sample included in the selected sample data set.

In a possible implementation manner, training the target training model means inputting the data, in the at least one feature dimension, of each sample included in the selected sample data set into the target training model. In some embodiments, the target training model may be regarded as an algorithm, and inputting this data into the target training model means processing the data according to the algorithm. Different target training models process the data of each sample in the at least one feature dimension in different ways.

For example, after the data, in the at least one feature dimension, of each sample included in the selected sample data set is input into the target training model, the data of all samples in any one of the at least one feature dimension can be classified into multiple categories. For that feature dimension, the ratio between the number of samples corresponding to each category of data and the total number of samples can therefore be determined. These ratios are then determined as the reference values of the data in that feature dimension.

For example, the selected sample data set comprises 1000 samples and includes the feature dimensions of gender, age, and click behavior. The gender data can be divided into two categories, namely female and male. The age data can be divided into four categories, namely 0-18, 19-30, 31-40, and 41-50 years. The click behavior data can be divided into two categories, namely click and no click. The target training model determines that 600 of the 1000 samples are female, so the ratio of the number of female samples to the total number of samples is 600 divided by 1000, i.e. 0.6. The number of male samples is 400, so the ratio of the number of male samples to the total number of samples is 0.4. The numbers of samples aged 0-18, 19-30, 31-40, and 41-50 are 100, 400, 300, and 200 respectively, so the corresponding ratios to the total number of samples are 0.1, 0.4, 0.3, and 0.2. Similarly, 800 samples clicked and 200 samples did not click, so the ratio of the number of samples that clicked to the total number of samples is 0.8, and the ratio of the number of samples that did not click to the total number of samples is 0.2. After determining these ratios, the target training model may determine them as the reference values corresponding to the data of each feature dimension.
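The ratio computation in this example can be sketched as follows. The function `reference_values`, the dictionary layout of a sample, and the helper `make_sample` that rebuilds the 1000-sample data set are assumptions made for illustration; only the category counts come from the example.

```python
import math
from collections import Counter


def reference_values(samples, feature_dimensions):
    """Map each feature dimension to {category: share of samples in that category}."""
    total = len(samples)
    return {
        dim: {
            category: count / total
            for category, count in Counter(sample[dim] for sample in samples).items()
        }
        for dim in feature_dimensions
    }


def make_sample(i):
    """Hypothetical sample whose category counts match the example in the text."""
    gender = "female" if i < 600 else "male"            # 600 female, 400 male
    if i < 100:
        age = "0-18"                                     # 100 samples
    elif i < 500:
        age = "19-30"                                    # 400 samples
    elif i < 800:
        age = "31-40"                                    # 300 samples
    else:
        age = "41-50"                                    # 200 samples
    click = "no click" if i % 5 == 0 else "click"       # 200 no click, 800 click
    return {"gender": gender, "age": age, "click behavior": click}


selected_sample_data_set = [make_sample(i) for i in range(1000)]
refs = reference_values(selected_sample_data_set, ["gender", "age", "click behavior"])
```

With this data, `refs["gender"]` holds the shares for female and male, and the other two entries hold the shares for the age brackets and for click versus no click.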

In addition, after the computer device trains the target training model, the computer device may display a model training details interface including information of the target training model and a training progress of the target training model, so that the training progress of the target training model may be observed through the model training details interface.

Illustratively, assuming that the target training model selected in the model selection interface shown in fig. 7 is the difference index enhancement model, the computer device may display the model training details interface shown in fig. 9, which includes information of the target training model and the training progress of the target training model. The information of the target training model may include the target modeling type, the selected sample data set, the number of selected feature dimensions, the target training model, the number of times the target training model has been trained, the training progress of the target training model, and the like.

Typically, after the computer device determines the target training model, the target training model may be trained directly according to data of each sample included in the selected sample data set in at least one feature dimension. Of course, in some embodiments, the computer device may not perform training of the target training model first, but rather display a model training details interface. At this time, when a selection operation of a "run" option on the model training detail interface is detected, the target training model may be trained. Illustratively, training of the target training model begins when a click operation on a "run" option on the model training details interface shown in FIG. 9 is detected.

It is noted that historical training records for training other training models can also be displayed on the model training detail interface. Therefore, the operator can more conveniently master the information of the historical training records of the target training model or other training models.

Optionally, in the training process of the target training model, a training flowchart of the target training model is displayed, where the training flowchart includes a plurality of training nodes, and a display mode of each training node is a first display mode, a second display mode, or a third display mode.

It should be noted that, in some embodiments, the plurality of training nodes may include inputting the selected sample data set, determining the reference value corresponding to the data in each feature dimension, and the like. In addition, the first display mode indicates that the corresponding training node has been completed, the second display mode indicates that training is currently at the corresponding training node, and the third display mode indicates that the corresponding training node has not yet been reached.

In one possible implementation, a training flow diagram of the target training model may be displayed on a model training details interface. See, for example, fig. 10. The training flow chart of the target training model is displayed on the model training detail interface shown in fig. 10.

Illustratively, the first display mode, the second display mode, and the third display mode may be represented by different colors. For example, in the first display mode a completed training node is shown in gray, in the second display mode the training node currently in progress is shown in red, and in the third display mode a training node that has not yet been reached is shown in green. Of course, the first display mode, the second display mode, and the third display mode may be expressed in other forms, which is not limited in the embodiments of the present application.
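The mapping from training progress to display modes can be sketched as follows, assuming the color scheme of the example above. The enum `DisplayMode`, the function `display_modes`, and the third node name are hypothetical.

```python
from enum import Enum


class DisplayMode(Enum):
    FIRST = "gray"    # the training node has been completed
    SECOND = "red"    # training is currently at this node
    THIRD = "green"   # the training node has not been reached yet


def display_modes(training_nodes, current_index):
    """Assign a display mode to every node of the training flowchart."""
    modes = {}
    for index, node in enumerate(training_nodes):
        if index < current_index:
            modes[node] = DisplayMode.FIRST
        elif index == current_index:
            modes[node] = DisplayMode.SECOND
        else:
            modes[node] = DisplayMode.THIRD
    return modes


training_nodes = [
    "input selected sample data set",
    "determine reference values",
    "output trained model",  # hypothetical extra node, for illustration only
]
flowchart = display_modes(training_nodes, current_index=1)
```

Redrawing the flowchart whenever `current_index` advances would give the operator the step-by-step view of training progress described in the text.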

It should be noted that displaying the training flowchart during the training of the target training model lets the operator follow the training process more intuitively and clearly, that is, the operator can directly see which step the training of the target training model has reached.

Step 304: determine an extended data set according to the model obtained after training, the target expansion multiple, and the optimization target information.

It should be noted that the ratio between the number of samples included in the extended data set and the number of samples included in the selected sample data set is the target expansion multiple. The optimization target information refers to a matching index between the extended data set and the selected sample data set, and may be click behavior, interests, hobbies, region diversity, or the like. It should be understood that different optimization target information may lead to different extended data sets.

In addition, the target expansion multiple and the optimization target information can be set in the computer device in advance, or can be set on an interface displayed by the computer device before the training of the target training model.

Illustratively, if the target expansion factor and the optimization target information are set before the training of the target training model, referring to fig. 11, a drag bar of the expansion factor and a plurality of selection boxes of the optimization target information may also be displayed on the model selection interface. The target expansion multiple can be set by dragging the drag bar of the expansion multiple, or by clicking the "+" and "-" options. In addition, the optimization target information can be set by checking a plurality of selection boxes of the optimization target information. After the target training model is trained, the extended data set can be determined according to the model obtained after training and the target expansion multiple and the optimized target information.

For example, if the optimization target information is click behavior, after the reference values corresponding to the data of each of the at least one feature dimension are determined according to step 303 above, the reference value corresponding to the data of each feature dimension of each sample in a sample database may also be determined. Then, for each sample in the sample database, the reference values corresponding to its feature dimensions are added to obtain a reference score of the sample. All samples included in the sample database are sorted in descending order of reference score, and samples are selected from the sorted samples according to the target expansion multiple to form the extended data set. In this case, the matching index between the extended data set and the selected sample data set is the click behavior. In other words, the samples included in the extended data set and the samples included in the selected sample data set have similar click behaviors. If the ratio of the number of samples that clicked to the total number of samples is 0.8 in the selected sample data set, the corresponding ratio in the extended data set may also be 0.8. If an advertisement is delivered to all samples included in the extended data set, 80% of the samples may click on the advertisement and 20% may not.

For example, the optimization target information is click behavior, the target expansion multiple is 10, and the selected sample data set includes 1000 samples, so the extended data set needs to include 10000 samples. Continuing the example in step 303 above, suppose the sample database includes 20000 samples, each with data in the 2 feature dimensions of gender and age, where the reference value is 0.6 for female and 0.4 for male, and the reference values for the ages 0-18, 19-30, 31-40, and 41-50 are 0.1, 0.4, 0.3, and 0.2 respectively. The reference score of each of the 20000 samples is then determined. Specifically, for each sample, the reference value corresponding to its gender data is added to the reference value corresponding to its age data to obtain the reference score of the sample. The 20000 samples are then sorted in descending order of reference score, and the samples ranked 1 to 10000 are selected to form the extended data set. Since the optimization target information is click behavior, if an advertisement is delivered to the 10000 samples in the extended data set, 8000 samples may click on the advertisement and 2000 samples may not.
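The scoring and selection in this example can be sketched as follows. The reference values, the database size of 20000, the selected set size of 1000, and the expansion multiple of 10 come from the text. The functions `reference_score` and `expand`, the synthetic database, and the choice to score an unseen category as 0 are assumptions of this sketch.

```python
REFERENCE_VALUES = {
    "gender": {"female": 0.6, "male": 0.4},
    "age": {"0-18": 0.1, "19-30": 0.4, "31-40": 0.3, "41-50": 0.2},
}


def reference_score(sample, reference_values):
    """Add up the reference values of the sample's data in every feature dimension."""
    # Assumption: a category never seen during training contributes 0.
    return sum(
        table.get(sample[dim], 0.0) for dim, table in reference_values.items()
    )


def expand(sample_database, reference_values, selected_size, expansion_multiple):
    """Pick the highest-scoring samples so that the result holds
    selected_size * expansion_multiple samples."""
    target_size = selected_size * expansion_multiple
    ranked = sorted(
        sample_database,
        key=lambda sample: reference_score(sample, reference_values),
        reverse=True,
    )
    return ranked[:target_size]


# Synthetic database: 20000 samples spread evenly over the 8 gender/age combinations.
genders = ["female", "male"]
ages = ["0-18", "19-30", "31-40", "41-50"]
sample_database = [
    {"id": i, "gender": genders[i % 2], "age": ages[(i // 2) % 4]}
    for i in range(20000)
]

extended_data_set = expand(
    sample_database, REFERENCE_VALUES, selected_size=1000, expansion_multiple=10
)
```

Because the ranking only needs the top `target_size` samples, a partial selection such as `heapq.nlargest` would also work for a large sample database; a full sort is used here to mirror the wording of the text.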

Optionally, before step 304, weights of enhancement indicators may also be set. The enhancement indicators may include purchasing power, subjective interest, browsing history, and the like.

It should be noted that each of the 3 enhancement indicators of purchasing power, subjective interest, and browsing history may correspond to a plurality of related feature dimensions. After the weights of the 3 enhancement indicators are set, for each sample in the extended data set, the reference values corresponding to the data of the feature dimensions related to each enhancement indicator may be multiplied by the weight of that enhancement indicator, so as to determine the importance of the 3 enhancement indicators in the extended data set. That is, the greater the weight, the higher the importance, and the smaller the weight, the lower the importance.

In one possible implementation, referring to FIG. 12, edit boxes of purchasing power, subjective interest, and weights of browsing history may be displayed on the model selection interface. In a possible case, the weights of the 3 enhancement indicators are set in advance in the edit boxes of the weights of the purchasing power, the subjective interest and the browsing history on the model selection interface, and in this case, the weights of the 3 enhancement indicators may not be set in the edit boxes of the weights of the 3 enhancement indicators. Or, on the model selection interface, the weight of the 3 enhancement indicators is not set in the edit boxes of the weights of the purchasing power, the subjective interest and the browsing history, and at this time, when the edit operation of the edit boxes of the weight of the 3 enhancement indicators is detected, the 3 weights obtained after the edit can be used as the weights corresponding to the purchasing power, the subjective interest and the browsing history.

For example, suppose the weights of purchasing power, subjective interest, and browsing history are 0.5, 0.3, and 0.1 respectively. For each sample in the sample database, the reference values corresponding to the feature dimensions related to purchasing power may be multiplied by 0.5, the reference values corresponding to the feature dimensions related to subjective interest may be multiplied by 0.3, and the reference values corresponding to the feature dimensions related to browsing history may be multiplied by 0.1. The weighted values are then summed to obtain the reference score of the sample.
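The weighted reference score can be sketched as follows. The weights 0.5, 0.3, and 0.1 come from the example above. The feature dimensions assigned to each enhancement indicator and their reference values are invented for illustration, since the application does not list them.

```python
import math

INDICATOR_WEIGHTS = {
    "purchasing power": 0.5,
    "subjective interest": 0.3,
    "browsing history": 0.1,
}

# Hypothetical mapping from each enhancement indicator to related feature dimensions.
INDICATOR_DIMENSIONS = {
    "purchasing power": ["income band"],
    "subjective interest": ["favorite category"],
    "browsing history": ["visited last week"],
}

# Hypothetical reference values, of the kind determined in step 303.
REFERENCE_VALUES = {
    "income band": {"high": 0.4, "low": 0.6},
    "favorite category": {"sports": 0.5, "books": 0.5},
    "visited last week": {"yes": 0.8, "no": 0.2},
}


def weighted_reference_score(sample, reference_values, indicator_dimensions, weights):
    """Multiply each reference value by its indicator's weight, then sum."""
    score = 0.0
    for indicator, dimensions in indicator_dimensions.items():
        for dim in dimensions:
            score += weights[indicator] * reference_values[dim].get(sample[dim], 0.0)
    return score


sample = {"income band": "high", "favorite category": "sports", "visited last week": "yes"}
score = weighted_reference_score(
    sample, REFERENCE_VALUES, INDICATOR_DIMENSIONS, INDICATOR_WEIGHTS
)
# 0.5 * 0.4 + 0.3 * 0.5 + 0.1 * 0.8
```

Raising the weight of one indicator increases the influence of its feature dimensions on the ranking, which is the sense in which a larger weight means higher importance.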

Optionally, the following steps A-B may also be included after step 304.

Step A: display the evaluation result.

It should be noted that the evaluation result is used to evaluate the extended data set determined by the model obtained after training. The evaluation result may include a plurality of evaluation indexes. The plurality of evaluation indexes may be represented by a Receiver Operating Characteristic (ROC) graph, a precision-recall (P-R) graph, a graph of the optimization target against the crowd expansion multiple, a distribution diagram of the samples included in the selected sample data set and the samples included in the extended data set, and the like. In addition, the plurality of evaluation indexes may further include accuracy, which is used to evaluate how accurate the extended data set is: the higher the accuracy, the more accurate the extended data set, and the lower the accuracy, the less accurate the extended data set.

The ROC graph has the false positive rate as the horizontal axis and the true positive rate as the vertical axis. Each point on the ROC curve reflects the sensitivity of the extended data set to the same signal stimulus. The false positive rate refers to the ratio between the number of negative samples predicted to be positive by the target training model in the extended data set and the number of actual negative samples, and the true positive rate refers to the ratio between the number of positive samples predicted to be positive by the target training model in the extended data set and the number of actual positive samples. Positive samples and negative samples refer to two classes of samples divided according to some criterion. A larger Area Under the Curve (AUC), that is, a larger area enclosed between the ROC curve and the horizontal axis, indicates a higher quality of the extended data set, and a smaller area indicates a lower quality of the extended data set.

The P-R graph has the recall ratio as the horizontal axis and the precision ratio as the vertical axis. The recall ratio refers to the ratio between the number of positive samples predicted to be positive by the target training model and the number of all actual positive samples in the extended data set, and the precision ratio refers to the ratio between the number of positive samples predicted to be positive by the target training model and the number of all samples predicted to be positive by the target training model in the extended data set. A larger area between the P-R curve and the horizontal and vertical axes indicates a higher quality of the extended data set, and a smaller area indicates a lower quality of the extended data set.
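The rates defined for the ROC and P-R graphs can be sketched as follows. The labels and scores are invented for illustration, the function names are hypothetical, and the area is computed with the trapezoidal rule over the thresholds present in the data, which is one common way to obtain the AUC and not necessarily the one used in the application. The sketch assumes both positive and negative samples are present.

```python
import math


def rates(labels, scores, threshold):
    """Return the ROC and P-R quantities when samples scoring at or above
    `threshold` are predicted to be positive."""
    tp = fp = fn = tn = 0
    for is_positive, score in zip(labels, scores):
        predicted_positive = score >= threshold
        if predicted_positive and is_positive:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif is_positive:
            fn += 1
        else:
            tn += 1
    return {
        "true_positive_rate": tp / (tp + fn),   # identical to the recall ratio
        "false_positive_rate": fp / (fp + tn),
        # Assumption: precision is taken as 1.0 when nothing is predicted positive.
        "precision": tp / (tp + fp) if tp + fp else 1.0,
    }


def roc_auc(labels, scores):
    """Area under the ROC curve by the trapezoidal rule."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        r = rates(labels, scores, threshold)
        points.append((r["false_positive_rate"], r["true_positive_rate"]))
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


labels = [1, 1, 0, 1, 0, 0]               # 1 = positive sample, 0 = negative sample
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # hypothetical model outputs
at_07 = rates(labels, scores, threshold=0.7)
auc = roc_auc(labels, scores)
```

In this toy data, 8 of the 9 positive-negative pairs are ranked correctly, which is why the area comes out at 8/9.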

The graph of the optimization target against the crowd expansion multiple has the expansion multiple as the horizontal axis and the click rate as the vertical axis. Generally, the larger the expansion multiple, the lower the click rate. In other words, the larger the expansion multiple, the lower the quality of the extended data set.

The distribution diagram of the samples included in the selected sample data set and the samples included in the extended data set is obtained by mapping both groups of samples onto a two-dimensional plane using a certain technique, so that the similarity between the selected sample data set and the extended data set can be observed visually.

In one possible implementation, the evaluation result may be displayed on the model training details interface. Specifically, when an evaluation result display operation is detected, the evaluation result may be displayed on the model training details interface. Illustratively, referring to FIG. 9, when a click operation on the "evaluate" option on the model training details interface shown in FIG. 9 is detected, the evaluation result may be displayed on the model training details interface.

It should be noted that the evaluation result is data from which the quality of the extended data set determined by the model obtained after training can be judged, that is, the better the evaluation result, the higher the quality of the determined extended data set. Displaying the evaluation result therefore helps the operator judge the quality of the determined extended data set intuitively. Moreover, because the evaluation result is displayed on the interface of the computer device, the operator only needs to view it and does not need other tools to obtain it, which is more convenient and saves effort.

Step B: adjust at least one parameter included in the target training model according to the evaluation result.

In a possible implementation manner, at least one parameter included in the target training model may be adjusted according to a plurality of evaluation indexes included in the evaluation result, so that the evaluation result of the extended data set determined by the target training model better meets a preset requirement.

Step 305: deploy a policy for each sample included in the extended data set.

In one possible implementation, a policy may be deployed for each sample included in the extended data set according to the optimization target information. For example, the optimization target information is click behavior, and each sample in the selected sample data set may be a sample to which an advertisement has been delivered, that is, each sample in the selected sample data set either clicked or did not click on the delivered advertisement. In this case, the same advertisement can be delivered to the extended data set. The ratio between the number of samples in the extended data set that click on the advertisement and the total number of samples included in the extended data set is similar to the corresponding ratio in the selected sample data set, and the ratio between the number of samples in the extended data set that do not click on the advertisement and the total number of samples included in the extended data set is likewise similar to the corresponding ratio in the selected sample data set. In other words, the click behavior of the samples in the extended data set on the delivered advertisement is similar to that of the samples in the selected sample data set.

In one possible implementation, after the computer device determines the extended data set, the extended data set may be stored and the computer device may display a deployment interface. The deployment interface includes related information of the extended data set and an option for deploying a policy for the extended data set. Specifically, after the computer device determines the extended data set, the deployment interface is displayed when a selection operation on a deployment tab is detected. Alternatively, the deployment interface is displayed when a selection operation on a "next" option in the model training details interface is detected.

Illustratively, referring to fig. 13, the deployment interface shown in fig. 13 includes related information of the extended data set, namely the target modeling type, the target training model, the data processing manner, and the evaluation indexes. The data processing manner is the manner in which the target training model processes the selected sample data set, as mentioned in step 303 above, and the evaluation indexes are those mentioned in step A above, including AUC, accuracy, and the like. When a click operation on a "deploy policy" option on the deployment interface is detected, a policy may be deployed for each sample included in the extended data set. Specifically, the extended data set may be deployed in a certain production environment, and a corresponding policy may be deployed for each sample included in the extended data set according to the usage requirements of the production environment. Of course, a policy may also be deployed for each sample included in the extended data set in other ways, which is not limited in the embodiments of the present application.

In the embodiments of the present application, a sample data set corresponding to a target modeling type is selected from a plurality of stored sample data sets, the selected sample data set including a plurality of samples. Because each sample in the selected sample data set includes data of a plurality of feature dimensions, at least one feature dimension may be selected from the plurality of feature dimensions, and a target training model may be selected from a plurality of training models stored for the target modeling type. The target training model is then trained according to the data, in the at least one feature dimension, of each sample included in the selected sample data set. An extended data set is then determined from the trained model according to the target expansion multiple and the optimization target information. Because the optimization target information is a matching index between the extended data set and the selected sample data set, the extended data set is similar to the selected sample data set. Finally, a policy may be deployed for each sample included in the extended data set. Because the plurality of training models for the target modeling type are stored in the computer device in advance, when a target training model needs to be trained, it can be selected directly from the plurality of training models and then trained on the data, in the at least one feature dimension, of each sample included in the selected sample data set. Furthermore, the target training model may be trained multiple times, or different training models may be selected in turn as the target training model.
The data processing method provided by the embodiments of the present application therefore does not require an operator to write different code for each of the plurality of training models, nor does it require the operator to have a coding background. This simplifies the training of the target training model and makes determining the extended data set simpler and more efficient.
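The idea of selecting a pre-stored training model rather than writing new code can be sketched as follows. This is a minimal Python illustration under stated assumptions: the registry layout, the modeling type name `audience_expansion`, the toy `mean_difference_trainer`, and the feature names are all hypothetical and are not taken from the embodiment.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]
Scorer = Callable[[Row], float]
Trainer = Callable[[List[Row], List[int]], Scorer]


def mean_difference_trainer(rows: List[Row], labels: List[int]) -> Scorer:
    """Toy training model: weight each feature by the difference between its
    mean over positive samples and its mean over negative samples."""
    features = list(rows[0].keys())
    positives = [r for r, y in zip(rows, labels) if y == 1]
    negatives = [r for r, y in zip(rows, labels) if y == 0]
    weights = {
        f: sum(r[f] for r in positives) / len(positives)
        - sum(r[f] for r in negatives) / len(negatives)
        for f in features
    }
    return lambda row: sum(weights[f] * row[f] for f in features)


# Training models stored in advance, keyed by modeling type and model name
# (hypothetical layout).
MODEL_REGISTRY: Dict[str, Dict[str, Trainer]] = {
    "audience_expansion": {"mean_difference": mean_difference_trainer},
}


def train_selected_model(
    modeling_type: str,
    model_name: str,
    rows: List[Row],
    labels: List[int],
    feature_dimensions: List[str],
) -> Scorer:
    """Look up a stored training model and train it on the chosen dimensions."""
    trainer = MODEL_REGISTRY[modeling_type][model_name]
    projected = [{d: row[d] for d in feature_dimensions} for row in rows]
    return trainer(projected, labels)


samples = [
    {"clicks": 9.0, "visits": 4.0, "noise": 1.0},
    {"clicks": 8.0, "visits": 5.0, "noise": 7.0},
    {"clicks": 1.0, "visits": 1.0, "noise": 6.0},
    {"clicks": 2.0, "visits": 0.0, "noise": 2.0},
]
labels = [1, 1, 0, 0]
scorer = train_selected_model(
    "audience_expansion", "mean_difference", samples, labels, ["clicks", "visits"]
)
```

In this sketch the operator supplies only names and a list of feature dimensions; training a different model would amount to passing a different `model_name`.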

Fig. 14 is a block diagram of a data processing apparatus according to an embodiment of the present application, which is applied to a computer device. Referring to fig. 14, the apparatus includes: a first selection module 1401, a second selection module 1402, a training module 1403, and a determination module 1404.

a first selection module 1401, configured to select, from a plurality of stored sample data sets, a sample data set corresponding to a target modeling type, where the selected sample data set includes a plurality of samples, and each sample includes data of a plurality of feature dimensions;

a second selection module 1402, configured to select at least one feature dimension from the plurality of feature dimensions, and select a target training model from a plurality of training models stored for the target modeling type;

a training module 1403, configured to train the target training model according to the data of each sample included in the selected sample data set in the at least one feature dimension;

a determining module 1404, configured to determine, according to the model obtained after training, an extended data set according to a target expansion multiple and optimization target information, where a ratio between the number of samples included in the extended data set and the number of samples included in the selected sample data set is the target expansion multiple, and the optimization target information is a matching index between the extended data set and the selected sample data set.
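The role of the target expansion multiple in the determining module can be sketched in Python as below. The candidate pool, the ranking by model score, and the function name are assumptions of this sketch; how the optimization target information enters the selection is not specified in this passage, so the sketch simply keeps the highest-scoring candidates.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]


def determine_extended_data_set(
    scorer: Callable[[Row], float],
    candidate_pool: List[Row],
    seed_count: int,
    target_expansion_multiple: float,
) -> List[Row]:
    """Pick the top-scoring candidates so that the extended data set holds
    seed_count * target_expansion_multiple samples."""
    target_size = round(seed_count * target_expansion_multiple)
    if target_size > len(candidate_pool):
        raise ValueError("candidate pool is too small for the requested multiple")
    ranked = sorted(candidate_pool, key=scorer, reverse=True)
    return ranked[:target_size]


# Hypothetical pool of candidates scored by a single feature.
pool = [{"clicks": float(c)} for c in (5, 1, 9, 3, 7, 2)]
extended = determine_extended_data_set(
    scorer=lambda row: row["clicks"],
    candidate_pool=pool,
    seed_count=2,
    target_expansion_multiple=2,
)
```

With two samples in the selected sample data set and a multiple of 2, the extended data set holds four samples, so the ratio of the two counts equals the target expansion multiple.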

Optionally, the apparatus further comprises:

a second display module, configured to display an evaluation result, where the evaluation result is used to evaluate the extended data set determined from the model obtained after training.
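The description names AUC and accuracy among the evaluation indices. As a minimal, dependency-free sketch of how such indices could be computed from labels and model scores, consider the Python below; the 0.5 decision threshold and the function names are assumptions of the sketch, not part of the embodiment.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], scores: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of samples whose thresholded score matches the label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive is scored above a randomly chosen negative (a tie counts
    as one half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [1, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.2]
evaluation_result = {"accuracy": accuracy(labels, scores), "auc": auc(labels, scores)}
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a sketch; a rank-based computation would be preferable for large data sets.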

Optionally, the apparatus further comprises:

a third display module, configured to display a training flow chart of the target training model during the training of the target training model, where the training flow chart includes a plurality of training nodes, each training node is displayed in a first display mode, a second display mode, or a third display mode, the first display mode indicates that the corresponding training node has been completed, the second display mode indicates that the corresponding training node is in progress, and the third display mode indicates that the corresponding training node has not yet been reached.
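The three display modes amount to a mapping from training progress to a per-node state. A minimal Python sketch follows, assuming the training nodes run in a fixed order and that progress is tracked as the index of the node currently running; the node names and the `node_display_modes` helper are hypothetical.

```python
from enum import Enum
from typing import Dict, List


class DisplayMode(Enum):
    FIRST = "completed"
    SECOND = "in progress"
    THIRD = "not reached"


def node_display_modes(nodes: List[str], current: int) -> Dict[str, DisplayMode]:
    """Assign a display mode to each training node, given the index of the
    node that is currently running."""
    modes: Dict[str, DisplayMode] = {}
    for index, name in enumerate(nodes):
        if index < current:
            modes[name] = DisplayMode.FIRST
        elif index == current:
            modes[name] = DisplayMode.SECOND
        else:
            modes[name] = DisplayMode.THIRD
    return modes


nodes = ["load data", "select features", "train model", "evaluate model"]
modes = node_display_modes(nodes, current=2)
```

Setting `current` to `len(nodes)` marks every node as completed, which corresponds to a finished training run.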

Optionally, the apparatus further comprises:

a deployment module, configured to deploy a policy for each sample included in the extended data set.

It should be noted that, when the data processing apparatus provided in the above embodiment performs data processing, the above division into functional modules is used only as an example. In practical applications, the functions may be allocated to different functional modules as needed; that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the functions described above. In addition, the data processing apparatus and the data processing method provided by the above embodiments belong to the same concept; for the specific implementation process, refer to the method embodiments, which are not repeated here.

Fig. 15 is a schematic structural diagram of a data processing apparatus 1500 according to an embodiment of the present application. The data processing apparatus 1500 may vary considerably depending on its configuration or performance, and may include one or more central processing units (CPUs) 1501 and one or more memories 1502, where the memory 1502 stores at least one instruction, and the at least one instruction is loaded and executed by the processor 1501 to implement the data processing method in the foregoing embodiments. Of course, the data processing apparatus 1500 may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input and output, and may include other components for implementing the functions of the device, which are not described here.

In an exemplary embodiment, a computer-readable storage medium including instructions, such as a memory, is also provided, where the instructions are executable by a processor in the data processing apparatus to perform the data processing method in the above-described embodiments. For example, the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is merely exemplary of the present application and is not intended to limit the present application. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims

1. A method of data processing, the method comprising:

determining an extended data set according to a model obtained after training and according to a target expansion multiple and optimization target information, wherein the ratio of the number of samples included in the extended data set to the number of samples included in the selected sample data set is the target expansion multiple, the optimization target information refers to a matching index between the extended data set and the selected sample data set, and the optimization target information comprises at least one of click behavior, interests and hobbies, and regional diversity;

the method further comprises the following steps:

deleting the training model by deleting the code of the training model;

adding a training model by storing the code of the training model;

before training the target training model according to the data of each sample included in the selected sample data set in the at least one feature dimension, the method further includes:

displaying a model selection interface;

displaying, in the model selection interface, a slider for the expansion multiple and a plurality of check boxes for the optimization target information;

and setting the target expansion multiple by dragging the slider for the expansion multiple, and setting the optimization target information by checking the check boxes for the plurality of optimization target information.

2. The method of claim 1, wherein before training the target training model according to the data of each sample in the at least one feature dimension comprised in the selected set of sample data, the method further comprises:

3. The method of claim 1, wherein after determining the expanded data set according to the target expansion multiple and the optimization target information based on the trained model, the method further comprises:

4. The method of claim 1, wherein the method further comprises:

5. The method of claim 1, wherein after determining the expanded data set according to the target expansion multiple and the optimization target information based on the trained model, the method further comprises:

deploying a policy for each sample included in the extended data set.

6. A data processing apparatus, characterized in that the apparatus comprises:

a determining module, configured to determine an extended data set according to a model obtained after training and according to a target expansion multiple and optimization target information, where a ratio between the number of samples included in the extended data set and the number of samples included in the selected sample data set is the target expansion multiple, and the optimization target information is a matching index between the extended data set and the selected sample data set;

the apparatus also includes means for:

deleting the training model by deleting the code of the training model;

adding a training model by storing the code of the training model;

the apparatus also includes means for:

displaying a model selection interface;

7. The apparatus of claim 6, wherein the apparatus further comprises:

8. The apparatus of claim 6, wherein the apparatus further comprises:

9. A data processing apparatus, characterized in that the apparatus comprises:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to perform the steps of the method of any one of claims 1-5.

10. A computer-readable storage medium having instructions stored thereon, wherein the instructions, when executed by a processor, implement the steps of the method of any of claims 1-5.