CN112417304B

CN112417304B - Data analysis service recommendation method and system for constructing data analysis flow

Info

Publication number: CN112417304B
Application number: CN202011453932.3A
Authority: CN
Inventors: 王菁; 赵汝涛; 陈高建; 栗倩文; 袁云静
Original assignee: North China University of Technology
Current assignee: North China University of Technology
Priority date: 2020-12-10
Filing date: 2020-12-10
Publication date: 2023-06-23
Anticipated expiration: 2040-12-10
Also published as: CN112417304A

Abstract

The invention provides a data analysis service recommendation method for constructing a data analysis flow, comprising the following steps: S1, acquiring a data set to be analyzed and data analysis task description information corresponding to the data set to be analyzed; S2, in response to a first recommendation requirement for a data analysis service, performing first-step single-step service recommendation for the data set to be analyzed from the data analysis service set, according to the data analysis task description information corresponding to the data set to be analyzed and the characteristics of the data set to be analyzed, to obtain a first-step single-step service recommendation set. The invention provides instant service recommendation to the user while the data analysis flow is being constructed and guides the user's decisions, so that ordinary business users without expert data analysis knowledge can conveniently use data analysis services to meet business requirements. This greatly reduces the difficulty of data analysis, saves labor cost, and improves the quality and efficiency of data analysis.

Description

Data analysis service recommendation method and system for constructing data analysis flow

Technical Field

The invention relates to the field of data analysis, in particular to a method for constructing a data analysis service combination flow based on machine learning, and more particularly relates to a data analysis service recommendation method and system for constructing a data analysis flow.

Background

Data analysis often requires multiple services to be combined into a data analysis flow. A data analysis flow comprises a plurality of data analysis services and the sequential execution relationship between them. A data analysis service encapsulates a data analysis algorithm, such as data preprocessing, feature selection, classification, regression or evaluation, and exposes an interface through which the algorithm is invoked. When the flow runs, the data analysis services are called in turn according to the execution order to process the data set to be analyzed; the output of each step is used as the input of the next step, and the data analysis result is finally obtained. In the prior art, however, service selection and parameter optimization are difficult when constructing a data analysis flow: choosing suitable services and tuning parameters for different data sets is laborious, and the final result is often obtained only after repeated exploration. For example, the traffic management department of a province wants to predict near-term traffic flow by analyzing data from highway toll stations, so that lanes can be arranged in advance to relieve congestion. The traffic managers' goal is to build a model that predicts traffic flow, but deciding which algorithms to use and which parameters to set requires expert data analysis knowledge, which is very difficult for them. Moreover, because of the uncertainty of data analysis, a user often needs to modify the flow after checking the result, for example by introducing or replacing services or modifying service parameters, and a satisfactory result is obtained only after multiple attempts, which poses an even greater challenge for data analysts.
Traditional service recommendation techniques address the general service composition problem; they do not recommend data analysis services to users interactively over multiple rounds according to the specific requirements of data analysis, and therefore suffer from low recommendation efficiency and inaccurate recommendation results.

Disclosure of Invention

Therefore, an object of the present invention is to overcome the above-mentioned drawbacks of the prior art, and to provide a new data analysis service recommendation method and system for constructing a data analysis flow.

According to a first aspect of the present invention, there is provided a data analysis service recommendation method for constructing a data analysis flow, used to recommend data analysis services while a data analysis flow is constructed for a data set to be analyzed, the method comprising: S1, acquiring a data set to be analyzed and data analysis task description information corresponding to the data set to be analyzed; S2, in response to a first recommendation requirement for a data analysis service, performing first-step single-step service recommendation for the data set to be analyzed from the data analysis service set, according to the data analysis task description information corresponding to the data set to be analyzed and the characteristics of the data set to be analyzed, to obtain a first-step single-step service recommendation set, wherein the data analysis service set is the set of all data analysis services available for constructing a data analysis flow. Preferably, the recommendation method further comprises: S3, in response to a further recommendation requirement for a data analysis service, performing single-step service recommendation for the data set to be analyzed from the data analysis service set, starting from the next step of the constructed flow segment, according to the data analysis task description information corresponding to the data set to be analyzed, the characteristics of the data set to be analyzed and the constructed data analysis flow segment.

The data analysis task description information corresponding to the data set to be analyzed comprises at least: a data analysis task type, a running-time upper limit, a task target data feature set, a performance evaluation index set, and a data set division strategy.

In some embodiments of the present invention, the first-step single-step service recommendation is performed as follows: S21, generating an available data analysis flow set and a performance score for each data analysis flow, based on the data analysis service set, according to the data analysis task description information corresponding to the data set to be analyzed and the characteristics of the data set to be analyzed; S22, selecting from the available data analysis flow set the data analysis flows whose comprehensive score is higher than a preset threshold, and collecting the data analysis services corresponding to the first step of all selected flows to generate a first-step single-step recommended service set. Preferably, in step S21 the available data analysis flow set is generated by: S211, extracting the data set to be analyzed and calculating its meta-features; S212, based on the meta-features of the data set to be analyzed, taking the data analysis service corresponding to each step of a data analysis flow as a node and performing multiple search cycles with a game model to obtain an available data analysis flow set composed of multiple data analysis flows, wherein each search generates one available data analysis flow and obtains the performance score of that flow.
In some embodiments of the present invention, each cycle in step S212 includes: S2121, initializing a game model, wherein the game model comprises a first Monte Carlo tree search model, a second Monte Carlo tree search model and a first LSTM prediction model connected with the first Monte Carlo tree search model; S2122, taking the data analysis service corresponding to each step of the data analysis flow as a node, performing multiple searches with the first Monte Carlo tree search model and the first LSTM prediction model to obtain a plurality of data analysis flows and the performance score of each, and training the first LSTM prediction model with the obtained data analysis flows, the meta-features of the data set to be analyzed and the current state of the first Monte Carlo tree search model to obtain a second LSTM prediction model; S2123, taking the first Monte Carlo tree search model with the first LSTM prediction model as a first participant and the second Monte Carlo tree search model with the second LSTM prediction model as a second participant, performing a game search and comparing the search results, taking the LSTM prediction model of the participant that first finds an available data analysis flow as the first LSTM prediction model of the next cycle, and taking that first-found available data analysis flow as the available data analysis flow found in this cycle.
Preferably, when the first Monte Carlo tree search model with the first LSTM prediction model and the second Monte Carlo tree search model with the second LSTM prediction model perform the game search, a comprehensive score is calculated in turn for each data analysis service in the data analysis service set, starting from the root node of the Monte Carlo tree, wherein: for each level of child nodes, the data analysis services whose comprehensive score exceeds a default threshold are taken as the available services for that level and form an available service set; the data analysis service with the highest comprehensive score in the available service set of the current level is selected as the data analysis service of the current-level child node, and the search continues with the subsequent nodes of that node until the data analysis flow is complete; after the search ends, the access count and quality value of each node visited in the search are back-propagated up the Monte Carlo tree, an available data analysis flow is generated from the data analysis services selected at each level, and the flow is executed to obtain its performance score. Preferably, the comprehensive score of each data analysis service is calculated by:

where T(l, a) is the service association constraint strength between the data analysis services of the preceding and following nodes, l is the node preceding the current node in the flow, and a is the data analysis service selected at the current node; Q(s, a) is the quality value of the current node; P(s, a) is the probability of selecting the service, as predicted by the LSTM model; N(s) is the number of MCTS searches; N(s, a) is the number of times the service has been accessed; s is the current state of the Monte Carlo tree; and w and c are weights determined from accumulated historical experimental data.

In some embodiments of the present invention, the service association constraint strength between the data analysis services corresponding to the preceding and following nodes is obtained by mining historical data analysis flows.

The quality value of the current node is calculated as follows:

Q(s, a) = (N(s, a) * Q(s, a) + v) / (1 + N(s, a))

where v is a preset parameter of the Monte Carlo tree search model.
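
By way of non-limiting illustration, the quality-value update above can be sketched in Python as follows. The function name `update_quality`, the sample values of v, and the choice to increment the access count N(s, a) by one after each update are assumptions of this sketch and are not taken from the text.

```python
from typing import Tuple


def update_quality(q: float, n: int, v: float) -> Tuple[float, int]:
    """Apply Q(s,a) = (N(s,a)*Q(s,a) + v) / (1 + N(s,a)).

    Incrementing the access count afterwards is an assumption of this sketch.
    """
    new_q = (n * q + v) / (1 + n)
    return new_q, n + 1


# Hypothetical values of v fed back over three searches.
q, n = 0.0, 0
for v in (0.8, 0.6, 0.7):
    q, n = update_quality(q, n, v)
```

Under the stated assumption, the update keeps Q(s, a) equal to the running mean of the values of v seen so far, which is 0.7 for the three sample values.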

In some embodiments of the present invention, single-step service recommendation from the next step of the constructed flow segment uses the same method as the first-step single-step service recommendation, except that during the game search the comprehensive score of each data analysis service in the data analysis service set is calculated in turn starting from the node of the Monte Carlo tree that corresponds to the last service of the constructed flow segment, wherein: for each level of child nodes, the data analysis services whose comprehensive score exceeds a default threshold are taken as the available services for that level and form an available service set; the data analysis service with the highest comprehensive score in the available service set of the current level is selected as the data analysis service of the current-level child node, and the search continues with the subsequent nodes of that node until the data analysis flow is complete; after the search ends, the access count and quality value of each node visited in the search are back-propagated up the Monte Carlo tree, an available data analysis flow is generated from the data analysis services selected at each level, and the flow is executed to obtain its performance score.

According to a second aspect of the present invention, there is provided a data analysis service recommendation system for constructing a data analysis flow, the recommendation system comprising: a service recommendation interface module for receiving a data set to be analyzed, a data analysis task and a data analysis flow segment constructed for the data set to be analyzed, and for returning a service recommendation result; a data analysis flow generation module for generating an available data analysis flow set and a performance score for each data analysis flow, based on the data analysis service set, according to the data analysis task description information corresponding to the data set to be analyzed, the characteristics of the data set to be analyzed and the constructed data analysis flow segment; a single-step service recommendation module for selecting from the available data analysis flow set the data analysis flows whose comprehensive score is higher than a preset threshold, and collecting the data analysis services corresponding to the next step after the constructed data analysis flow segment in all selected flows to generate the single-step recommended service set for that step; a data analysis service management module for managing service information related to data analysis; and a service association relation management module for managing the association relations between data analysis services.

Compared with the prior art, the invention has the following advantages: it provides instant service recommendation to the user while the data analysis flow is being constructed and guides the user's decisions, so that ordinary business users without professional data analysis knowledge can conveniently use data analysis services to meet business requirements; it adapts better to the personalized requirements of the user, greatly reduces the difficulty of data analysis, saves labor cost, and improves the quality and efficiency of data analysis.

Drawings

Embodiments of the invention are further described below with reference to the accompanying drawings, in which:

FIG. 1 is a flow chart of a data analysis service recommendation method for constructing a data analysis flow according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a usable data analysis flow generation process according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating an exemplary search of a Monte Carlo tree search model according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a data analysis service recommendation system for constructing a data analysis flow according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail by means of specific examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

In order to solve the problems of low recommendation efficiency and inaccurate recommendation results in the prior art, the invention performs service recommendation for a specific data set and a specific data analysis task, thereby improving both the recommendation efficiency and the accuracy of the recommendation results.

According to an embodiment of the present invention, as shown in FIG. 1, the data analysis service recommendation method for constructing a data analysis flow includes steps S1, S2 and S3, each of which is described in detail below.

In step S1, a data set to be analyzed and data analysis task description information are acquired. First, the data set to be analyzed, as specified by the user, is acquired. The data set may be in text form, for example a CSV (comma-separated values) file in which each row is a record, each record consists of several fields, and the fields are separated by a delimiter such as a comma or a tab; it may also be a set of image files together with a description file for the set, containing the name, label information, etc. of each image file. The data analysis task description information comprises the data analysis task type (such as classification or regression), a running-time upper limit, a task target data feature set, a performance evaluation index set, and a data set division strategy (by default, a training set to test set ratio of 6:4). Common evaluation indexes for regression tasks include mean squared error (MSE), root mean squared error (RMSE) and mean absolute error (MAE); common evaluation indexes for classification tasks include accuracy, precision, recall, the P-R (precision-recall) curve, F1 score, the confusion matrix, ROC and AUC.
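
By way of non-limiting illustration, the inputs of step S1 could be represented in Python as sketched below. The class name `TaskDescription`, its field names, the helper `load_csv` and the sample toll-station data are hypothetical and are introduced only for this sketch; only the five items of the task description and the default 6:4 division come from the text.

```python
import csv
import io
from dataclasses import dataclass
from typing import Dict, List, Tuple

Row = Dict[str, str]


@dataclass
class TaskDescription:
    task_type: str               # e.g. "classification" or "regression"
    max_runtime_seconds: float   # running-time upper limit
    target_features: List[str]   # task target data feature set
    metrics: List[str]           # performance evaluation index set
    train_ratio: float = 0.6     # default 6:4 training/test division

    def split(self, rows: List[Row]) -> Tuple[List[Row], List[Row]]:
        """Divide records into training and test sets by position."""
        cut = round(len(rows) * self.train_ratio)
        return rows[:cut], rows[cut:]


def load_csv(text: str) -> List[Row]:
    """Parse comma-separated text; each row becomes one record."""
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical toll-station records, invented for illustration.
SAMPLE = "station,hour,flow\nA,8,120\nA,9,150\nB,8,95\nB,9,110\nC,8,130\n"
rows = load_csv(SAMPLE)
task = TaskDescription("regression", 600.0, ["flow"], ["RMSE", "MAE"])
train, test = task.split(rows)
```

The positional split is chosen here only for brevity; a random or stratified division would serve equally well as a division strategy.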

In step S2, a set of data analysis services, together with the data analysis flows and performance values associated with each service, is recommended to the user on the basis of the data features and the service associations. According to the data analysis task description information corresponding to the data set to be analyzed and the characteristics of the data set, first-step single-step service recommendation is performed for the data set from the data analysis service set to obtain a first-step single-step service recommendation set, and a first-step data analysis service is selected from that set to form a data analysis flow segment, wherein the data analysis service set is the set of all data analysis services available for constructing a data analysis flow. According to one embodiment of the invention, the first-step single-step service recommendation comprises two parts, data analysis flow generation and single-step service recommendation, wherein:

1) Data analysis flow generation:

Based on the data set to be analyzed and the data analysis task description information, data analysis flows are computed and a group of candidate data analysis flows is generated to form an available data analysis flow set. In this embodiment a game model is used to generate the data analysis flows; as shown in FIG. 2, the adversarial game carried out with the game model comprises three stages: data set meta-feature calculation, game model initialization, and the game stage.

Data set meta-feature calculation stage: data set meta-features are calculated from the data set to be analyzed, including the data dimension (Dimension), the number of features (NumberOfFeatures), the maximum cardinality of the categorical features (MaxCardinalityOfCategoricalFeatures), the maximum quartile skewness of the numeric features (QuartileSkewnessOfNumericFeatures), the mean feature value (meannshielding ofnumericfeatures), the mean numeric attribute entropy (MeanNumericAttributeEntropy), the ratio of missing values (RatioOfMissingValues), the ratio of features having missing values (ratio offeatthreshold values), and the like.
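
By way of non-limiting illustration, a few of these meta-features can be computed with the Python standard library as sketched below. The function name, the snake_case keys, the convention that an empty string or None marks a missing value, and the sample records are assumptions of this sketch; the remaining meta-features (skewness, entropy and so on) are omitted.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def is_missing(value: Optional[str]) -> bool:
    # Assumption: None or an empty string marks a missing value.
    return value is None or value == ""


def meta_features(rows: List[Row], categorical: List[str]) -> Dict[str, float]:
    """Compute a small subset of data set meta-features."""
    columns = list(rows[0].keys())
    missing_cells = sum(is_missing(r[c]) for r in rows for c in columns)
    columns_with_missing = sum(
        any(is_missing(r[c]) for r in rows) for c in columns
    )
    max_cardinality = max(
        (len({r[c] for r in rows if not is_missing(r[c])}) for c in categorical),
        default=0,
    )
    return {
        "number_of_features": len(columns),
        "max_cardinality_of_categorical_features": max_cardinality,
        "ratio_of_missing_values": missing_cells / (len(rows) * len(columns)),
        "ratio_of_features_with_missing_values": columns_with_missing / len(columns),
    }


# Hypothetical records, invented for illustration.
records = [
    {"station": "A", "vehicle_type": "car", "flow": "120"},
    {"station": "B", "vehicle_type": "", "flow": "95"},
    {"station": "A", "vehicle_type": "truck", "flow": ""},
    {"station": "C", "vehicle_type": "car", "flow": "130"},
]
features = meta_features(records, categorical=["station", "vehicle_type"])
```

For the four sample records, two of the twelve cells are missing and two of the three columns contain a missing value.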

Game model initialization stage: a game model is initialized from the above information. The initialized game model comprises two Monte Carlo tree search (MCTS) models, which act as the game participants, and a long short-term memory (LSTM) neural network model.

It should be noted that MCTS is a tree-based search method that balances exploration and exploitation; it remains effective when the search space is huge and can therefore be used to address the excessively large service search space. The basic idea is as follows: upper confidence bound (UCB) values are obtained for most nodes through repeated simulation, and in each subsequent simulation the nodes most worth exploiting and exploring are selected according to their UCB values, so that, with a huge search space and limited computing power, this heuristic search concentrates on promising nodes and finds better ones with higher probability. Each MCTS model includes an initial tree structure in which the root node is the initial node, each first-level child node of the root represents a possible first-step service in the data analysis flow, each second-level child node represents a possible second-step service, and so on. The depth of the tree is the maximum length of the data analysis flow, and its value can be set according to experience with historical data.
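
By way of non-limiting illustration, the tree structure described above could be held in a node type such as the following Python sketch. The class name `Node`, its fields and methods, and the three service names used in the example are hypothetical; the `visits` and `quality` fields are included on the assumption that each node stores the access count and quality value used by the search.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    service: Optional[str]  # None for the root (initial) node
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: Dict[str, "Node"] = field(default_factory=dict)
    visits: int = 0         # access count of this node
    quality: float = 0.0    # quality value of this node

    def child(self, service: str) -> "Node":
        """Return the child for a service, creating it on first use."""
        if service not in self.children:
            self.children[service] = Node(service, parent=self)
        return self.children[service]

    def depth(self) -> int:
        """Number of steps from the root, i.e. the step index in the flow."""
        return 0 if self.parent is None else 1 + self.parent.depth()

    def flow(self) -> List[str]:
        """Services along the path from the root to this node."""
        path: List[str] = []
        node: Optional[Node] = self
        while node is not None and node.parent is not None:
            path.append(node.service or "")
            node = node.parent
        return path[::-1]


root = Node(None)
leaf = root.child("SimpleImputer").child("Normalizer").child("SVC")
```

Reading the path from the root to a node yields the data analysis flow built so far, and the node depth corresponds to the step number within that flow.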

The LSTM model is a special recurrent neural network (RNN). A single recurrent unit of the LSTM holds four states; compared with a plain RNN, the LSTM maintains a persistent cell state that is passed along continuously and is used to decide which information to forget and which to carry forward. The LSTM model generated at initialization consists of an LSTM layer and two fully connected layers and is used in the game stage to predict the service selection probability at each step of the flow. The LSTM layer includes an input layer, a hidden layer and an output layer. The input layer receives the meta-features of the data set to be analyzed, the data analysis task description information, and the current state of the MCTS tree (comprising the service node selected at each step starting from the root node of the MCTS tree and the selection probability of the service node at each step). The hidden layer is composed of a plurality of neural units, each of which computes its current output from the current input and the output at the previous time step. The output layer gives the LSTM prediction, which is dimension-transformed by the fully connected layers to obtain the probability of selecting a given service node in the current state of the MCTS tree.

Game stage: in the game stage, the game model plays against itself (self-play). Nodes of the MCTS tree, each corresponding to a data analysis service, are selected in turn, so that the path from the root node of the tree to a leaf node forms a data analysis flow.

According to one embodiment of the present invention, the game stage comprises a number of cycles, which may be set based on historical recommendation experience. In each cycle, the following steps are performed:

a) The existing LSTM model, LSTM1, is configured as the prediction model of the first MCTS participant, MCTS1; when the cycle is executed for the first time, this is the LSTM model obtained in the game model initialization stage;

b) The first MCTS participant performs multiple MCTS searches to obtain a plurality of candidate data analysis flows and their performance scores, and the LSTM model is trained on the obtained candidate flows together with the meta-features of the data set to be analyzed to obtain a new LSTM model, which serves as the prediction model of the second MCTS participant;

c) The game starts: participant 1 and participant 2 play against each other, each MCTS participant performing one MCTS search, and the search results are compared. The participant that first generates an available data analysis flow wins the game; the winner's LSTM model is retained and set as the existing LSTM model for the next cycle, and the available data analysis flow generated by the winner is taken as the search result of this cycle.
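
By way of non-limiting illustration, the control flow of steps a) to c) can be sketched in Python as follows. The function `self_play` and its callable parameters `explore`, `train` and `duel` are hypothetical placeholders for the MCTS searches, the LSTM training and the game between the two participants; the stub implementations at the end exist only so that the sketch runs and do not model any real search or training.

```python
from typing import Any, Callable, List, Tuple

Flow = Tuple[List[str], float]  # (services in order, performance score)


def self_play(
    n_cycles: int,
    initial_model: Any,
    explore: Callable[[Any], List[Flow]],
    train: Callable[[Any, List[Flow]], Any],
    duel: Callable[[Any, Any], Tuple[Any, Flow]],
) -> Tuple[Any, List[Flow]]:
    """Run the game stage; return the final model and one flow per cycle."""
    model = initial_model                    # step a): existing prediction model
    results: List[Flow] = []
    for _ in range(n_cycles):
        samples = explore(model)             # step b): several MCTS searches
        challenger = train(model, samples)   # step b): new model for participant 2
        model, flow = duel(model, challenger)  # step c): winner's model is kept
        results.append(flow)                 # winner's flow is the cycle result
    return model, results


# Toy stand-ins: a "model" is just an integer version number here.
final_model, flows = self_play(
    n_cycles=3,
    initial_model=0,
    explore=lambda m: [(["svc%d" % m], float(m))],
    train=lambda m, samples: m + 1,
    duel=lambda a, b: (b, (["svc%d" % b], float(b))),
)
```

Passing the three stages in as callables keeps the loop itself independent of how the searches and the training are implemented.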

In steps b) and c), an MCTS search proceeds as follows:

Starting from the root node of the MCTS tree, a comprehensive score is calculated in turn for each data analysis service in the data analysis service set. The data analysis services whose comprehensive score exceeds a default threshold (which can be determined from accumulated historical experimental data) are taken as the available services of the step and form an available service set. If the available service set is not empty, the data analysis service with the highest comprehensive score is selected from it, and the search continues with the subsequent nodes of that node. When the maximum number of steps of the flow is reached or the calculated available service set is empty, the MCTS search ends and the access count and quality value of each node are back-propagated up the tree structure for use in the next calculation of comprehensive scores. At this point an available data analysis flow has been obtained, running from the root node to the end node. The flow is executed, its performance score is calculated from the execution result, and the data analysis flow and its performance score are recorded in the candidate data analysis flow set used for subsequent single-step service recommendation.

According to an example of the present invention, an MCTS search is described for a tree depth of 3, as shown in FIG. 3. The search starts from the root node T0 and calculates the comprehensive score of each data analysis service in turn. Four available services are obtained for the first-level child nodes, of which the child node T12 has the highest comprehensive score; the data analysis service corresponding to T12 is therefore selected as the data analysis service of the first step, and the search continues from T12. Among the subsequent nodes of T12 there are three available services, represented by the child nodes T21, T22 and T23, of which T21 has the highest comprehensive score; the data analysis service corresponding to T21 is selected as the data analysis service of the second step, and the search continues from T21. Among the subsequent nodes of T21 there are three available services, represented by the child nodes T31, T32 and T33, of which T32 has the highest comprehensive score; the data analysis service corresponding to T32 is selected as the data analysis service of the third step. Because the depth of the tree is 3, the search ends once the third-step service has been selected, and the data analysis flow found by this search consists of the data analysis services along the path T12-T21-T32.
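
By way of non-limiting illustration, the selection part of one such search (threshold filtering followed by choosing the highest-scoring service at each level) can be sketched in Python as follows. The function `search_once`, the score table and all numeric scores are invented for this sketch, as are the node labels T11, T13 and T14; back-propagation and flow execution are omitted.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def search_once(
    services: Sequence[str],
    score: Callable[[List[str], str], float],
    max_depth: int,
    threshold: float,
) -> List[str]:
    """Pick, level by level, the highest-scoring service above the threshold."""
    flow: List[str] = []
    while len(flow) < max_depth:
        available = [
            (score(flow, s), s) for s in services if score(flow, s) > threshold
        ]
        if not available:      # empty available service set ends the search
            break
        flow.append(max(available)[1])
    return flow


# Invented comprehensive scores, keyed by the flow selected so far.
TABLE: Dict[Tuple[str, ...], Dict[str, float]] = {
    (): {"T11": 0.4, "T12": 0.9, "T13": 0.5, "T14": 0.6},
    ("T12",): {"T21": 0.8, "T22": 0.3, "T23": 0.7},
    ("T12", "T21"): {"T31": 0.5, "T32": 0.9, "T33": 0.6},
}
ALL_NODES = [name for level in TABLE.values() for name in level]


def table_score(flow: List[str], service: str) -> float:
    return TABLE.get(tuple(flow), {}).get(service, 0.0)


path = search_once(ALL_NODES, table_score, max_depth=3, threshold=0.2)
```

With the invented scores, the selected path reproduces the T12, T21, T32 sequence of the example above.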

When the comprehensive score of each service is calculated, an upper confidence bound (UCB) formula is used:

T(l, a) is the service association constraint strength between the preceding and following data analysis services, where l is the service node preceding the current node in the flow and a is the selected service. The service association relations are obtained by mining historical data analysis flows, and each relation consists of the following elements: L_i = <sourceID, targetID, consType, consValue>, where sourceID is the unique identifier of the source service, targetID is the unique identifier of the target service, consType is the association constraint category between the two services (one of strong association, weak association and no association), and consValue is the specific constraint value. When the association constraint category between two services is strong association, T(l, a) is 1; when it is weak association, T(l, a) is the constraint value; and when it is no association, T(l, a) is 0. According to an example of the present invention, the service association relation mining flow includes: (1) acquiring the text files of the historical data analysis flows to be mined, which contain the structure description and performance data of each flow; (2) counting, from the historical data analysis flows, the number of times each service appears immediately after each other service; (3) counting the total number of times each service appears in the flows; (4) calculating the constraint category and constraint value between each pair of services. For example, mining the service association relations of historical data analysis flows yields the following relations for the simple data imputation service (SimpleImputer):

"SimpleImputer", "Binarizer", "Weak Association", "0.1198"

"SimpleImputer", "KBinsDiscretizer", "Weak Association", "0.1053"

"SimpleImputer", "KernelCenterer", "Weak Association", "0.1130"

"SimpleImputer", "Normalizer", "Weak Association", "0.1189"

"SimpleImputer", "OneHotEncoder", "Weak Association", "0.0204"

"SimpleImputer", "OrdinalEncoder", "Weak Association", "0.0272"

"SimpleImputer", "PowerTransformer", "Weak Association", "0.1798"

"SimpleImputer", "QuantileTransformer", "Weak Association", "0.1139".

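By way of non-limiting illustration, mining steps (2) to (4) and the lookup of T(l, a) can be sketched in Python as follows. The text does not state how the constraint value or the category is computed, so this sketch assumes that the constraint value is the number of times the target immediately follows the source divided by the total number of occurrences of the source, and that a value at or above a hypothetical cut-off `strong_at` counts as strong association. The function names and the sample flows are likewise invented.

```python
from collections import Counter
from typing import Dict, List, Tuple

Table = Dict[Tuple[str, str], Tuple[str, float]]


def mine_associations(flows: List[List[str]], strong_at: float = 0.5) -> Table:
    """Build <sourceID, targetID> -> (consType, consValue) from past flows."""
    follows: Counter = Counter()  # step (2): target directly after source
    totals: Counter = Counter()   # step (3): occurrences of each service
    for flow in flows:
        totals.update(flow)
        follows.update(zip(flow, flow[1:]))
    table: Table = {}
    for (source, target), count in follows.items():
        value = count / totals[source]                   # assumed definition
        kind = "strong" if value >= strong_at else "weak"  # assumed cut-off
        table[(source, target)] = (kind, value)
    return table


def constraint_strength(table: Table, l: str, a: str) -> float:
    """T(l, a): 1 for strong, the constraint value for weak, 0 for none."""
    kind, value = table.get((l, a), ("none", 0.0))
    if kind == "strong":
        return 1.0
    return value if kind == "weak" else 0.0


# Hypothetical historical flows, invented for illustration.
history = [
    ["SimpleImputer", "Normalizer", "SVC"],
    ["SimpleImputer", "Normalizer", "LogisticRegression"],
    ["SimpleImputer", "OneHotEncoder", "SVC"],
    ["SimpleImputer", "PowerTransformer", "SVC"],
]
associations = mine_associations(history)
```

Service pairs that never occur consecutively in the history are simply absent from the table and are treated as having no association.
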
Q(s, a) is the quality value of the current node. Its initial value is 0, and it is updated during MCTS back-propagation by the following formula, where v is a preset parameter of the MCTS model:

Q(s, a) = (N(s, a) * Q(s, a) + v) / (1 + N(s, a))

P(s, a) is the probability of selecting the service, as predicted by the LSTM model. The prediction takes as input the meta-features of the data set to be analyzed, the data analysis task description information and the current state of the MCTS tree (the service node selected at each step starting from the root node of the MCTS tree and the selection probability of the service node at each step), and outputs the probability of selecting a given service node in that state;

N(s) is the number of MCTS searches;

N(s, a) is the number of times the service has been accessed;

w and c are weights determined from accumulated historical experimental data.
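
The comprehensive score formula itself is not reproduced in this text. Purely as an illustration of how the terms defined above might interact, the following Python sketch assumes a PUCT-style combination of the kind used in AlphaZero-like searches, with the association strength added as a weighted term. The functional form, the function name and the sample numbers are all assumptions of this sketch and should not be read as the formula of the invention.

```python
import math


def comprehensive_score(
    t: float,      # T(l, a): service association constraint strength
    q: float,      # Q(s, a): quality value of the node
    p: float,      # P(s, a): selection probability predicted by the LSTM
    n_s: int,      # N(s): number of MCTS searches
    n_sa: int,     # N(s, a): number of times the service has been accessed
    w: float = 1.0,
    c: float = 1.0,
) -> float:
    """Assumed form: w*T + Q + c*P*sqrt(N(s)) / (1 + N(s, a))."""
    exploration = c * p * math.sqrt(n_s) / (1 + n_sa)
    return w * t + q + exploration


# Invented inputs, for illustration only.
example = comprehensive_score(t=1.0, q=0.5, p=0.4, n_s=16, n_sa=3, w=0.5, c=2.0)
```

In a form of this kind the last term shrinks as a service is accessed more often, so rarely visited services with a high predicted probability are still explored, while w controls how strongly historical service associations influence the choice.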

After the game is completed, the winner's LSTM model is retained and used to guide the MCTS service selection in the next cycle, so that continuous self-optimization is achieved by means of reinforcement learning.

2) Single step service recommendation:

For the set of data analysis flows with performance scores generated in the preceding stage, data analysis flows whose score is below a given performance threshold are first eliminated and the well-performing flows are kept. The remaining flows are then sorted by performance score from high to low and processed in turn as follows:

initializing a single-step service recommendation result set;

extracting the first-step service node of the flow, and judging whether the service already exists in the single-step service recommendation result set;

if not, adding the recommended service together with the corresponding flow and flow performance score to the single-step service recommendation result set in the form <recommended service, <flow, flow performance score>>;

if the recommended service already exists, adding the flow and its performance score to the set of related flows and performance scores of that recommended service;

and after all flows have been processed, returning the single-step service recommendation result set for selection.
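
As a non-limiting illustration, the filtering, sorting and grouping steps above could be sketched as follows. The tuple layout of a flow and the function name are hypothetical, and a flow is assumed to be kept when its score is greater than or equal to the threshold.

```python
from typing import Dict, List, Tuple

Flow = Tuple[str, List[str], float]  # (flow id, ordered services, performance score)
Recommendations = Dict[str, List[Tuple[str, float]]]


def recommend_first_step(flows: List[Flow], threshold: float) -> Recommendations:
    """Group well-performing flows by their first-step service."""
    # Eliminate flows below the performance threshold, then sort from high to low.
    kept = [flow for flow in flows if flow[2] >= threshold]
    kept.sort(key=lambda flow: flow[2], reverse=True)

    result: Recommendations = {}  # initialize the single-step recommendation result set
    for flow_id, services, score in kept:
        if not services:
            continue  # an empty flow has no first step to recommend
        # Create the entry if the service is new, otherwise extend its flow list.
        result.setdefault(services[0], []).append((flow_id, score))
    return result
```

Because the flows are processed in descending score order, the flows listed under each recommended service are also ordered from best to worst.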

In step S3, the data analysis flow segment constructed by the user is obtained, and the set of subsequent services for that segment is calculated in order to make a subsequent service recommendation. At this point a flow segment has already been constructed for the data set to be analyzed, so single-step service recommendation is performed from the data analysis service set starting at the step after the constructed flow segment. The constructed flow segment may consist of data analysis services selected from the single-step service recommendation result set, or of other services chosen by the user. If a system recommendation is needed again while the flow is being constructed, the subsequent single-step service is recommended according to the constructed flow segment; this again comprises the two stages of data analysis flow generation and single-step service recommendation.

In the data analysis flow generation stage, data analysis flows are calculated from the data set to be analyzed, the data analysis task description information and the constructed data analysis flow segment, and a group of optional data analysis flows is generated. This stage comprises two sub-stages, game model updating and the game itself:

In the game model updating stage, the MCTS tree structures of the two participants of the game model are updated according to the received data analysis flow segment: starting from the root node, only the edges leading to the service nodes selected in the flow segment are retained and the other branches are cut, so that finally only the subtree rooted at the last service node of the flow segment is kept. This avoids repeated calculation, speeds up the search and improves the efficiency of data analysis. Taking the example of fig. 3 again, a data analysis flow of depth 3: if the constructed flow segment consists of the data analysis services corresponding to child nodes T12-T21, then when a subsequent service needs to be recommended, the updated game model searches directly from child node T21 rather than from the root node T0; if the constructed flow segment consists of other services selected by the user for the first and second steps, then when a subsequent service needs to be recommended, the updated game model searches directly from the child nodes one level below the second-level child node.
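
As a non-limiting illustration, the pruning of the MCTS tree along a constructed flow segment could be sketched as follows. The Node class and the reroot function are hypothetical. Creating a fresh node when the user picked a service that the tree has not yet explored is an assumption made here so that the search can still start below that service.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    service: str
    children: Dict[str, "Node"] = field(default_factory=dict)
    visits: int = 0
    quality: float = 0.0


def reroot(root: Node, segment: List[str]) -> Node:
    """Keep only the path named by `segment` and return the node the search resumes from."""
    node = root
    for service in segment:
        child = node.children.get(service)
        if child is None:
            # Assumption: a service chosen outside the explored tree gets a fresh node.
            child = Node(service)
        node.children = {service: child}  # cut all sibling branches at this level
        node = child
    return node  # the subtree below the last service of the segment is kept intact
```

With a tree shaped like the fig. 3 example, calling reroot with the segment T12, T21 returns node T21, whose already explored children and statistics remain available to the next search.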

The game stage is similar to that of the first recommendation, except that the first MCTS participant is configured with the winner's LSTM model produced in the previous recommendation, so that existing results are reused and optimization continues. The winner's LSTM model is again retained after the game is completed.

In the single-step service recommendation stage, for the set of data analysis flows with performance scores generated in the preceding stage, data analysis flows whose score is below a given performance threshold are first eliminated and the well-performing flows are kept. The remaining flows are then sorted by performance score from high to low and processed in turn as follows:

initializing a single-step service recommendation result set;

extracting the next service node after the constructed segment in the flow, and judging whether the service already exists in the single-step service recommendation result set;

if not, adding the recommended service together with the corresponding flow and flow performance score to the single-step service recommendation result set;

if the recommended service already exists, adding the flow and its performance score to the set of related flows and performance scores of that recommended service.
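
As a non-limiting illustration, extracting the next service node after the constructed segment could be sketched as follows. The tuple layout and the function name are hypothetical, and skipping flows that do not begin with the constructed segment is a defensive assumption, since flows generated from the updated game model are expected to extend the segment.

```python
from typing import Dict, List, Tuple

Flow = Tuple[str, List[str], float]  # (flow id, ordered services, performance score)
Recommendations = Dict[str, List[Tuple[str, float]]]


def recommend_next_step(flows: List[Flow], segment: List[str],
                        threshold: float) -> Recommendations:
    """Group well-performing flows by the service that follows `segment`."""
    position = len(segment)  # index of the first step after the constructed segment
    result: Recommendations = {}
    for flow_id, services, score in sorted(flows, key=lambda f: f[2], reverse=True):
        if score < threshold:
            continue  # eliminated by the performance threshold
        # Assumption: ignore flows that do not extend the constructed segment.
        if services[:position] != segment or len(services) <= position:
            continue
        result.setdefault(services[position], []).append((flow_id, score))
    return result
```

With an empty segment the position is 0, so the same function reduces to the first-step recommendation.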

For each subsequent service selection after the constructed data analysis flow segment, the steps described above may be repeated to recommend the subsequent service one step at a time, until the last step of the data analysis flow is completed.

According to an embodiment of the present invention, as shown in fig. 4, there is provided a data analysis service recommendation system for constructing a data analysis flow. The system comprises a service recommendation interface module, a data analysis flow generation module, a single-step service recommendation module, a data analysis service management module and a service association management module.

The service recommendation interface module is used for receiving a data set to be analyzed, data analysis task description information and a flow segment constructed by a user and returning a single-step service recommendation result.

The data analysis flow generation module is used for generating available data analysis flows according to the data set to be analyzed, the data analysis task description information and the flow segment constructed by the user. It comprises a game model setting module, which is used for initializing or updating the game model, and a game module, which is used for executing the game model and generating a set of optional data analysis flows.

The single-step service recommendation module is used for recommending a group of services, which form a single-step recommended service set, according to the available data analysis flow set and the flow segment constructed by the user.

The data analysis service management module is used for managing service information related to data analysis.

The service association relation management module is used for managing the association relations between data analysis services. The service association relations are obtained by mining historical data analysis flows and are stored in this module to guide the generation of data analysis flows.

The technical scheme of the embodiment of the invention can have the following beneficial effects:

By providing instant service recommendation to users while a data analysis flow is being constructed, the invention guides users through decision analysis, so that ordinary business users without data analysis expertise can conveniently use data analysis services to meet their business requirements. The personalized requirements of users are better met, the difficulty of data analysis is greatly reduced, labor cost is effectively saved, and the quality and efficiency of data analysis are improved.

The application of the recommendation method and system of the present invention is described in detail below with reference to a specific example. The traffic control department of a certain province wants to alleviate traffic congestion through traffic flow prediction, and its business personnel aim to build a model for predicting traffic flow. For these business personnel, selecting and training appropriate services and optimizing parameters for a traffic data set is very laborious, and satisfactory final results often require repeated exploration during the data analysis process. With the data analysis service recommendation method and system, a user can send requirements to the service recommendation interface and construct a data analysis flow interactively with the intelligent assistance of the system:

1. Acquiring the data set to be analyzed and the data analysis task description information: first, the system receives a user-specified data set in csv format from a high-speed toll station (the data attributes are shown in table 1), together with the description information of the data analysis task: the data analysis task type is regression, the upper limit of the running time is 1 hour, the task target data features are the time and traffic flow attributes, the performance evaluation index is mean square error Mean Absolute Error, and the data set division strategy is the default (the default ratio of training set to test set is 6:4).

TABLE 1

Field name	Description
CARDID	MTC card number
CARDNETWORK	Staff number
ENTRYLANE	Entry lane
ENTRYSTATION	Entry station ID
ENTRYTIME	Entry time
ETCCPUID	ETC card number
EXITLANE	Exit lane
EXITSTATION	Exit station ID
EXITTIME	Exit time
VEHICLECLASS	Vehicle class
VEHICLELICENSE	License plate number
VEHICLETYPE	Vehicle type
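
As a non-limiting illustration, the task description information received in this step and the default 6:4 division could be represented as follows. The class TaskDescription, its field names and the helper split_rows are hypothetical and do not appear in the original text. The metric string and the target feature names in the usage lines are placeholders chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class TaskDescription:
    task_type: str              # e.g. "regression"
    time_limit_seconds: int     # upper limit of the running time
    target_features: List[str]  # task target data features
    metrics: List[str]          # performance evaluation indices
    train_ratio: float = 0.6    # default division: training set to test set is 6:4


def split_rows(rows: Sequence, train_ratio: float = 0.6) -> Tuple[Sequence, Sequence]:
    """Divide rows into a training part and a test part without shuffling."""
    cut = round(len(rows) * train_ratio)
    return rows[:cut], rows[cut:]


if __name__ == "__main__":
    task = TaskDescription(
        task_type="regression",
        time_limit_seconds=3600,                    # 1 hour
        target_features=["time", "traffic_flow"],   # placeholder names
        metrics=["mean_squared_error"],             # placeholder metric string
    )
    train, test = split_rows(list(range(10)), task.train_ratio)
    print(len(train), len(test))
```

Splitting without shuffling keeps the records in time order, which is a common choice for traffic flow data but is not something the text prescribes.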

2. Recommending a group of data analysis services to the user by combining the data characteristics and the service associations, together with the data analysis flows and performance values related to each service: this comprises two steps, data analysis flow generation and single-step service recommendation.

1) Data analysis flow generation

An example of a data analysis flow for high-speed traffic flow prediction generated by an MCTS search is shown in table 2 below. Executing the flow gives a predicted high-speed traffic flow of 283742 for a certain place in a certain month; compared with the test set, the mean square error performance score is calculated as 0.8298058131123824.

TABLE 2

After a plurality of game cycles, a set of optional data analysis flows and corresponding performance scores are generated as shown in table 3:

TABLE 3

2) Single step service recommendation:

Taking the data analysis flows p1, p2, p3 and p4 as input, the first-step service node of each flow is extracted to form the single-step service recommendation result set <recommended service, <flow, flow performance score>>:

"simple data filling", [{p1, 0.8298058131123824}, {p3, 0.5638540471504101}]

"missing value indicator", [{p2, 0.788058131123556}, {p4, 0.4868540471504120}].

The business personnel can select the first-step data analysis service from the single-step service recommendation result set to construct a data analysis flow segment.

3. After the data analysis flow segment constructed by the user is obtained, recommending again: after checking the recommendation result, the traffic manager can select a recommended service from the recommendation result set, or directly select a data analysis flow corresponding to a recommended service. For example, after viewing the recommendation result, the user selects the "simple data filling" service from the recommendation result set and then selects the "sequential encoding" service as the second-step service, thereby constructing a flow segment containing these two services. If the user wants the system to continue recommending subsequent services, the service recommendation interface can be called again and the constructed flow segment sent to the service recommendation system.

After acquiring the flow segment constructed by the user, the service recommendation system can recommend to the user the subsequent single-step services and subsequent flows of that segment. In the data analysis flow generation step, the data set, the data analysis task description information and the constructed flow segment are taken as inputs to calculate data analysis flows; this step comprises the two stages of game model updating and the game itself, and yields a set of data analysis flows with performance scores. Assume that the available data analysis flows generated in the re-recommendation are as shown in table 4 below:

TABLE 4

In the single-step service recommendation step, the data analysis flows with performance scores and the constructed flow segment are taken as inputs, and the next service node after the constructed segment in each flow is extracted to form the single-step service recommendation result set <recommended service, <flow, flow performance score>>:

"logistic regression", [{p1, 0.8378058131123652}]

"ARD regression", [{p2, 0.6138540471504308}]

"Gaussian Process", [{p3, 0.45068947368421941}]

After receiving the recommendation result set returned by the system, the user can select a service from it and continue to construct the data analysis flow until a data analysis result that satisfies the user is obtained. Therefore, with the recommendation method and system, business personnel can construct a data analysis flow quickly and accurately without professional data analysis knowledge.

Although the present invention has been described by way of the above embodiments, the present invention is not limited to the embodiments described herein, but includes various changes and modifications made without departing from the scope of the invention.

It should be noted that, although the steps are described above in a specific order, this does not mean that they must be performed in that order; in fact, some of the steps may be performed concurrently or in a different order, as long as the required functions are achieved.

The present invention may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present invention.

The computer readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include: a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions stored thereon, and any suitable combination of the foregoing.

The foregoing description of embodiments of the invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. A data analysis service recommendation method for constructing a data analysis flow, for performing data analysis service recommendation for the data analysis flow for constructing a data set to be analyzed, wherein the data analysis flow includes a plurality of single-step data analysis services that are sequentially executed, the data analysis service recommendation method comprising:

S1, acquiring a data set to be analyzed and data analysis task description information corresponding to the data set to be analyzed;

S2, in response to a first recommendation request for the data analysis service, performing first-step single-step service recommendation for the data set to be analyzed from a data analysis service set according to the data analysis task description information corresponding to the data set to be analyzed and the characteristics of the data set to be analyzed, to obtain a first-step single-step service recommendation set, wherein the data analysis service set is the set formed by all data analysis services for constructing a data analysis service flow, and the first-step single-step service recommendation is performed in the following manner:

S21, generating an available data analysis flow set and performance scores of each data analysis flow based on the data analysis service set according to data analysis task description information corresponding to the data set to be analyzed and the characteristics of the data set to be analyzed; wherein the set of available data analysis flows is generated by:

S211, extracting the data set to be analyzed and calculating the meta-features of the data set;

S212, based on the meta-features of the data set to be analyzed, taking the data analysis service corresponding to each step in the data analysis flow as a node, and using a game model to perform multiple cyclic searches to obtain a plurality of data analysis flows that form the available data analysis flow set, wherein each search generates one available data analysis flow and obtains the performance score of that flow;

S22, selecting the data analysis flows whose comprehensive score is higher than a preset threshold value from the available data analysis flow set, and selecting the data analysis services corresponding to the first step in all selected data analysis flows to form a first-step single-step recommendation service set.

2. The data analysis service recommendation method for constructing a data analysis flow according to claim 1, wherein the recommendation method further comprises:

S3, responding to re-recommending requirements of the data analysis service, and performing single-step service recommendation from the data analysis service set to the data set to be analyzed from the next step of the constructed flow segment according to data analysis task description information corresponding to the data set to be analyzed, characteristics of the data set to be analyzed and the constructed data analysis flow segment.

3. The data analysis service recommendation method for constructing a data analysis flow according to claim 2, wherein the data analysis task description information corresponding to the data set to be analyzed at least includes: a data analysis task type, an upper limit of the running time, a task target data feature set, a performance evaluation index set and a data set division strategy.

4. The data analysis service recommendation method for constructing a data analysis flow according to claim 1, wherein each loop in step S212 comprises:

S2121, initializing a game model, wherein the game model comprises a first Monte Carlo tree search model, a second Monte Carlo tree search model and a first LSTM prediction model connected with the first Monte Carlo tree search model;

S2122, taking a data analysis service corresponding to each step in the data analysis flow as a node, performing multiple searches by using a first Monte Carlo tree search model and a first LSTM prediction model to obtain a plurality of data analysis flows and performance scores corresponding to each data analysis flow, and training the first LSTM prediction model by using the obtained plurality of data analysis flows and meta-features of a data set to be analyzed to obtain a second LSTM prediction model;

S2123, taking the first Monte Carlo tree search model and the first LSTM prediction model as a first participant and the second Monte Carlo tree search model and the second LSTM prediction model as a second participant, performing a game search and comparing the search results, taking the LSTM prediction model of the participant that first finds an available data analysis flow as the first LSTM prediction model of the next cycle, and taking that first-found available data analysis flow as the available data analysis flow found in this search.

5. The data analysis service recommendation method for constructing a data analysis flow according to claim 4, wherein when the first monte carlo tree search model and the first LSTM prediction model, the second monte carlo tree search model and the second LSTM prediction model perform a game search, a composite score of each data analysis service is sequentially calculated from a root node of the monte carlo tree based on a data analysis service set, wherein:

for each level of child node, the data analysis services whose composite score exceeds a default threshold are taken as the available services for that level to form an available service set; the data analysis service with the highest composite score in the available service set of the current-level child node is selected as the data analysis service of that node, and the search continues with the subsequent nodes until the data analysis flow is finished; and after the search is finished, the visit count and the quality value of each node in the search are back-propagated upwards along the Monte Carlo tree structure, an available data analysis flow is generated from the data analysis services selected at each level of nodes, and the flow is executed to obtain its performance score.

6. The data analysis service recommendation method for constructing a data analysis flow according to claim 5, wherein the composite score of each data analysis service is calculated by:

wherein T(l, a) is the service association constraint strength between the data analysis services corresponding to the preceding and following nodes, l is the node immediately before the current node in the flow, and a is the data analysis service selected for the current node;

Q(s, a) is the quality value of the current node;

P(s, a) is the probability of selecting the service, as predicted by the LSTM model;

N(s) is the number of MCTS searches;

N(s, a) is the number of times this service has been visited;

s represents the current state of the Monte Carlo tree, and w and c are weights determined empirically from accumulated historical experimental data.

7. The data analysis service recommendation method for constructing a data analysis flow according to claim 6, wherein the service association constraint strength between the data analysis services corresponding to the front and rear nodes is obtained by analyzing historical data analysis flow mining, wherein:

8. The data analysis service recommendation method for constructing a data analysis flow according to claim 7, wherein the quality value Q(s, a) of the current node is calculated by:

Q(s, a) = (N(s, a) * Q(s, a) + v) / (1 + N(s, a))

where v is a preset parameter of the Monte Carlo tree search model.

9. The data analysis service recommendation method for constructing a data analysis flow according to claim 8, wherein single step service recommendation is performed from the next step of the constructed flow segment by the same method as the first step single step service recommendation, wherein, in the game search, a composite score of each data analysis service is sequentially calculated based on the data analysis service set from a node corresponding to the last service of the constructed flow segment in the monte carlo tree, wherein:

10. A data analysis service recommendation system for constructing a data analysis flow, the recommendation system comprising:

the service recommendation interface module is used for receiving a data set to be analyzed, a data analysis task and a data analysis flow fragment constructed for the data set to be analyzed and returning a service recommendation result;

the data analysis flow generating module is used for generating an available data analysis flow set and performance scores of each data analysis flow based on the data analysis service set according to the data analysis task description information corresponding to the data set to be analyzed, the characteristics of the data set to be analyzed and the constructed data analysis flow fragments; wherein the data analysis flow generation module is configured to generate a set of available data analysis flows by:

extracting the data set to be analyzed and calculating the meta-features of the data set;

based on meta-characteristics of a data set to be analyzed, taking a data analysis service corresponding to each step in the data analysis flow as a node, and adopting a game model to perform multiple-time circular searches to obtain a set of available data analysis flows formed by multiple data analysis flows, wherein each search generates an available data analysis flow and obtains performance scores of the flows;

the single-step service recommendation module is used for selecting a data analysis flow with the comprehensive score higher than a preset threshold value from the available data analysis flow sets, and selecting data analysis services corresponding to the next step of the constructed data analysis flow fragments in all the selected data analysis flows to generate a single-step recommendation service set of the step;

the data analysis service management module is used for managing service information related to data analysis;

and the service association relation management module is used for managing the association relation between the data analysis services.

11. A computer readable storage medium having embodied thereon a computer program executable by a processor to perform the steps of the method of any of claims 1 to 9.

12. An electronic device, comprising:

one or more processors;

storage means for storing one or more programs which, when executed by the one or more processors, cause the electronic device to perform the steps of the method of any of claims 1-9.