CN107688663B - Method for forming loop-free data analysis queue and big data support platform comprising loop-free data analysis queue - Google Patents

Method for forming loop-free data analysis queue and big data support platform comprising loop-free data analysis queue

Info

Publication number
CN107688663B
Authority
CN
China
Prior art keywords
operator
invalid
merging
task flow
operators
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710847663.0A
Other languages
Chinese (zh)
Other versions
CN107688663A (en)
Inventor
高英
成昱霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201710847663.0A priority Critical patent/CN107688663B/en
Publication of CN107688663A publication Critical patent/CN107688663A/en
Application granted granted Critical
Publication of CN107688663B publication Critical patent/CN107688663B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/22 Indexing; Data structures therefor; Storage structures
    • G06F 16/2228 Indexing structures
    • G06F 16/2237 Vectors, bitmaps or matrices
    • G06F 16/24 Querying
    • G06F 16/242 Query formulation
    • G06F 16/2433 Query languages
    • G06F 16/245 Query processing
    • G06F 16/2457 Query processing with adaptation to user needs
    • G06F 16/24578 Query processing with adaptation to user needs using ranking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer And Data Communications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for forming a loop-free data analysis queue and a big data support platform comprising the loop-free data analysis queue. The method for forming the loop-free data analysis queue comprises the following steps: S1, a user constructs a task flow through a user terminal and fills in each operator parameter used by the task flow; S2, a server receives the task flow and each operator parameter used by the task flow from the user terminal; S3, the server establishes an adjacency matrix M: the task flow is treated as a directed graph G whose nodes are the positions of the operators in the task flow, the number of operators in the task flow is N, and an N×N adjacency matrix is established; and S4, loop judgment is performed. The loop-free data analysis queue forming method and the big data support platform comprising it solve the prior-art problem that a closed loop built into a user's task flow causes an infinite loop when the bottom layer is invoked.

Description

Method for forming loop-free data analysis queue and big data support platform comprising loop-free data analysis queue
Technical Field
The invention relates to a data analysis platform, in particular to a loop-free data analysis queue forming method and a big data support platform comprising the loop-free data analysis queue forming method.
Background
Chinese patent CN105550268B discloses a big data flow modeling analysis engine, which comprises a platform layer, a task scheduling layer and an interface layer. The platform layer completes resource scheduling and allocation. The task scheduling layer comprises a verification module, an analysis module, a task scheduling module and an algorithm package: the verification module judges whether a data analysis flow conforms to the flow design rules, and only the parts that pass verification enter the analysis module; the analysis module converts the data analysis flow generated by the interface layer into an executable data analysis flow task; the task scheduling module schedules the data analysis algorithm interfaces in the algorithm package according to the complete data analysis flow produced by the analysis module, forms a complete executable analysis flow task program, and schedules bottom-layer resources to execute the data analysis program. The interface layer provides a platform interface for data analysis modeling operations: each data analysis algorithm package appears on the interface as a draggable component with a unique identifier; the user manipulates the algorithm components through the interface and connects them with directed lines that represent the data analysis flow directions and steps, combining them into a complete business data analysis algorithm model; a start function of the interface invokes the background task scheduling module and the algorithm packages and schedules resources to complete rapid analysis and processing of the data. Although this big data flow modeling analysis engine can, to a certain extent, process a large amount of data efficiently and quickly, it has the following disadvantage:
The big data flow modeling analysis engine lets a user create a data analysis flowchart, the platform then automatically generates the operator queue corresponding to the flowchart, the user inputs the data to be analyzed, and the data is processed by each operator in the operator queue in turn to finally obtain the analysis result. However, while the user creates the data analysis flowchart, a closed loop may well be introduced; when the data to be analyzed is then processed, computation inside the closed loop never terminates, an infinite loop occurs, a large amount of CPU capacity is occupied, and other computations slow down.
Disclosure of Invention
The invention provides a loop-free data analysis queue forming method and a big data support platform comprising it, and solves the prior-art problem that a closed loop built into a user's task flow causes an infinite loop when the bottom layer is invoked.
In order to achieve the purpose, the invention adopts the following technical scheme:
a loop-free data analysis queue forming method comprises the following steps:
s1, a user constructs a task flow through the user terminal and fills in each operator parameter used by the task flow;
s2, the server receives the task flow from the user terminal and each operator parameter used by the task flow;
s3, the server establishes an adjacency matrix M:
S31, setting the task flow as a directed graph G, setting the nodes of the directed graph G as the positions of the operators in the task flow, setting the number of operators in the task flow as N, and establishing an N×N adjacency matrix;
S32, for each pair of operators with a directed edge from one to the other, the following processing is performed: if there is a directed edge from an operator i1 to another operator j1, the value Mi1j1 in row i1, column j1 of the adjacency matrix M is assigned 1; and for each pair of operators without such a directed edge, the following processing is performed: if there is no directed edge from an operator i2 to another operator j2, the value Mi2j2 in row i2, column j2 of the adjacency matrix M is assigned 0;
S4, loop judgment:
S41, find a column i3 of M whose values are all 0, and find the operator opi3 corresponding to column i3;
S42, operator opi3 corresponds to row i3; find a column j3 whose value on row i3 is not 0, and find the operator opj3 corresponding to j3;
S46, delete row i3 and column i3 from M, and repeat steps S41-S42 until no column of matrix M has values that are all 0;
S47, the operators corresponding to the rows remaining in M are judged to be invalid.
Preferably,
after step S3 and before step S4, operator input number determination and operator parameter determination are performed;
the operator input number judgment specifically comprises the following steps: firstly, according to a task flow constructed by a user, counting the input quantity of each operator; then, comparing the counted input quantity of each operator with the input port quantity specified in the operator design, and if the input quantity of each operator is equal to the input port quantity specified in the operator design, judging that the operator is currently effective; if not, the operator is judged to be invalid;
the operator parameter judgment specifically comprises the following steps: and judging the quantity of the parameters of each operator, and if the parameter values of the operators are empty, judging the operators to be invalid.
Preferably, the step S4 further includes the following steps performed between the steps S42 and S46:
S43, judge whether operator opj3 is valid: if opj3 is currently valid, the validity of opj3 is set equal to the validity of opi3; if opj3 is currently invalid, opj3 keeps its original invalid state;
S44, repeat steps S42-S43 until every column that is not 0 on row i3 has been processed;
S45, judge whether operator opi3 is valid: if opi3 is valid, add opi3 to the executable task queue intQ and go to step S46; if opi3 is invalid, do not add it to the executable task queue intQ and keep it invalid;
Preferably, while step S43 is performed, the execution category of operator opi3 is also marked; the marking principle is as follows:
The execution categories are: normal, merge start, merge end, and in merge. Among the operator categories are the merge-processing start operator and the merge-processing end operator. The execution category of a merge-processing start operator is merge start, the execution category of a merge-processing end operator is merge end, operators located between a merge-processing start operator and a merge-processing end operator in the task flow have the execution category in merge, and operators located outside such a pair in the task flow have the execution category normal;
after step S47, an operator execution category validity judgment needs to be performed, where the operator execution category validity judgment specifically includes the following steps:
s51, dequeuing an intQ head operator q of the executable task queue to obtain an operator type typeq and an execution type dtq;
s52, carrying out the following validity judgment on the operator q according to typeq and dtq:
a. when the execution category of operator q is merge start, judge whether an input of operator q is the output of an in-merge operator or of another merge-start operator; if so, operator q is invalidated; if not, operator q keeps its original invalid or valid state;
b. when the execution category of operator q is in merge and operator q is a multi-input operator, judge whether one input of operator q is a double-dataset data stream and the other input is a single-dataset data stream; if so, operator q keeps its original invalid or valid state; if not, operator q is invalidated;
c. when the execution category of operator q is in merge, judge whether the output of operator q is transmitted to a write-data-source operator; if so, operator q is invalidated; if not, operator q keeps its original invalid or valid state;
d. an operator q not covered by cases a, b and c is not judged and keeps its original invalid or valid state;
S53, check the invalid or valid state of operator q: if operator q is valid, add operator q to the final execution queue taskQ; if operator q is invalid, operator q is discarded;
S54, repeat steps S51-S53 until the executable task queue intQ is empty, obtaining the final execution queue taskQ.
The invention also provides a big data support platform, comprising: a user terminal and a server that use the above loop-free data analysis queue forming method.
Compared with the prior art, the invention has the following beneficial effects:
By providing steps S3 and S4, the prior-art problem that an infinite loop occurs during bottom-layer invocation when the user builds a closed loop into a task flow is solved; infinite loops are avoided during computation, the CPU load is reduced, more CPU capacity is left for computing correct task flows, and unnecessary computation is avoided.
Drawings
Fig. 1 is example 1 of the loop judgment of step S4 in embodiment 1;
Fig. 2 is example 2 of the loop judgment of step S4 in embodiment 1;
Fig. 3 is example 3, in which an operator is judged invalid under case a of step S52 in embodiment 1;
Fig. 4 is example 4, in which an operator is judged invalid under case b of step S52 in embodiment 1;
Fig. 5 is example 5, in which an operator is judged invalid under case c of step S52 in embodiment 1.
Detailed Description
Example 1:
a loop-free data analysis queue forming method comprises the following steps:
s1, a user constructs a task flow through the user terminal and fills in each operator parameter used by the task flow;
s2, the server receives the task flow from the user terminal and each operator parameter used by the task flow;
s3, the server establishes an adjacency matrix M:
S31, setting the task flow as a directed graph G, setting the nodes of the directed graph G as the positions of the operators in the task flow, setting the number of operators in the task flow as N, and establishing an N×N adjacency matrix;
S32, for each pair of operators with a directed edge from one to the other, the following processing is performed: if there is a directed edge from an operator i1 to another operator j1, the value Mi1j1 in row i1, column j1 of the adjacency matrix M is assigned 1; and for each pair of operators without such a directed edge, the following processing is performed: if there is no directed edge from an operator i2 to another operator j2, the value Mi2j2 in row i2, column j2 of the adjacency matrix M is assigned 0 (an illustrative code sketch of this construction follows the step listing below);
S4, loop judgment:
S41, find a column i3 of M whose values are all 0, and find the operator opi3 corresponding to column i3;
S42, operator opi3 corresponds to row i3; find a column j3 whose value on row i3 is not 0, and find the operator opj3 corresponding to j3;
S46, delete row i3 and column i3 from M, and repeat steps S41-S45 until no column of matrix M has values that are all 0;
S47, the operators corresponding to the rows remaining in M are judged to be invalid.
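The following is a minimal illustrative sketch (not taken from the patent) of how steps S31-S32 can be realized; the operator names and the edge-list representation of the task flow are assumptions introduced for the example.

```python
# Illustrative sketch of steps S31-S32: building the N x N adjacency matrix M
# from the task flow. Operator names and the edge-list format are assumptions
# made for this example, not part of the patent text.

def build_adjacency_matrix(operators, edges):
    """operators: ordered list of operator names (the nodes of directed graph G).
    edges: list of (src, dst) pairs, one per directed edge in the task flow.
    Returns the N x N adjacency matrix M with M[i][j] = 1 iff there is a
    directed edge from operator i to operator j (step S32), else 0."""
    n = len(operators)
    index = {op: k for k, op in enumerate(operators)}
    m = [[0] * n for _ in range(n)]          # all entries initialised to 0
    for src, dst in edges:
        m[index[src]][index[dst]] = 1        # Mi1j1 = 1 for each directed edge
    return m

# Task flow of Fig. 1: filter -> feature transformation -> clustering
ops = ["filter", "feature_transform", "clustering"]
M = build_adjacency_matrix(ops, [("filter", "feature_transform"),
                                 ("feature_transform", "clustering")])
# M == [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
```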
If the task flow established by the user is as shown in Fig. 1, a 3×3 adjacency matrix M is established. Since there is a directed edge from operator 11 to operator 12, M12 = 1; since there is a directed edge from operator 12 to operator 13, M23 = 1; all other entries are 0. The adjacency matrix of example 1 is therefore

    M = | 0 1 0 |
        | 0 0 1 |
        | 0 0 0 |

When step S4 is performed:

When step S41 is performed for the first time, the column found in step S41 is column 1, whose values are all 0; column 1 corresponds to the filter operator in the figure. The filter operator corresponds to row 1, and the column whose value on row 1 is not 0 is column 2, which corresponds to the feature transformation operator. If the filter operator is valid, the feature transformation operator is valid: a valid preceding operator makes the following operator valid. Column 1 and row 1 are then deleted, giving

    M = | 0 1 |
        | 0 0 |

at which point column 1 corresponds to the feature transformation operator, column 2 corresponds to the clustering operator, row 1 corresponds to the feature transformation operator, and row 2 corresponds to the clustering operator;

when step S41 is performed for the second time, the column found in step S41 is again column 1, which now corresponds to the feature transformation operator. The feature transformation operator corresponds to row 1, and the column whose value on row 1 is not 0 is column 2, which corresponds to the clustering operator. Since the feature transformation operator is valid, the clustering operator is valid. Column 1 and row 1 are deleted, giving M = (0); at this point column 1 and row 1 both correspond to the clustering operator;

when step S41 is performed for the third time, row 1 and column 1 are deleted and nothing remains of M, which shows that the task flow has no closed loop, meets the requirements, and will not cause infinite-loop computation.
If the task flow established by the user is as shown in Fig. 2, a 3×3 adjacency matrix M is established. Since there is a directed edge from operator 11 to operator 12, M12 = 1; since there is a directed edge from operator 12 to operator 13, M23 = 1; since there is a directed edge from operator 13 to operator 11, M31 = 1; all other entries are 0. The adjacency matrix of example 2 is therefore

    M = | 0 1 0 |
        | 0 0 1 |
        | 1 0 0 |

When step S4 is performed:

When step S41 is performed, no column whose values are all 0 is found in step S41, so all operators are judged invalid: the filter operator, the feature transformation operator and the clustering operator form a closed loop, which would cause infinite-loop computation.
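Continuing the sketch above, the loop judgment of steps S41-S47 can be realized as repeated deletion of an all-zero column together with its row, with whatever remains reported as invalid; the function name and the reuse of the Fig. 1 and Fig. 2 operators are assumptions for illustration, not part of the patent text.

```python
# Illustrative sketch of steps S41-S47 (loop judgment). Names are assumptions.

def find_cyclic_operators(operators, m):
    """Returns the set of operators judged invalid by step S47, i.e. the
    operators that remain after columns whose values are all 0 (and the
    corresponding rows) have been repeatedly deleted (steps S41, S42, S46)."""
    m = [row[:] for row in m]                 # work on a copy of M
    remaining = list(range(len(operators)))   # indices of rows/columns still in M
    while remaining:
        # S41: find a column i3 whose values are all 0 (no incoming edge)
        source = next((i for i in remaining
                       if all(m[r][i] == 0 for r in remaining)), None)
        if source is None:                    # no all-zero column left
            break
        # S42 would visit every j3 with m[source][j3] != 0 (the successors);
        # the validity propagation of steps S43-S45 is sketched separately below.
        remaining.remove(source)              # S46: delete row i3 and column i3
    # S47: operators corresponding to the remaining rows are invalid (on a loop)
    return {operators[i] for i in remaining}

# Fig. 1 (no closed loop): nothing remains, so no operator is judged invalid
assert find_cyclic_operators(ops, M) == set()

# Fig. 2 (closed loop filter -> transform -> clustering -> filter):
M2 = build_adjacency_matrix(ops, [("filter", "feature_transform"),
                                  ("feature_transform", "clustering"),
                                  ("clustering", "filter")])
assert find_cyclic_operators(ops, M2) == {"filter", "feature_transform", "clustering"}
```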
For convenience of subsequent understanding, we describe the computation package stored in the server. A large number of operators are stored in the computation package; for data analysis, they are roughly classified by operator category into the following 5 classes:
class 1: data Source read and write
The data source read-write class comprises two operators of a read data source and a write data source, and a plurality of operators of the read data source and the write data source can be arranged in one working space.
The data source reading operator is a starting point of a task flow, a list consisting of all data files of a current user needs to be obtained, and one file is selected as a data source and is transmitted into the flow. An operator for reading the data source is not actually realized on the bottom layer, the function of the operator is to transmit parameters, and only the path of the selected file needs to be transmitted to the subsequent operator, so that the task of reading the data source is completed.
The data writing source operator is the end point of the task flow and is used for storing the intermediate result input by the data writing source operator as the final result of the file name set by the user, and the concrete realization of the data writing source operator is arranged on the bottom layer.
Class 2: data pre-processing
The data preprocessing class contains a plurality of operators for data preprocessing, including string processing, data set processing, table processing, and the like.
The processing of the character string comprises operations of segmentation, combination, substring fetching and the like. In order to ensure the fine granularity of the operator, the invention provides a single character string processing operator which can only process one field in the DataFrame. If a plurality of character strings need to be processed, the method can be realized in a multi-step operator superposition mode. Each character string processing operator can generate a plurality of new fields according to specific conditions, for example, a substring taking operation is carried out on the character string.
Although a data set, once it enters an operator, is held in memory as a DataFrame, that is, as a table, this description distinguishes data set processing from table processing: table processing involves specific fields and SQL table operations, whereas data set processing operates only on the data set itself and involves no specific fields.
The processing of the data set mainly aims at the scale of the data set, and comprises proportional sampling, proportional splitting of the data set, merging of the data set and the like.
The processing of the table includes linking, deduplication, filtering, field deletion, modifying field types, mathematical formula calculation on fields, and the like. Table processing benefits from Spark SQL and from the optimization of DataFrame since Spark 2.0: because Spark provides an SQL interface and supports a large number of core SQL operations, table processing can be performed entirely through SQL. SQL is a skill most data analysts must master, so using SQL to process tables fits analysts' habits well and completes the data processing tasks well.
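As a hedged illustration of the SQL-based table processing described above, the sketch below registers a DataFrame as a temporary view and runs a filter plus a field calculation through Spark SQL; the data, column names and session setup are assumptions for the example, not taken from the platform.

```python
# Minimal PySpark sketch of SQL-based table processing (filtering plus a
# mathematical formula on a field). Data and column names are assumed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("table-processing-sketch").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34, 180.0), ("bob", 45, 260.0), ("carol", 29, 95.5)],
    ["name", "age", "amount"])

# Registering the DataFrame as a temporary view lets each table operator
# (filter, deduplicate, field deletion, formula calculation, ...) be a SQL query.
df.createOrReplaceTempView("t")
result = spark.sql("SELECT name, amount * 1.1 AS adjusted_amount FROM t WHERE age > 30")
result.show()
```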
Figure GDA0002407511280000062
Class 3: feature engineering
The feature engineering operator is mainly based on a feature packet in Spark MLlib, and relates to the aspects of text processing, feature transformation, feature encoding and the like.
In the case of feature engineering, all processing of features is a transformation of features, but in the present description, a distinction is made between them. In the invention, the transformation of the features refers to the transformation of the features of the same type, and the coding of the features refers to the transformation of the features of other types into the features of numerical type through coding.
The processing of the text is related to Natural Language Processing (NLP) and is mainly used for acquiring key information of the text, including word segmentation, stop word deletion, NGram, TF-IDF and the like.
The transformation of the features mainly comprises Principal Component Analysis (PCA), polynomial expansion, various scaling, binarization and the like. The transformation of the features is mainly used for the standardization of the features and ensuring the independence of the features, so that the accuracy of a model trained later can be improved.
The feature coding mainly comprises one-hot coding, character string coding, vector coding and the like. Since most models require that features be represented numerically, it is necessary to numerically represent various types of features in the feature engineering section.
Class 4: model (model)
The model-category operators mainly comprise classification, regression, clustering and collaborative filtering models. Since Spark is a distributed parallel computing framework and most models were not designed with a distributed environment in mind, Spark adopts model algorithms modified and adapted to the distributed environment; on specific problems their effect may differ slightly from that of the single-machine model algorithms.
The classification model mainly comprises random forest classification, GBDT classification, logistic regression and the like; the regression model mainly comprises random forest regression, GBDT regression, linear regression and the like; the clustering model mainly comprises Kmeans clustering and the like; the collaborative filtering mainly comprises an alternating least square method ALS and the like.
The model is used as a core part of machine learning and is also a core component for a data analysis platform. The invention provides abundant models, which can meet the requirements of analysts to a certain extent.
Class 5: merging process
In an analysis task, the same operation may be performed on different data sets, for example calculating the same mathematical expression on identically named fields of two data sets. When this happens, the parts that need to perform the same operation can be merged. Because each operator runs as a single-function Spark task, the computation it actually performs is small and takes little time; by comparison, starting the program and initializing the context take long and account for a large proportion of the total time. Merge processing therefore reduces the number of tasks, and the start-up and context-initialization time of the original two operators can be shortened to roughly half.
In machine learning, multiple data sets usually need to be processed: there is typically a training data set, a data set to be classified or clustered, and possibly a test set. Feature engineering must be performed on these data sets, and some feature engineering operations must be applied to several data sets at the same time; the transformation relation learned in such an operation has to be applied to all of the data sets, and it has a large influence on the result. Take the most common case of string encoding on a training set and a test set: the strings in the training set are sorted by their number of occurrences and then encoded numerically; to make the encoded result equally valid for the test set, the test set must be transformed according to the mapping produced by this encoding step. If the training set and the test set were string-encoded separately, the same string would likely occur a different number of times in the two data sets, the encodings would differ, and the model would ultimately be wrong. To prevent such situations, the invention abstracts this kind of processing into merge processing: during merging, the other data sets are processed together with the feature engineering information obtained from the training set, which guarantees a consistent encoding result.
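To illustrate the consistency problem described above, the following sketch fits a Spark MLlib StringIndexer on the training set only and reuses the fitted mapping for the test set, which is the behaviour the merge processing is intended to guarantee; the data and column names are assumptions, and this is an illustration rather than the platform's actual implementation.

```python
# Sketch of consistent string encoding across training and test sets with
# Spark MLlib's StringIndexer: the index mapping is fitted on the training set
# and the same fitted model transforms the test set. Data is assumed.
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.appName("merge-encoding-sketch").getOrCreate()

train = spark.createDataFrame([("a",), ("b",), ("a",), ("c",)], ["label"])
test = spark.createDataFrame([("b",), ("c",), ("a",)], ["label"])

indexer = StringIndexer(inputCol="label", outputCol="label_idx",
                        handleInvalid="keep")     # tolerate unseen strings
model = indexer.fit(train)          # mapping derived from training-set frequencies
train_idx = model.transform(train)  # encoded training set
test_idx = model.transform(test)    # test set encoded with the same mapping

# Encoding train and test independently could assign different codes to the
# same string; reusing the fitted model is what the merge processing ensures.
```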
In order to satisfy the input-port definitions of operators of other categories, during merge processing two data sets are operated on as one, forming a double-dataset data stream. Operators of other categories can therefore be used inside the merge process without implementing a second input configuration for the same operator; instead, the bottom layer judges according to the position of each operator and performs the correct processing.
The design of the merging operator reasonably optimizes the use of resources under the condition of no error, and can solve the problem of conversion of characteristic engineering information in machine learning, thereby improving the overall efficiency and solving the potential problem.
Because merge processing exists and operators are not implemented in a separate two-dataset version, the execution category of a Spark operator needs to be judged, which tells the operator how many data sets it must operate on.
Operators are roughly divided into four execution categories according to the position of the operator in the task flow and the category of the operator: normal, start of merging processing, end of merging processing, and in merging processing. The execution category of the operator can provide part of control basis for the flow control of the operator later.
The operator in the merging process is between the operator at the beginning of the merging process and the operator at the end of the merging process, and the other operators are ordinary operators.
Normally, each input port of an operator inputs a single data set data stream. Its input operator must be either the normal case operator or the merge process end operator.
The merge process start operator has two input ports and one output port and functions to merge two data sets into one data stream for processing. The input operator is necessarily an ordinary operator or a merging processing ending operator, and each input port acquires a single data set data stream.
The merge-processing end operator has one input port and two output ports; its input operator must be the merge-processing start operator or a subsequent in-merge operator, its input port acquires a double-dataset data stream, and its outputs are single-dataset data streams.
The output ports of the operators in the merging process both output two data sets. When the operator is a single input operator, the input port inputs a double data set data stream, and the output port outputs the double data set data stream. When the operator is a double-input operator, one of the input ports inputs a double-dataset data stream, namely the output of the operator in the merging processing, and the other input port inputs a single-dataset data stream, namely the output of the operator in the ordinary condition.
Filtering, reading a data source, clustering and the like are examples of the operator categories of operators.
In order to ensure that every input of a multi-input operator actually receives an input (otherwise the operator cannot be computed), and to ensure that every parameter of an operator requiring several parameters has been set, operator input-number judgment and operator parameter judgment are performed after step S3 and before step S4;
the operator input number judgment specifically comprises the following steps: firstly, according to a task flow constructed by a user, counting the input quantity of each operator; then, comparing the counted input quantity of each operator with the input port quantity specified in the operator design, and if the input quantity of each operator is equal to the input port quantity specified in the operator design, judging that the operator is currently effective; if not, the operator is judged to be invalid;
the operator parameter judgment specifically comprises the following steps: and judging the quantity of the parameters of each operator, and if the parameter values of the operators are empty, judging the operators to be invalid.
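A minimal sketch, under assumptions, of the operator input-number judgment and operator parameter judgment performed between steps S3 and S4; the Operator record and its fields are hypothetical names introduced only for the illustration.

```python
# Sketch of the operator input-number judgment and operator parameter judgment
# performed after step S3 and before step S4. The Operator record and its
# fields are assumptions made for this illustration.
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class Operator:
    name: str
    declared_input_ports: int                 # input ports specified in the operator design
    params: Dict[str, Optional[str]] = field(default_factory=dict)
    valid: bool = True

def check_inputs_and_params(operators: List[Operator],
                            edges: List[Tuple[str, str]]) -> None:
    # Count the actual inputs of every operator from the task flow's edges.
    actual_inputs = {op.name: 0 for op in operators}
    for _, dst in edges:
        actual_inputs[dst] += 1
    for op in operators:
        # Input-number judgment: actual inputs must equal the declared port count.
        if actual_inputs[op.name] != op.declared_input_ports:
            op.valid = False
        # Parameter judgment: any empty parameter value invalidates the operator.
        if any(v is None or v == "" for v in op.params.values()):
            op.valid = False
```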
Step S4 further includes the following steps performed between step S42 and step S46:
S43, judge whether operator opj3 is valid: if opj3 is currently valid, the validity of opj3 is set equal to the validity of opi3; if opj3 is currently invalid, opj3 keeps its original invalid state (this step ensures that when an upstream operator in the task flow is invalid, the downstream operator is also invalid, so that a downstream operator is not still computed when its upstream operator produces no output; invalidity is thereby propagated);
S44, repeat steps S42 to S43 until every column that is not 0 on row i3 has been processed;
S45, judge whether operator opi3 is valid: if opi3 is valid, add opi3 to the executable task queue intQ and go to step S46; if opi3 is invalid, do not add it to the executable task queue intQ and keep it invalid (building the executable task queue intQ simplifies the later execution-category validity judgment: operators rejected by the loop judgment, the operator input-number judgment, the operator parameter judgment, or invalidated because an upstream operator was invalid are already excluded, so the execution-category judgment does not need to re-check invalid operators); a sketch of steps S41-S46 together with this validity propagation follows;
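As referenced above, this sketch combines the elimination of steps S41-S46 with the validity propagation of steps S43-S45 and the construction of the executable task queue intQ; it builds on the hypothetical Operator record and adjacency-matrix helper sketched earlier and is an illustration under those assumptions, not the patent's implementation.

```python
# Sketch of steps S41-S46 together with S43-S45: topological elimination with
# downstream propagation of invalidity and construction of the executable task
# queue intQ. Builds on the Operator record and adjacency matrix sketched above.

def build_executable_queue(operators, m):
    """operators: list of Operator records in the same order as the rows of m.
    Returns (int_q, leftover) where int_q is the executable task queue of valid
    operators and leftover are the operators remaining in M (judged invalid, S47)."""
    m = [row[:] for row in m]
    remaining = list(range(len(operators)))
    int_q = []
    while remaining:
        # S41: column i3 whose values are all 0 (operator with no pending inputs)
        i3 = next((i for i in remaining
                   if all(m[r][i] == 0 for r in remaining)), None)
        if i3 is None:
            break
        opi3 = operators[i3]
        # S42-S44: visit every successor opj3 on row i3
        for j3 in remaining:
            if m[i3][j3] != 0:
                opj3 = operators[j3]
                # S43: a currently valid successor inherits opi3's validity;
                # a currently invalid successor stays invalid.
                if opj3.valid:
                    opj3.valid = opi3.valid
        # S45: only valid operators join the executable task queue intQ
        if opi3.valid:
            int_q.append(opi3)
        remaining.remove(i3)                  # S46: delete row i3 and column i3
    # S47: whatever remains lies on a closed loop and is invalid
    leftover = [operators[i] for i in remaining]
    for op in leftover:
        op.valid = False
    return int_q, leftover
```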
To facilitate the subsequent execution-category validity judgment of each operator, the execution category of operator opi3 is also marked while step S43 is performed; the marking principle is as follows:
The execution categories are: normal, merge start, merge end, and in merge. Among the operator categories are the merge-processing start operator and the merge-processing end operator. The execution category of a merge-processing start operator is merge start, the execution category of a merge-processing end operator is merge end, operators located between a merge-processing start operator and a merge-processing end operator in the task flow have the execution category in merge, and operators located outside such a pair in the task flow have the execution category normal;
after step S47, an operator execution category validity judgment needs to be performed, where the operator execution category validity judgment specifically includes the following steps:
s51, dequeuing an intQ head operator q of the executable task queue to obtain an operator type typeq and an execution type dtq;
s52, carrying out the following validity judgment on the operator q according to typeq and dtq:
a. when the execution category of operator q is merge start, judge whether an input of operator q is the output of an in-merge operator or of another merge-start operator; if so, operator q is invalidated; if not, operator q keeps its original invalid or valid state; (this step prevents two merge processes from being nested or overlapped: the merged output would otherwise have to be separated into three or four results after the in-merge operators, and the results would become confused; the invalid case is shown in Fig. 3)
b. when the execution category of operator q is in merge and operator q is a multi-input operator, judge whether one input of operator q is a double-dataset data stream and the other input is a single-dataset data stream; if so, operator q keeps its original invalid or valid state; if not, operator q is invalidated; (this step prevents two double-dataset outputs from being fed into the same multi-input operator inside the merge process, which could not be separated at the merge end and would confuse the results; the invalid case is shown in Fig. 4)
c. when the execution category of operator q is in merge, judge whether the output of operator q is transmitted to a write-data-source operator; if so, operator q is invalidated; if not, operator q keeps its original invalid or valid state; (the intermediate result inside a merge process is not the data that should be written to the server and writing it is not legal; this prevents the result of an in-merge operator from being stored through a write-data-source operator and confusing the stored data; the invalid case is shown in Fig. 5)
d. an operator q not covered by cases a, b and c is not judged and keeps its original invalid or valid state;
S53, check the invalid or valid state of operator q: if operator q is valid, add operator q to the final execution queue taskQ; if operator q is invalid, operator q is discarded; (this step removes the operators invalidated under cases a, b and c, so the confusions described above cannot occur)
S54, repeat steps S51-S53 until the executable task queue intQ is empty, obtaining the final execution queue taskQ.
The final execution queue taskQ thus contains the valid operators in order, with all invalid operators removed, so that the server can execute the valid operators, display the results to the client, and report the invalid operators.
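A sketch of the execution-category validity judgment of steps S51-S54, under the assumption that each queue entry carries its execution category and a few precomputed facts about its inputs and outputs; the field names below are hypothetical and introduced only for this illustration of cases a-d.

```python
# Sketch of steps S51-S54: dequeue each operator q from intQ, apply the
# execution-category checks a-d, and build the final execution queue taskQ.
# The fields on each queue entry (execution category, multi-input flag,
# information about its inputs/outputs) are assumptions for this illustration.
from collections import deque

MERGE_START, MERGE_END, IN_MERGE, NORMAL = "merge_start", "merge_end", "in_merge", "normal"

def category_validity_filter(int_q):
    """int_q: iterable of operator records carrying at least .execution_category,
    .valid, .multi_input, and the booleans .input_from_in_merge_or_merge_start,
    .has_one_double_and_one_single_input, .outputs_to_write_data_source."""
    queue = deque(int_q)
    task_q = []
    while queue:                              # S54: repeat until intQ is empty
        q = queue.popleft()                   # S51: dequeue the head operator
        cat = q.execution_category            # S52: judge validity by category
        if cat == MERGE_START and q.input_from_in_merge_or_merge_start:
            q.valid = False                   # case a: nested/overlapping merge
        elif cat == IN_MERGE and q.multi_input and not q.has_one_double_and_one_single_input:
            q.valid = False                   # case b: two double-dataset inputs
        elif cat == IN_MERGE and q.outputs_to_write_data_source:
            q.valid = False                   # case c: in-merge result written out
        # case d: anything else keeps its current state
        if q.valid:                           # S53: valid operators go to taskQ
            task_q.append(q)
    return task_q
```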
Example 2:
This embodiment provides a big data support platform, comprising: a user terminal and a server that use the loop-free data analysis queue forming method of embodiment 1.
Finally, although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that the present invention can be modified or replaced by other means without departing from the spirit and scope of the present invention, which should be construed as limited only by the appended claims.

Claims (5)

1. A loop-free data analysis queue forming method is characterized by comprising the following steps:
s1, a user constructs a task flow through the user terminal and fills in each operator parameter used by the task flow;
s2, the server receives the task flow from the user terminal and each operator parameter used by the task flow;
s3, the server establishes an adjacency matrix M:
S31, setting the task flow as a directed graph G, setting the nodes of the directed graph G as the positions of the operators in the task flow, setting the number of operators in the task flow as N, and establishing an N×N adjacency matrix;
S32, for each pair of operators with a directed edge from one to the other, the following processing is performed: if there is a directed edge from an operator i1 to another operator j1, the value Mi1j1 in row i1, column j1 of the adjacency matrix M is assigned 1; and for each pair of operators without such a directed edge, the following processing is performed: if there is no directed edge from an operator i2 to another operator j2, the value Mi2j2 in row i2, column j2 of the adjacency matrix M is assigned 0;
S4, loop judgment:
S41, find a column i3 of M whose values are all 0, and find the operator opi3 corresponding to column i3;
S42, operator opi3 corresponds to row i3; find a column j3 whose value on row i3 is not 0, and find the operator opj3 corresponding to j3;
S46, delete row i3 and column i3 from M, and repeat steps S41-S42 until no column of matrix M has values that are all 0;
S47, the operators corresponding to the rows remaining in M are judged to be invalid.
2. The method of forming a loop-free data analysis queue of claim 1,
after step S3 and before step S4, operator input number determination and operator parameter determination are performed;
the operator input number judgment specifically comprises the following steps: firstly, according to a task flow constructed by a user, counting the input quantity of each operator; then, comparing the counted input quantity of each operator with the input port quantity specified in the operator design, and if the input quantity of each operator is equal to the input port quantity specified in the operator design, judging that the operator is currently effective; if not, the operator is judged to be invalid;
the operator parameter judgment specifically comprises the following steps: and judging the quantity of the parameters of each operator, and if the parameter values of the operators are empty, judging the operators to be invalid.
3. The loop-free data analysis queue forming method according to claim 2, wherein the step S4 further includes the following steps performed between the steps S42 and S46:
S43, judge whether operator opj3 is valid: if opj3 is currently valid, the validity of opj3 is set equal to the validity of opi3; if opj3 is currently invalid, opj3 keeps its original invalid state; S44, repeat steps S42-S43 until every column that is not 0 on row i3 has been processed;
S45, judge whether operator opi3 is valid: if opi3 is valid, add opi3 to the executable task queue intQ and go to step S46; if opi3 is invalid, do not add it to the executable task queue intQ and keep it invalid.
4. The method as claimed in claim 3, wherein, while step S43 is performed, the execution category of operator opi3 is also marked; the marking principle is as follows:
The execution categories are: normal, merge start, merge end, and in merge. Among the operator categories are the merge-processing start operator and the merge-processing end operator. The execution category of a merge-processing start operator is merge start, the execution category of a merge-processing end operator is merge end, operators located between a merge-processing start operator and a merge-processing end operator in the task flow have the execution category in merge, and operators located outside such a pair in the task flow have the execution category normal;
after step S47, an operator execution category validity judgment needs to be performed, where the operator execution category validity judgment specifically includes the following steps:
s51, dequeuing an intQ head operator q of the executable task queue to obtain an operator type typeq and an execution type dtq;
s52, carrying out the following validity judgment on the operator q according to typeq and dtq:
a. when the execution category of operator q is merge start, judge whether an input of operator q is the output of an in-merge operator or of another merge-start operator; if so, operator q is invalidated; if not, operator q keeps its original invalid or valid state;
b. when the execution category of operator q is in merge and operator q is a multi-input operator, judge whether one input of operator q is a double-dataset data stream and the other input is a single-dataset data stream; if so, operator q keeps its original invalid or valid state; if not, operator q is invalidated;
c. when the execution category of operator q is in merge, judge whether the output of operator q is transmitted to a write-data-source operator; if so, operator q is invalidated; if not, operator q keeps its original invalid or valid state;
d. an operator q not covered by cases a, b and c is not judged and keeps its original invalid or valid state;
S53, check the invalid or valid state of operator q: if operator q is valid, add operator q to the final execution queue taskQ; if operator q is invalid, operator q is discarded;
S54, repeat steps S51-S53 until the executable task queue intQ is empty, obtaining the final execution queue taskQ.
5. A big data support platform, comprising: a user terminal and a server using the loop-free data analysis queue forming method according to any one of claims 1 to 4.
CN201710847663.0A 2017-09-19 2017-09-19 Method for forming loop-free data analysis queue and big data support platform comprising loop-free data analysis queue Active CN107688663B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710847663.0A CN107688663B (en) 2017-09-19 2017-09-19 Method for forming loop-free data analysis queue and big data support platform comprising loop-free data analysis queue

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710847663.0A CN107688663B (en) 2017-09-19 2017-09-19 Method for forming loop-free data analysis queue and big data support platform comprising loop-free data analysis queue

Publications (2)

Publication Number Publication Date
CN107688663A CN107688663A (en) 2018-02-13
CN107688663B true CN107688663B (en) 2020-06-05

Family

ID=61156311

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710847663.0A Active CN107688663B (en) 2017-09-19 2017-09-19 Method for forming loop-free data analysis queue and big data support platform comprising loop-free data analysis queue

Country Status (1)

Country Link
CN (1) CN107688663B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115167352B (en) * 2022-07-05 2023-05-02 南方电网科学研究院有限责任公司 Algebraic loop identification method and device for electric power simulation secondary control system


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7617194B2 (en) * 2006-12-29 2009-11-10 Microsoft Corporation Supervised ranking of vertices of a directed graph

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102270204A (en) * 2010-06-02 2011-12-07 上海佳艾商务信息咨询有限公司 Method for calculating influence of online bulletin board system users based on matrix decomposition
CN102420701A (en) * 2011-11-28 2012-04-18 北京邮电大学 Method for extracting internet service flow characteristics
CN106682343A (en) * 2016-08-31 2017-05-17 电子科技大学 Method for formally verifying adjacent matrixes on basis of diagrams

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Spark task parameter optimization based on operation data analysis; 陈侨安 et al.; Computer Engineering & Science (《计算机工程与科学》); 2016-01-31; Vol. 38, No. 1; pp. 11-19 *

Also Published As

Publication number Publication date
CN107688663A (en) 2018-02-13

Similar Documents

Publication Publication Date Title
CN107590254B (en) Big data support platform with merging processing method
AU2018272840B2 (en) Automated dependency analyzer for heterogeneously programmed data processing system
Zhang et al. On complexity and optimization of expensive queries in complex event processing
CN106020950B (en) The identification of function call graph key node and identification method based on Complex Networks Analysis
CN108052394B (en) Resource allocation method based on SQL statement running time and computer equipment
US20180032375A1 (en) Data Processing Method and Apparatus
US10936950B1 (en) Processing sequential interaction data
JP2011186729A (en) Data processing device
CN108984155B (en) Data processing flow setting method and device
WO2022126984A1 (en) Cache data detection method and apparatus, computer device and storage medium
CA3179300C (en) Domain-specific language interpreter and interactive visual interface for rapid screening
EP4006909B1 (en) Method, apparatus and device for quality control and storage medium
CN106293891B (en) Multidimensional investment index monitoring method
JP7098327B2 (en) Information processing system, function creation method and function creation program
CN104834599A (en) WEB security detection method and device
CN107688663B (en) Method for forming loop-free data analysis queue and big data support platform comprising loop-free data analysis queue
US20160154634A1 (en) Modifying an analytic flow
WO2023093909A1 (en) Workflow node recommendation method and apparatus
CN110888888A (en) Personnel relationship analysis method and device, electronic equipment and storage medium
JP2010072876A (en) Rule creation program, rule creation method, and rule creation device
TW201619822A (en) Variable inference system and method for software program
JP6336922B2 (en) Business impact location extraction method and business impact location extraction device based on business variations
Wang et al. Interactive inconsistency fixing in feature modeling
IL300167A (en) Natural solution language
EP2924560A1 (en) Apparatus and process for automating discovery of effective algorithm configurations for data processing using evolutionary graphical search

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant