CN107688663B - Method for forming loop-free data analysis queue and big data support platform comprising loop-free data analysis queue - Google Patents

Method for forming loop-free data analysis queue and big data support platform comprising loop-free data analysis queue

Info

Publication number
CN107688663B
Authority
CN
China
Prior art keywords
operator
invalid
merging
task flow
operators
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710847663.0A
Other languages
Chinese (zh)
Other versions
CN107688663A (en)
Inventor
高英
成昱霖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201710847663.0A priority Critical patent/CN107688663B/en
Publication of CN107688663A publication Critical patent/CN107688663A/en
Application granted granted Critical
Publication of CN107688663B publication Critical patent/CN107688663B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/22 Indexing; Data structures therefor; Storage structures
    • G06F 16/2228 Indexing structures
    • G06F 16/2237 Vectors, bitmaps or matrices
    • G06F 16/24 Querying
    • G06F 16/242 Query formulation
    • G06F 16/2433 Query languages
    • G06F 16/245 Query processing
    • G06F 16/2457 Query processing with adaptation to user needs
    • G06F 16/24578 Query processing with adaptation to user needs using ranking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer And Data Communications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for forming a loop-free data analysis queue and a big data support platform comprising the loop-free data analysis queue. The method for forming the loop-free data analysis queue comprises the following steps: S1, a user constructs a task flow through a user terminal and fills in each operator parameter used by the task flow; S2, a server receives the task flow and each operator parameter used by the task flow from the user terminal; S3, the server establishes an adjacency matrix M: the task flow is treated as a directed graph G whose nodes are the positions of the operators in the task flow, the number of operators in the task flow is N, and an N×N adjacency matrix is established; and S4, loop judgment is performed. The loop-free data analysis queue forming method and the big data support platform comprising it solve the prior-art problem that a closed loop built into a user's task flow causes an infinite loop when the bottom layer is invoked.

Description

Method for forming loop-free data analysis queue and big data support platform comprising loop-free data analysis queue
Technical Field
The invention relates to a data analysis platform, in particular to a loop-free data analysis queue forming method and a big data support platform comprising the loop-free data analysis queue forming method.
Background
Chinese patent CN105550268B discloses a big data flow modeling analysis engine, which comprises a platform layer, a task scheduling layer and an interface layer. The platform layer completes resource scheduling and allocation. The task scheduling layer comprises a verification module, an analysis module, a task scheduling module and an algorithm package: the verification module judges whether a data analysis flow conforms to the flow design rules, and only the parts that pass verification enter the analysis module; the analysis module converts the data analysis flow generated by the interface layer into an executable data analysis flow task; the task scheduling module schedules the data analysis algorithm interfaces in the algorithm package according to the complete data analysis flow produced by the analysis module, forms a complete executable analysis flow task program, and schedules bottom-layer resources to execute the data analysis program. The interface layer provides a platform interface for data analysis modeling operations: each data analysis algorithm package appears on the interface as a draggable component with a unique identifier; the user manipulates the algorithm components through the interface and connects them with directed lines that represent the data analysis flow directions and steps, combining them into a complete business data analysis algorithm model; a start function of the interface invokes the background task scheduling module and the algorithm packages and schedules resources to complete rapid analysis and processing of the data. Although this big data flow modeling analysis engine can, to a certain extent, process a large amount of data efficiently and quickly, it has the following disadvantage:
The big data flow modeling analysis engine lets a user create a data analysis flowchart, the platform then automatically generates the operator queue corresponding to the flowchart, the user inputs the data to be analyzed, and the data is processed by each operator in the operator queue in turn to finally obtain the analysis result. However, while the user creates the data analysis flowchart, a closed loop may well be introduced; when the data to be analyzed is then processed, computation inside the closed loop never terminates, an infinite loop occurs, a large amount of CPU capacity is occupied, and other computations slow down.
Disclosure of Invention
The invention provides a loop-free data analysis queue forming method and a big data support platform comprising it, and solves the prior-art problem that a closed loop built into a user's task flow causes an infinite loop when the bottom layer is invoked.
In order to achieve the purpose, the invention adopts the following technical scheme:
a loop-free data analysis queue forming method comprises the following steps:
s1, a user constructs a task flow through the user terminal and fills in each operator parameter used by the task flow;
s2, the server receives the task flow from the user terminal and each operator parameter used by the task flow;
s3, the server establishes an adjacency matrix M:
S31, setting the task flow as a directed graph G, setting the nodes of the directed graph G as the positions of the operators in the task flow, setting the number of operators in the task flow as N, and establishing an N×N adjacency matrix;
S32, for each pair of operators with a directed edge from one to the other, the following processing is performed: if there is a directed edge from an operator i1 to another operator j1, the value Mi1j1 in row i1, column j1 of the adjacency matrix M is assigned 1; and for each pair of operators without such a directed edge, the following processing is performed: if there is no directed edge from an operator i2 to another operator j2, the value Mi2j2 in row i2, column j2 of the adjacency matrix M is assigned 0;
S4, loop judgment:
S41, find a column i3 of M whose values are all 0, and find the operator opi3 corresponding to column i3;
S42, operator opi3 corresponds to row i3; find a column j3 whose value on row i3 is not 0, and find the operator opj3 corresponding to j3;
S46, delete row i3 and column i3 from M, and repeat steps S41-S42 until no column of matrix M has values that are all 0;
S47, the operators corresponding to the rows remaining in M are judged to be invalid.
Preferably,
after step S3 and before step S4, operator input number determination and operator parameter determination are performed;
the operator input number judgment specifically comprises the following steps: firstly, according to a task flow constructed by a user, counting the input quantity of each operator; then, comparing the counted input quantity of each operator with the input port quantity specified in the operator design, and if the input quantity of each operator is equal to the input port quantity specified in the operator design, judging that the operator is currently effective; if not, the operator is judged to be invalid;
the operator parameter judgment specifically comprises the following steps: and judging the quantity of the parameters of each operator, and if the parameter values of the operators are empty, judging the operators to be invalid.
Preferably, the step S4 further includes the following steps performed between the steps S42 and S46:
S43, judge whether operator opj3 is valid: if opj3 is currently valid, the validity of opj3 is set equal to the validity of opi3; if opj3 is currently invalid, opj3 keeps its original invalid state;
S44, repeat steps S42-S43 until every column that is not 0 on row i3 has been processed;
S45, judge whether operator opi3 is valid: if opi3 is valid, add opi3 to the executable task queue intQ and go to step S46; if opi3 is invalid, do not add it to the executable task queue intQ and keep it invalid;
Preferably, while step S43 is performed, the execution category of operator opi3 is also marked; the marking principle is as follows:
The execution categories are: normal, merge start, merge end, and in merge. Among the operator categories are the merge-processing start operator and the merge-processing end operator. The execution category of a merge-processing start operator is merge start, the execution category of a merge-processing end operator is merge end, operators located between a merge-processing start operator and a merge-processing end operator in the task flow have the execution category in merge, and operators located outside such a pair in the task flow have the execution category normal;
after step S47, an operator execution category validity judgment needs to be performed, where the operator execution category validity judgment specifically includes the following steps:
s51, dequeuing an intQ head operator q of the executable task queue to obtain an operator type typeq and an execution type dtq;
s52, carrying out the following validity judgment on the operator q according to typeq and dtq:
a. when the execution category of operator q is merge start, judge whether an input of operator q is the output of an in-merge operator or of another merge-start operator; if so, operator q is invalidated; if not, operator q keeps its original invalid or valid state;
b. when the execution category of operator q is in merge and operator q is a multi-input operator, judge whether one input of operator q is a double-dataset data stream and the other input is a single-dataset data stream; if so, operator q keeps its original invalid or valid state; if not, operator q is invalidated;
c. when the execution category of operator q is in merge, judge whether the output of operator q is transmitted to a write-data-source operator; if so, operator q is invalidated; if not, operator q keeps its original invalid or valid state;
d. an operator q not covered by cases a, b and c is not judged and keeps its original invalid or valid state;
S53, check the invalid or valid state of operator q: if operator q is valid, add operator q to the final execution queue taskQ; if operator q is invalid, operator q is discarded;
S54, repeat steps S51-S53 until the executable task queue intQ is empty, obtaining the final execution queue taskQ.
The invention also provides a big data support platform, comprising: a user terminal and a server that use the above loop-free data analysis queue forming method.
Compared with the prior art, the invention has the following beneficial effects:
By providing steps S3 and S4, the prior-art problem that an infinite loop occurs during bottom-layer invocation when the user builds a closed loop into a task flow is solved; infinite loops are avoided during computation, the CPU load is reduced, more CPU capacity is left for computing correct task flows, and unnecessary computation is avoided.
Drawings
Fig. 1 is example 1 of the loop judgment of step S4 in embodiment 1;
Fig. 2 is example 2 of the loop judgment of step S4 in embodiment 1;
Fig. 3 is example 3, in which an operator is judged invalid under case a of step S52 in embodiment 1;
Fig. 4 is example 4, in which an operator is judged invalid under case b of step S52 in embodiment 1;
Fig. 5 is example 5, in which an operator is judged invalid under case c of step S52 in embodiment 1.
Detailed Description
Example 1:
a loop-free data analysis queue forming method comprises the following steps:
s1, a user constructs a task flow through the user terminal and fills in each operator parameter used by the task flow;
s2, the server receives the task flow from the user terminal and each operator parameter used by the task flow;
s3, the server establishes an adjacency matrix M:
S31, setting the task flow as a directed graph G, setting the nodes of the directed graph G as the positions of the operators in the task flow, setting the number of operators in the task flow as N, and establishing an N×N adjacency matrix;
S32, for each pair of operators with a directed edge from one to the other, the following processing is performed: if there is a directed edge from an operator i1 to another operator j1, the value Mi1j1 in row i1, column j1 of the adjacency matrix M is assigned 1; and for each pair of operators without such a directed edge, the following processing is performed: if there is no directed edge from an operator i2 to another operator j2, the value Mi2j2 in row i2, column j2 of the adjacency matrix M is assigned 0 (an illustrative code sketch of this construction follows the step listing below);
S4, loop judgment:
S41, find a column i3 of M whose values are all 0, and find the operator opi3 corresponding to column i3;
S42, operator opi3 corresponds to row i3; find a column j3 whose value on row i3 is not 0, and find the operator opj3 corresponding to j3;
S46, delete row i3 and column i3 from M, and repeat steps S41-S45 until no column of matrix M has values that are all 0;
S47, the operators corresponding to the rows remaining in M are judged to be invalid.
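The following is a minimal illustrative sketch (not taken from the patent) of how steps S31-S32 can be realized; the operator names and the edge-list representation of the task flow are assumptions introduced for the example.

```python
# Illustrative sketch of steps S31-S32: building the N x N adjacency matrix M
# from the task flow. Operator names and the edge-list format are assumptions
# made for this example, not part of the patent text.

def build_adjacency_matrix(operators, edges):
    """operators: ordered list of operator names (the nodes of directed graph G).
    edges: list of (src, dst) pairs, one per directed edge in the task flow.
    Returns the N x N adjacency matrix M with M[i][j] = 1 iff there is a
    directed edge from operator i to operator j (step S32), else 0."""
    n = len(operators)
    index = {op: k for k, op in enumerate(operators)}
    m = [[0] * n for _ in range(n)]          # all entries initialised to 0
    for src, dst in edges:
        m[index[src]][index[dst]] = 1        # Mi1j1 = 1 for each directed edge
    return m

# Task flow of Fig. 1: filter -> feature transformation -> clustering
ops = ["filter", "feature_transform", "clustering"]
M = build_adjacency_matrix(ops, [("filter", "feature_transform"),
                                 ("feature_transform", "clustering")])
# M == [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
```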
If the task flow established by the user is as shown in Fig. 1, a 3×3 adjacency matrix M is established. Since there is a directed edge from operator 11 to operator 12, M12 = 1; since there is a directed edge from operator 12 to operator 13, M23 = 1; all other entries are 0. The adjacency matrix of example 1 is therefore

    M = | 0 1 0 |
        | 0 0 1 |
        | 0 0 0 |

When step S4 is performed:

When step S41 is performed for the first time, the column found in step S41 is column 1, whose values are all 0; column 1 corresponds to the filter operator in the figure. The filter operator corresponds to row 1, and the column whose value on row 1 is not 0 is column 2, which corresponds to the feature transformation operator. If the filter operator is valid, the feature transformation operator is valid: a valid preceding operator makes the following operator valid. Column 1 and row 1 are then deleted, giving

    M = | 0 1 |
        | 0 0 |

at which point column 1 corresponds to the feature transformation operator, column 2 corresponds to the clustering operator, row 1 corresponds to the feature transformation operator, and row 2 corresponds to the clustering operator;

when step S41 is performed for the second time, the column found in step S41 is again column 1, which now corresponds to the feature transformation operator. The feature transformation operator corresponds to row 1, and the column whose value on row 1 is not 0 is column 2, which corresponds to the clustering operator. Since the feature transformation operator is valid, the clustering operator is valid. Column 1 and row 1 are deleted, giving M = (0); at this point column 1 and row 1 both correspond to the clustering operator;

when step S41 is performed for the third time, row 1 and column 1 are deleted and nothing remains of M, which shows that the task flow has no closed loop, meets the requirements, and will not cause infinite-loop computation.
If the task flow established by the user is as shown in Fig. 2, a 3×3 adjacency matrix M is established. Since there is a directed edge from operator 11 to operator 12, M12 = 1; since there is a directed edge from operator 12 to operator 13, M23 = 1; since there is a directed edge from operator 13 to operator 11, M31 = 1; all other entries are 0. The adjacency matrix of example 2 is therefore

    M = | 0 1 0 |
        | 0 0 1 |
        | 1 0 0 |

When step S4 is performed:

When step S41 is performed, no column whose values are all 0 is found in step S41, so all operators are judged invalid: the filter operator, the feature transformation operator and the clustering operator form a closed loop, which would cause infinite-loop computation.
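Continuing the sketch above, the loop judgment of steps S41-S47 can be realized as repeated deletion of an all-zero column together with its row, with whatever remains reported as invalid; the function name and the reuse of the Fig. 1 and Fig. 2 operators are assumptions for illustration, not part of the patent text.

```python
# Illustrative sketch of steps S41-S47 (loop judgment). Names are assumptions.

def find_cyclic_operators(operators, m):
    """Returns the set of operators judged invalid by step S47, i.e. the
    operators that remain after columns whose values are all 0 (and the
    corresponding rows) have been repeatedly deleted (steps S41, S42, S46)."""
    m = [row[:] for row in m]                 # work on a copy of M
    remaining = list(range(len(operators)))   # indices of rows/columns still in M
    while remaining:
        # S41: find a column i3 whose values are all 0 (no incoming edge)
        source = next((i for i in remaining
                       if all(m[r][i] == 0 for r in remaining)), None)
        if source is None:                    # no all-zero column left
            break
        # S42 would visit every j3 with m[source][j3] != 0 (the successors);
        # the validity propagation of steps S43-S45 is sketched separately below.
        remaining.remove(source)              # S46: delete row i3 and column i3
    # S47: operators corresponding to the remaining rows are invalid (on a loop)
    return {operators[i] for i in remaining}

# Fig. 1 (no closed loop): nothing remains, so no operator is judged invalid
assert find_cyclic_operators(ops, M) == set()

# Fig. 2 (closed loop filter -> transform -> clustering -> filter):
M2 = build_adjacency_matrix(ops, [("filter", "feature_transform"),
                                  ("feature_transform", "clustering"),
                                  ("clustering", "filter")])
assert find_cyclic_operators(ops, M2) == {"filter", "feature_transform", "clustering"}
```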
For convenience of subsequent understanding, we describe the computation package stored in the server. A large number of operators are stored in the computation package; for data analysis, they are roughly classified by operator category into the following 5 classes:
class 1: data Source read and write
The data source read-write class comprises two operators of a read data source and a write data source, and a plurality of operators of the read data source and the write data source can be arranged in one working space.
The data source reading operator is a starting point of a task flow, a list consisting of all data files of a current user needs to be obtained, and one file is selected as a data source and is transmitted into the flow. An operator for reading the data source is not actually realized on the bottom layer, the function of the operator is to transmit parameters, and only the path of the selected file needs to be transmitted to the subsequent operator, so that the task of reading the data source is completed.
The data writing source operator is the end point of the task flow and is used for storing the intermediate result input by the data writing source operator as the final result of the file name set by the user, and the concrete realization of the data writing source operator is arranged on the bottom layer.
Class 2: data pre-processing
The data preprocessing class contains a plurality of operators for data preprocessing, including string processing, data set processing, table processing, and the like.
The processing of the character string comprises operations of segmentation, combination, substring fetching and the like. In order to ensure the fine granularity of the operator, the invention provides a single character string processing operator which can only process one field in the DataFrame. If a plurality of character strings need to be processed, the method can be realized in a multi-step operator superposition mode. Each character string processing operator can generate a plurality of new fields according to specific conditions, for example, a substring taking operation is carried out on the character string.
Although a data set, once it enters an operator, is held in memory as a DataFrame, that is, as a table, this description distinguishes data set processing from table processing: table processing involves specific fields and SQL table operations, whereas data set processing operates only on the data set itself and involves no specific fields.
The processing of the data set mainly aims at the scale of the data set, and comprises proportional sampling, proportional splitting of the data set, merging of the data set and the like.
The processing of the table includes linking, deduplication, filtering, field deletion, modifying field types, mathematical formula calculation on fields, and the like. Table processing benefits from Spark SQL and from the optimization of DataFrame since Spark 2.0: because Spark provides an SQL interface and supports a large number of core SQL operations, table processing can be performed entirely through SQL. SQL is a skill most data analysts must master, so using SQL to process tables fits analysts' habits well and completes the data processing tasks well.
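As a hedged illustration of the SQL-based table processing described above, the sketch below registers a DataFrame as a temporary view and runs a filter plus a field calculation through Spark SQL; the data, column names and session setup are assumptions for the example, not taken from the platform.

```python
# Minimal PySpark sketch of SQL-based table processing (filtering plus a
# mathematical formula on a field). Data and column names are assumed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("table-processing-sketch").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34, 180.0), ("bob", 45, 260.0), ("carol", 29, 95.5)],
    ["name", "age", "amount"])

# Registering the DataFrame as a temporary view lets each table operator
# (filter, deduplicate, field deletion, formula calculation, ...) be a SQL query.
df.createOrReplaceTempView("t")
result = spark.sql("SELECT name, amount * 1.1 AS adjusted_amount FROM t WHERE age > 30")
result.show()
```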
Figure GDA0002407511280000062
Class 3: feature engineering
The feature engineering operator is mainly based on a feature packet in Spark MLlib, and relates to the aspects of text processing, feature transformation, feature encoding and the like.
In the case of feature engineering, all processing of features is a transformation of features, but in the present description, a distinction is made between them. In the invention, the transformation of the features refers to the transformation of the features of the same type, and the coding of the features refers to the transformation of the features of other types into the features of numerical type through coding.
The processing of the text is related to Natural Language Processing (NLP) and is mainly used for acquiring key information of the text, including word segmentation, stop word deletion, NGram, TF-IDF and the like.
The transformation of the features mainly comprises Principal Component Analysis (PCA), polynomial expansion, various scaling, binarization and the like. The transformation of the features is mainly used for the standardization of the features and ensuring the independence of the features, so that the accuracy of a model trained later can be improved.
The feature coding mainly comprises one-hot coding, character string coding, vector coding and the like. Since most models require that features be represented numerically, it is necessary to numerically represent various types of features in the feature engineering section.
Class 4: model (model)
The model-category operators mainly comprise classification, regression, clustering and collaborative filtering models. Since Spark is a distributed parallel computing framework and most models were not designed with a distributed environment in mind, Spark adopts model algorithms modified and adapted to the distributed environment; on specific problems their effect may differ slightly from that of the single-machine model algorithms.
The classification model mainly comprises random forest classification, GBDT classification, logistic regression and the like; the regression model mainly comprises random forest regression, GBDT regression, linear regression and the like; the clustering model mainly comprises Kmeans clustering and the like; the collaborative filtering mainly comprises an alternating least square method ALS and the like.
The model is used as a core part of machine learning and is also a core component for a data analysis platform. The invention provides abundant models, which can meet the requirements of analysts to a certain extent.
Class 5: merging process
In an analysis task, the same operation may be performed on different data sets, for example calculating the same mathematical expression on identically named fields of two data sets. When this happens, the parts that need to perform the same operation can be merged. Because each operator runs as a single-function Spark task, the computation it actually performs is small and takes little time; by comparison, starting the program and initializing the context take long and account for a large proportion of the total time. Merge processing therefore reduces the number of tasks, and the start-up and context-initialization time of the original two operators can be shortened to roughly half.
In machine learning, multiple data sets usually need to be processed: there is typically a training data set, a data set to be classified or clustered, and possibly a test set. Feature engineering must be performed on these data sets, and some feature engineering operations must be applied to several data sets at the same time; the transformation relation learned in such an operation has to be applied to all of the data sets, and it has a large influence on the result. Take the most common case of string encoding on a training set and a test set: the strings in the training set are sorted by their number of occurrences and then encoded numerically; to make the encoded result equally valid for the test set, the test set must be transformed according to the mapping produced by this encoding step. If the training set and the test set were string-encoded separately, the same string would likely occur a different number of times in the two data sets, the encodings would differ, and the model would ultimately be wrong. To prevent such situations, the invention abstracts this kind of processing into merge processing: during merging, the other data sets are processed together with the feature engineering information obtained from the training set, which guarantees a consistent encoding result.
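To illustrate the consistency problem described above, the following sketch fits a Spark MLlib StringIndexer on the training set only and reuses the fitted mapping for the test set, which is the behaviour the merge processing is intended to guarantee; the data and column names are assumptions, and this is an illustration rather than the platform's actual implementation.

```python
# Sketch of consistent string encoding across training and test sets with
# Spark MLlib's StringIndexer: the index mapping is fitted on the training set
# and the same fitted model transforms the test set. Data is assumed.
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.appName("merge-encoding-sketch").getOrCreate()

train = spark.createDataFrame([("a",), ("b",), ("a",), ("c",)], ["label"])
test = spark.createDataFrame([("b",), ("c",), ("a",)], ["label"])

indexer = StringIndexer(inputCol="label", outputCol="label_idx",
                        handleInvalid="keep")     # tolerate unseen strings
model = indexer.fit(train)          # mapping derived from training-set frequencies
train_idx = model.transform(train)  # encoded training set
test_idx = model.transform(test)    # test set encoded with the same mapping

# Encoding train and test independently could assign different codes to the
# same string; reusing the fitted model is what the merge processing ensures.
```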
In order to satisfy the input-port definitions of operators of other categories, during merge processing two data sets are operated on as one, forming a double-dataset data stream. Operators of other categories can therefore be used inside the merge process without implementing a second input configuration for the same operator; instead, the bottom layer judges according to the position of each operator and performs the correct processing.
The design of the merging operator reasonably optimizes the use of resources under the condition of no error, and can solve the problem of conversion of characteristic engineering information in machine learning, thereby improving the overall efficiency and solving the potential problem.
Because merge processing exists and operators are not implemented in a separate two-dataset version, the execution category of a Spark operator needs to be judged, which tells the operator how many data sets it must operate on.
Operators are roughly divided into four execution categories according to the position of the operator in the task flow and the category of the operator: normal, start of merging processing, end of merging processing, and in merging processing. The execution category of the operator can provide part of control basis for the flow control of the operator later.
The operator in the merging process is between the operator at the beginning of the merging process and the operator at the end of the merging process, and the other operators are ordinary operators.
Normally, each input port of an operator inputs a single data set data stream. Its input operator must be either the normal case operator or the merge process end operator.
The merge process start operator has two input ports and one output port and functions to merge two data sets into one data stream for processing. The input operator is necessarily an ordinary operator or a merging processing ending operator, and each input port acquires a single data set data stream.
The merge-processing end operator has one input port and two output ports; its input operator must be the merge-processing start operator or a subsequent in-merge operator, its input port acquires a double-dataset data stream, and its outputs are single-dataset data streams.
The output ports of the operators in the merging process both output two data sets. When the operator is a single input operator, the input port inputs a double data set data stream, and the output port outputs the double data set data stream. When the operator is a double-input operator, one of the input ports inputs a double-dataset data stream, namely the output of the operator in the merging processing, and the other input port inputs a single-dataset data stream, namely the output of the operator in the ordinary condition.
Filtering, reading a data source, clustering and the like are examples of the operator categories of operators.
In order to ensure that every input of a multi-input operator actually receives an input (otherwise the operator cannot be computed), and to ensure that every parameter of an operator requiring several parameters has been set, operator input-number judgment and operator parameter judgment are performed after step S3 and before step S4;
the operator input number judgment specifically comprises the following steps: firstly, according to a task flow constructed by a user, counting the input quantity of each operator; then, comparing the counted input quantity of each operator with the input port quantity specified in the operator design, and if the input quantity of each operator is equal to the input port quantity specified in the operator design, judging that the operator is currently effective; if not, the operator is judged to be invalid;
the operator parameter judgment specifically comprises the following steps: and judging the quantity of the parameters of each operator, and if the parameter values of the operators are empty, judging the operators to be invalid.
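A minimal sketch, under assumptions, of the operator input-number judgment and operator parameter judgment performed between steps S3 and S4; the Operator record and its fields are hypothetical names introduced only for the illustration.

```python
# Sketch of the operator input-number judgment and operator parameter judgment
# performed after step S3 and before step S4. The Operator record and its
# fields are assumptions made for this illustration.
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class Operator:
    name: str
    declared_input_ports: int                 # input ports specified in the operator design
    params: Dict[str, Optional[str]] = field(default_factory=dict)
    valid: bool = True

def check_inputs_and_params(operators: List[Operator],
                            edges: List[Tuple[str, str]]) -> None:
    # Count the actual inputs of every operator from the task flow's edges.
    actual_inputs = {op.name: 0 for op in operators}
    for _, dst in edges:
        actual_inputs[dst] += 1
    for op in operators:
        # Input-number judgment: actual inputs must equal the declared port count.
        if actual_inputs[op.name] != op.declared_input_ports:
            op.valid = False
        # Parameter judgment: any empty parameter value invalidates the operator.
        if any(v is None or v == "" for v in op.params.values()):
            op.valid = False
```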
Step S4 further includes the following steps performed between step S42 and step S46:
S43, judge whether operator opj3 is valid: if opj3 is currently valid, the validity of opj3 is set equal to the validity of opi3; if opj3 is currently invalid, opj3 keeps its original invalid state (this step ensures that when an upstream operator in the task flow is invalid, the downstream operator is also invalid, so that a downstream operator is not still computed when its upstream operator produces no output; invalidity is thereby propagated);
S44, repeat steps S42 to S43 until every column that is not 0 on row i3 has been processed;
S45, judge whether operator opi3 is valid: if opi3 is valid, add opi3 to the executable task queue intQ and go to step S46; if opi3 is invalid, do not add it to the executable task queue intQ and keep it invalid (building the executable task queue intQ simplifies the later execution-category validity judgment: operators rejected by the loop judgment, the operator input-number judgment, the operator parameter judgment, or invalidated because an upstream operator was invalid are already excluded, so the execution-category judgment does not need to re-check invalid operators); a sketch of steps S41-S46 together with this validity propagation follows;
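As referenced above, this sketch combines the elimination of steps S41-S46 with the validity propagation of steps S43-S45 and the construction of the executable task queue intQ; it builds on the hypothetical Operator record and adjacency-matrix helper sketched earlier and is an illustration under those assumptions, not the patent's implementation.

```python
# Sketch of steps S41-S46 together with S43-S45: topological elimination with
# downstream propagation of invalidity and construction of the executable task
# queue intQ. Builds on the Operator record and adjacency matrix sketched above.

def build_executable_queue(operators, m):
    """operators: list of Operator records in the same order as the rows of m.
    Returns (int_q, leftover) where int_q is the executable task queue of valid
    operators and leftover are the operators remaining in M (judged invalid, S47)."""
    m = [row[:] for row in m]
    remaining = list(range(len(operators)))
    int_q = []
    while remaining:
        # S41: column i3 whose values are all 0 (operator with no pending inputs)
        i3 = next((i for i in remaining
                   if all(m[r][i] == 0 for r in remaining)), None)
        if i3 is None:
            break
        opi3 = operators[i3]
        # S42-S44: visit every successor opj3 on row i3
        for j3 in remaining:
            if m[i3][j3] != 0:
                opj3 = operators[j3]
                # S43: a currently valid successor inherits opi3's validity;
                # a currently invalid successor stays invalid.
                if opj3.valid:
                    opj3.valid = opi3.valid
        # S45: only valid operators join the executable task queue intQ
        if opi3.valid:
            int_q.append(opi3)
        remaining.remove(i3)                  # S46: delete row i3 and column i3
    # S47: whatever remains lies on a closed loop and is invalid
    leftover = [operators[i] for i in remaining]
    for op in leftover:
        op.valid = False
    return int_q, leftover
```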
To facilitate the subsequent execution-category validity judgment of each operator, the execution category of operator opi3 is also marked while step S43 is performed; the marking principle is as follows:
The execution categories are: normal, merge start, merge end, and in merge. Among the operator categories are the merge-processing start operator and the merge-processing end operator. The execution category of a merge-processing start operator is merge start, the execution category of a merge-processing end operator is merge end, operators located between a merge-processing start operator and a merge-processing end operator in the task flow have the execution category in merge, and operators located outside such a pair in the task flow have the execution category normal;
after step S47, an operator execution category validity judgment needs to be performed, where the operator execution category validity judgment specifically includes the following steps:
s51, dequeuing an intQ head operator q of the executable task queue to obtain an operator type typeq and an execution type dtq;
s52, carrying out the following validity judgment on the operator q according to typeq and dtq:
a. when the execution category of operator q is merge start, judge whether an input of operator q is the output of an in-merge operator or of another merge-start operator; if so, operator q is invalidated; if not, operator q keeps its original invalid or valid state; (this step prevents two merge processes from being nested or overlapped: the merged output would otherwise have to be separated into three or four results after the in-merge operators, and the results would become confused; the invalid case is shown in Fig. 3)
b. when the execution category of operator q is in merge and operator q is a multi-input operator, judge whether one input of operator q is a double-dataset data stream and the other input is a single-dataset data stream; if so, operator q keeps its original invalid or valid state; if not, operator q is invalidated; (this step prevents two double-dataset outputs from being fed into the same multi-input operator inside the merge process, which could not be separated at the merge end and would confuse the results; the invalid case is shown in Fig. 4)
c. when the execution category of operator q is in merge, judge whether the output of operator q is transmitted to a write-data-source operator; if so, operator q is invalidated; if not, operator q keeps its original invalid or valid state; (the intermediate result inside a merge process is not the data that should be written to the server and writing it is not legal; this prevents the result of an in-merge operator from being stored through a write-data-source operator and confusing the stored data; the invalid case is shown in Fig. 5)
d. an operator q not covered by cases a, b and c is not judged and keeps its original invalid or valid state;
S53, check the invalid or valid state of operator q: if operator q is valid, add operator q to the final execution queue taskQ; if operator q is invalid, operator q is discarded; (this step removes the operators invalidated under cases a, b and c, so the confusions described above cannot occur)
S54, repeat steps S51-S53 until the executable task queue intQ is empty, obtaining the final execution queue taskQ.
The final execution queue taskQ thus contains the valid operators in order, with all invalid operators removed, so that the server can execute the valid operators, display the results to the client, and report the invalid operators.
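A sketch of the execution-category validity judgment of steps S51-S54, under the assumption that each queue entry carries its execution category and a few precomputed facts about its inputs and outputs; the field names below are hypothetical and introduced only for this illustration of cases a-d.

```python
# Sketch of steps S51-S54: dequeue each operator q from intQ, apply the
# execution-category checks a-d, and build the final execution queue taskQ.
# The fields on each queue entry (execution category, multi-input flag,
# information about its inputs/outputs) are assumptions for this illustration.
from collections import deque

MERGE_START, MERGE_END, IN_MERGE, NORMAL = "merge_start", "merge_end", "in_merge", "normal"

def category_validity_filter(int_q):
    """int_q: iterable of operator records carrying at least .execution_category,
    .valid, .multi_input, and the booleans .input_from_in_merge_or_merge_start,
    .has_one_double_and_one_single_input, .outputs_to_write_data_source."""
    queue = deque(int_q)
    task_q = []
    while queue:                              # S54: repeat until intQ is empty
        q = queue.popleft()                   # S51: dequeue the head operator
        cat = q.execution_category            # S52: judge validity by category
        if cat == MERGE_START and q.input_from_in_merge_or_merge_start:
            q.valid = False                   # case a: nested/overlapping merge
        elif cat == IN_MERGE and q.multi_input and not q.has_one_double_and_one_single_input:
            q.valid = False                   # case b: two double-dataset inputs
        elif cat == IN_MERGE and q.outputs_to_write_data_source:
            q.valid = False                   # case c: in-merge result written out
        # case d: anything else keeps its current state
        if q.valid:                           # S53: valid operators go to taskQ
            task_q.append(q)
    return task_q
```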
Example 2:
This embodiment provides a big data support platform, comprising: a user terminal and a server that use the loop-free data analysis queue forming method of embodiment 1.
Finally, although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that the present invention can be modified or replaced by other means without departing from the spirit and scope of the present invention, which should be construed as limited only by the appended claims.

Claims (5)

1. A loop-free data analysis queue forming method is characterized by comprising the following steps:
s1, a user constructs a task flow through the user terminal and fills in each operator parameter used by the task flow;
s2, the server receives the task flow from the user terminal and each operator parameter used by the task flow;
s3, the server establishes an adjacency matrix M:
S31, setting the task flow as a directed graph G, setting the nodes of the directed graph G as the positions of the operators in the task flow, setting the number of operators in the task flow as N, and establishing an N×N adjacency matrix;
S32, for each pair of operators with a directed edge from one to the other, the following processing is performed: if there is a directed edge from an operator i1 to another operator j1, the value Mi1j1 in row i1, column j1 of the adjacency matrix M is assigned 1; and for each pair of operators without such a directed edge, the following processing is performed: if there is no directed edge from an operator i2 to another operator j2, the value Mi2j2 in row i2, column j2 of the adjacency matrix M is assigned 0;
S4, loop judgment:
S41, find a column i3 of M whose values are all 0, and find the operator opi3 corresponding to column i3;
S42, operator opi3 corresponds to row i3; find a column j3 whose value on row i3 is not 0, and find the operator opj3 corresponding to j3;
S46, delete row i3 and column i3 from M, and repeat steps S41-S42 until no column of matrix M has values that are all 0;
S47, the operators corresponding to the rows remaining in M are judged to be invalid.
2. The method of forming a loop-free data analysis queue of claim 1,
after step S3 and before step S4, operator input number determination and operator parameter determination are performed;
the operator input number judgment specifically comprises the following steps: firstly, according to a task flow constructed by a user, counting the input quantity of each operator; then, comparing the counted input quantity of each operator with the input port quantity specified in the operator design, and if the input quantity of each operator is equal to the input port quantity specified in the operator design, judging that the operator is currently effective; if not, the operator is judged to be invalid;
the operator parameter judgment specifically comprises the following steps: and judging the quantity of the parameters of each operator, and if the parameter values of the operators are empty, judging the operators to be invalid.
3. The loop-free data analysis queue forming method according to claim 2, wherein the step S4 further includes the following steps performed between the steps S42 and S46:
S43, judge whether operator opj3 is valid: if opj3 is currently valid, the validity of opj3 is set equal to the validity of opi3; if opj3 is currently invalid, opj3 keeps its original invalid state; S44, repeat steps S42-S43 until every column that is not 0 on row i3 has been processed;
S45, judge whether operator opi3 is valid: if opi3 is valid, add opi3 to the executable task queue intQ and go to step S46; if opi3 is invalid, do not add it to the executable task queue intQ and keep it invalid.
4. The method as claimed in claim 3, wherein, while step S43 is performed, the execution category of operator opi3 is also marked; the marking principle is as follows:
The execution categories are: normal, merge start, merge end, and in merge. Among the operator categories are the merge-processing start operator and the merge-processing end operator. The execution category of a merge-processing start operator is merge start, the execution category of a merge-processing end operator is merge end, operators located between a merge-processing start operator and a merge-processing end operator in the task flow have the execution category in merge, and operators located outside such a pair in the task flow have the execution category normal;
after step S47, an operator execution category validity judgment needs to be performed, where the operator execution category validity judgment specifically includes the following steps:
s51, dequeuing an intQ head operator q of the executable task queue to obtain an operator type typeq and an execution type dtq;
s52, carrying out the following validity judgment on the operator q according to typeq and dtq:
a. when the execution category of operator q is merge start, judge whether an input of operator q is the output of an in-merge operator or of another merge-start operator; if so, operator q is invalidated; if not, operator q keeps its original invalid or valid state;
b. when the execution category of operator q is in merge and operator q is a multi-input operator, judge whether one input of operator q is a double-dataset data stream and the other input is a single-dataset data stream; if so, operator q keeps its original invalid or valid state; if not, operator q is invalidated;
c. when the execution category of operator q is in merge, judge whether the output of operator q is transmitted to a write-data-source operator; if so, operator q is invalidated; if not, operator q keeps its original invalid or valid state;
d. an operator q not covered by cases a, b and c is not judged and keeps its original invalid or valid state;
S53, check the invalid or valid state of operator q: if operator q is valid, add operator q to the final execution queue taskQ; if operator q is invalid, operator q is discarded;
S54, repeat steps S51-S53 until the executable task queue intQ is empty, obtaining the final execution queue taskQ.
5. A big data support platform, comprising: a user terminal and a server using the loop-free data analysis queue forming method according to any one of claims 1 to 4.
CN201710847663.0A 2017-09-19 2017-09-19 Method for forming loop-free data analysis queue and big data support platform comprising loop-free data analysis queue Active CN107688663B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710847663.0A CN107688663B (en) 2017-09-19 2017-09-19 Method for forming loop-free data analysis queue and big data support platform comprising loop-free data analysis queue

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710847663.0A CN107688663B (en) 2017-09-19 2017-09-19 Method for forming loop-free data analysis queue and big data support platform comprising loop-free data analysis queue

Publications (2)

Publication Number Publication Date
CN107688663A CN107688663A (en) 2018-02-13
CN107688663B true CN107688663B (en) 2020-06-05

Family

ID=61156311

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710847663.0A Active CN107688663B (en) 2017-09-19 2017-09-19 Method for forming loop-free data analysis queue and big data support platform comprising loop-free data analysis queue

Country Status (1)

Country Link
CN (1) CN107688663B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115167352B (en) * 2022-07-05 2023-05-02 南方电网科学研究院有限责任公司 Algebraic loop identification method and device for electric power simulation secondary control system


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7617194B2 (en) * 2006-12-29 2009-11-10 Microsoft Corporation Supervised ranking of vertices of a directed graph

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102270204A (en) * 2010-06-02 2011-12-07 上海佳艾商务信息咨询有限公司 Method for calculating influence of online bulletin board system users based on matrix decomposition
CN102420701A (en) * 2011-11-28 2012-04-18 北京邮电大学 Method for extracting internet service flow characteristics
CN106682343A (en) * 2016-08-31 2017-05-17 电子科技大学 Method for formally verifying adjacent matrixes on basis of diagrams

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Spark task parameter optimization based on operation data analysis; 陈侨安 et al.; Computer Engineering & Science (《计算机工程与科学》); 2016-01-31; Vol. 38, No. 1; pp. 11-19 *

Also Published As

Publication number Publication date
CN107688663A (en) 2018-02-13

Similar Documents

Publication Publication Date Title
CN107590254B (en) Big data support platform with merging processing method
AU2018272840B2 (en) Automated dependency analyzer for heterogeneously programmed data processing system
Zhang et al. On complexity and optimization of expensive queries in complex event processing
CN106020950B (en) The identification of function call graph key node and identification method based on Complex Networks Analysis
CN108052394B (en) Resource allocation method based on SQL statement running time and computer equipment
US20180032375A1 (en) Data Processing Method and Apparatus
US10936950B1 (en) Processing sequential interaction data
JP2011186729A (en) Data processing device
CN108984155B (en) Data processing flow setting method and device
WO2022126984A1 (en) Cache data detection method and apparatus, computer device and storage medium
CA3179300C (en) Domain-specific language interpreter and interactive visual interface for rapid screening
EP4006909B1 (en) Method, apparatus and device for quality control and storage medium
CN106293891B (en) Multidimensional investment index monitoring method
JP7098327B2 (en) Information processing system, function creation method and function creation program
CN104834599A (en) WEB security detection method and device
CN107688663B (en) Method for forming loop-free data analysis queue and big data support platform comprising loop-free data analysis queue
US20160154634A1 (en) Modifying an analytic flow
WO2023093909A1 (en) Workflow node recommendation method and apparatus
CN110888888A (en) Personnel relationship analysis method and device, electronic equipment and storage medium
JP2010072876A (en) Rule creation program, rule creation method, and rule creation device
TW201619822A (en) Variable inference system and method for software program
JP6336922B2 (en) Business impact location extraction method and business impact location extraction device based on business variations
Wang et al. Interactive inconsistency fixing in feature modeling
IL300167A (en) Natural solution language
EP2924560A1 (en) Apparatus and process for automating discovery of effective algorithm configurations for data processing using evolutionary graphical search

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant