CN105573836B - Data processing method and device - Google Patents
- Publication number
- CN105573836B (application CN201610098936.1A)
- Authority
- CN
- China
- Prior art keywords
- node
- data processing
- processing model
- object instance
- model object
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/48—Indexing scheme relating to G06F9/48
- G06F2209/483—Multiproc
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
An embodiment of the invention discloses a data processing method and device in which a data processing model is represented as a directed graph. When an instruction carrying a node list is received from a client, then for any node in the node list: if the data set corresponding to the node's parent node has not been processed, the parent node's data set is processed first; if it has been processed, the parent node's output data set is read directly from the execution context as the node's input data set, the input data set is processed based on the data set corresponding to the node to generate the node's output data set, and that output data set is recorded in the execution context. The data sets of successfully processed nodes are therefore not processed again, so only the data of some of the nodes needs to be processed, which improves data processing efficiency.
Description
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data processing method and apparatus.
Background
Spark is an efficient distributed computing system that can perform data mining and analysis on terabyte (TB) scale data. To process data with Spark, an analyst must know one of three languages: Java, Scala, or Python. Typically, the analyst implements a data analysis scenario as a fixed program in one of these languages, compiles the program into a machine-readable file, and then loads, interprets, and executes the file through a Java virtual machine.
However, in a data analysis scenario, an analyst often has no clear analysis plan at the early stage. The analyst needs to try various statistical algorithms on the data and, drawing on experience, finally settles on the most effective or most interpretable data analysis process. During this process the analyst changes the program many times, and each change requires the program file to be compiled and executed again. This causes two problems. First, every modify, compile, and execute cycle takes the analyst a certain amount of time. Second, re-executing the program causes all nodes in the data processing flow to be re-executed; in big data processing a program run is very time-consuming, so the analyst wastes a large amount of time waiting for the result of the modified program. Overall, data processing efficiency is low.
Therefore, how to improve data processing efficiency has become an urgent problem to be solved.
Disclosure of Invention
The invention aims to provide a data processing method and a data processing device so as to improve the data processing efficiency.
In order to achieve the purpose, the invention provides the following technical scheme:
a method of data processing, comprising:
acquiring a data processing model object instance corresponding to a data processing model description file based on the data processing model description file sent by a client; the data processing model description file is obtained by converting a data processing model, the data processing model is a directed graph, nodes in the directed graph comprise operation nodes, each having at least one parent node, and data source nodes, which have no parent node, and each node in the directed graph corresponds to a data set;
when an execution instruction which is sent by the client and carries a node list formed by a plurality of nodes in the data processing model object instance is received, for a first node in the node list, if input data of the first node comes from a parent node of the first node and the data set corresponding to the parent node of the first node has not been successfully processed, adding the parent node of the first node to the node list and processing it first; if the input data of the first node comes from the parent node of the first node and the data set corresponding to the parent node of the first node has been successfully processed, acquiring the output data set of the parent node of the first node from an execution context as the input data set of the first node, processing the input data set of the first node based on the data set corresponding to the first node to generate the output data set of the first node, and recording the output data set of the first node into the execution context; the first node is any one node in the node list.
In the above method, preferably, the obtaining, based on the data processing model description file sent by the client, the data processing model object instance corresponding to the data processing model description file includes:
converting the data processing model description file sent by the client into a first data processing model object instance;
judging, according to the unique identification code of the data processing model, whether a data processing model object instance has already been created for the data processing model description file;
if the data processing model description file is not created with a data processing model object instance, determining the first data processing model object instance as a data processing model object instance corresponding to the data processing model description file;
and if the data processing model description file has already created a data processing model object instance, merging the first data processing model object instance with a created second data processing model object instance corresponding to the data processing model description file to obtain the data processing model object instance corresponding to the data processing model description file.
In the above method, preferably, the merging the first data processing model object instance with the created second data processing model object instance corresponding to the data processing model description file includes:
comparing the first data processing model object instance with the second data processing model object instance;
for a second node in the first data processing model object instance and a third node in the second data processing model object instance, which has the same unique identification code as the second node, if the parameters in the data set corresponding to the second node are different from the parameters in the data set corresponding to the third node, updating the data set corresponding to the second node to the third node, and marking the third node as an unprocessed state;
if the first data processing model object instance has a fourth node and the second data processing model object instance does not contain the fourth node, inserting the fourth node into the second data processing model object instance and marking the fourth node in the second data processing model object instance as an unprocessed state;
if the second data processing model object instance has a fifth node and the first data processing model object instance does not contain the fifth node, deleting the fifth node in the second data processing model object instance and marking all child nodes of the fifth node as unprocessed states;
and marking the nodes of which the states of all the parent nodes are unprocessed states in the second data processing model object instance as unprocessed states.
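The merge rules above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Node` class, the dictionary-of-nodes representation, and the function name `merge` are all hypothetical, and the final propagation step follows the text literally (a node becomes unprocessed when all of its parents are unprocessed). Changes to edges between nodes that keep the same parameters are not handled here.

```python
from dataclasses import dataclass, field

DIRTY, CLEAR = "Dirty", "clear"  # state names as given in the description


@dataclass
class Node:
    node_id: str
    params: dict
    parents: list = field(default_factory=list)  # ids of parent nodes
    state: str = DIRTY


def merge(first: dict, second: dict) -> dict:
    """Update the existing instance `second` in place from the new instance `first`.

    Both arguments map node id -> Node (a hypothetical representation).
    """
    for node_id, new in first.items():
        old = second.get(node_id)
        if old is None:
            # Node exists only in the new instance: insert it as unprocessed.
            second[node_id] = Node(node_id, dict(new.params), list(new.parents), DIRTY)
        elif old.params != new.params:
            # Same identifier, different parameters: take the new parameters.
            old.params = dict(new.params)
            old.state = DIRTY

    for node_id in [n for n in second if n not in first]:
        # Node exists only in the old instance: delete it and mark its children.
        del second[node_id]
        for child in second.values():
            if node_id in child.parents:
                child.parents.remove(node_id)
                child.state = DIRTY

    # Propagate: a node whose parents are all unprocessed becomes unprocessed.
    changed = True
    while changed:
        changed = False
        for node in second.values():
            if (node.state != DIRTY and node.parents
                    and all(second[p].state == DIRTY for p in node.parents)):
                node.state = DIRTY
                changed = True
    return second
```

With this sketch, changing the parameters of one node leaves its unchanged ancestors in the successfully processed state, so only that node and its descendants are executed again.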
In the above method, preferably, the processing the input data set of the first node based on the data set corresponding to the first node, and the generating the output data set of the first node includes:
generating an operation function file corresponding to the first node based on the data set corresponding to the first node;
dynamically compiling the operation function file and loading a corresponding function object;
executing the function object on the input data set of the first node to generate an output data set of the first node.
In the above method, preferably, the generating an operation function file corresponding to the first node based on the data set corresponding to the first node includes:
reading the type and the parameters of the first node from the data set corresponding to the first node;
determining a program file template corresponding to the first node based on the type of the first node;
filling the parameters into the program file template to generate a program source file corresponding to the first node;
and compiling the program source file to obtain an operation function file corresponding to the first node.
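The template-filling and dynamic-loading steps above can be illustrated with a small Python stand-in. The source does not say which language or compiler the operation function file uses, so everything here is an assumption: the template texts, the `TEMPLATES` table, the placeholder names `$rule` and `$expr`, and the function name `build_function` are hypothetical, and Python's built-in `compile` and `exec` stand in for the dynamic compilation and loading described in the text.

```python
from string import Template

# Hypothetical program file templates, one per node type.
TEMPLATES = {
    "Filter": Template("def operate(rows):\n    return [x for x in rows if $rule]\n"),
    "Map": Template("def operate(rows):\n    return [$expr for x in rows]\n"),
}


def build_function(node_type: str, params: dict):
    """Fill the template for `node_type`, compile it, and return the function object."""
    # Steps 1-3: pick the template by node type and fill in the parameters.
    source = TEMPLATES[node_type].substitute(params)
    # Step 4: compile the generated source at run time.
    code = compile(source, f"<{node_type} node>", "exec")
    # Load the resulting function object from the compiled module namespace.
    namespace = {}
    exec(code, namespace)
    return namespace["operate"]


keep_large = build_function("Filter", {"rule": "x > 2"})
print(keep_large([1, 2, 3, 4]))  # [3, 4]
```

Because the parameters are spliced into source code and executed, a real system would need to validate or sandbox user input; this sketch does not.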
A data processing apparatus comprising:
an acquisition module, configured to acquire a data processing model object instance corresponding to a data processing model description file based on the data processing model description file sent by a client; the data processing model description file is obtained by converting a data processing model, the data processing model is a directed graph, nodes in the directed graph comprise operation nodes, each having at least one parent node, and data source nodes, which have no parent node, and each node in the directed graph corresponds to a data set;
a processing module, configured to, when receiving an execution instruction sent by the client and carrying a node list formed by a plurality of nodes in the data processing model object instance, for a first node in the node list, add the parent node of the first node to the node list and process it first if input data of the first node comes from the parent node of the first node and the data set corresponding to the parent node of the first node has not been successfully processed; and, if the input data of the first node comes from the parent node of the first node and the data set corresponding to the parent node of the first node has been successfully processed, acquire the output data set of the parent node of the first node from an execution context as the input data set of the first node, process the input data set of the first node based on the data set corresponding to the first node to generate the output data set of the first node, and record the output data set of the first node into the execution context; the first node is any one node in the node list.
In the above apparatus, preferably, the acquisition module includes:
a conversion submodule, configured to convert the data processing model description file sent by the client into a first data processing model object instance;
a judging submodule, configured to judge, according to the unique identification code of the data processing model, whether a data processing model object instance has already been created for the data processing model description file;
the determining submodule is used for determining the first data processing model object instance as a data processing model object instance corresponding to the data processing model description file if the data processing model description file has not created the data processing model object instance;
and the merging submodule is used for merging the first data processing model object instance and the created second data processing model object instance corresponding to the data processing model description file to obtain the data processing model object instance corresponding to the data processing model description file if the data processing model description file has created the data processing model object instance.
In the foregoing apparatus, preferably, the merging submodule includes:
a comparison unit for comparing the first data processing model object instance with the second data processing model object instance;
a first processing unit, configured to, for a second node in the first data processing model object instance and a third node in the second data processing model object instance that has a same unique identification code as the second node, update a data set corresponding to the second node to the third node if a parameter in the data set corresponding to the second node is different from a parameter in the data set corresponding to the third node, and mark the third node in an unprocessed state;
a second processing unit, configured to insert a fourth node into the second data processing model object instance and mark the fourth node in the second data processing model object instance as an unprocessed state if the first data processing model object instance has the fourth node and the second data processing model object instance does not include the fourth node;
a third processing unit, configured to delete a fifth node in the second data processing model object instance and mark all child nodes of the fifth node as an unprocessed state if the second data processing model object instance has the fifth node and the first data processing model object instance does not include the fifth node;
and a fourth processing unit, configured to mark, as an unprocessed state, the nodes in the second data processing model object instance whose parent nodes are all in the unprocessed state.
In the apparatus, preferably, in terms of processing the input dataset of the first node based on the dataset corresponding to the first node to generate the output dataset of the first node, the processing module is specifically configured to generate an operation function file corresponding to the first node based on the dataset corresponding to the first node; dynamically compiling the operation function file and loading a corresponding function object; executing the function object on the input data set of the first node to generate an output data set of the first node.
In the foregoing apparatus, preferably, in terms of generating an operation function file corresponding to the first node based on the dataset corresponding to the first node, the processing module is specifically configured to read the type and the parameter of the first node from the dataset corresponding to the first node; determining a program file template corresponding to the first node based on the type of the first node; filling the parameters into the program file template to generate a program source file corresponding to the first node; and compiling the program source file to obtain an operation function file corresponding to the first node.
According to the scheme, the data processing method and the data processing device provided by the application represent a data processing model by using a directed graph, when an instruction which is sent by a client and carries a node list is received, for any node in the node list, if a data set corresponding to a parent node of the node is not processed, a data set corresponding to the parent node of the node is preferentially processed, if the data set corresponding to the parent node of the node is processed, an output data set of the parent node is directly read from an execution context to serve as an input data set of the node, the input data set of the node is processed based on the data set corresponding to the node, the output data set of the node is generated, and the output data set of the node is recorded into the execution context. Therefore, the data processing method provided by the embodiment of the invention can realize that the data of only part of the nodes is processed without repeatedly processing the data set of the successfully processed nodes, thereby improving the data processing efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart of an implementation of a data processing method according to an embodiment of the present application;
Fig. 2 is an exemplary diagram of a data processing model provided by an embodiment of the present application;
Fig. 3 is a flowchart of an implementation of obtaining a data processing model object instance corresponding to a data processing model description file based on the data processing model description file sent by a client according to the embodiment of the present application;
Fig. 4 is a flowchart illustrating an implementation of processing an input data set of a first node based on a data set corresponding to the first node to generate an output data set of the first node according to the embodiment of the present application;
Fig. 5 is a flowchart of an implementation of generating an operation function file corresponding to a first node based on a data set corresponding to the first node according to the embodiment of the present application;
Fig. 6 is a flowchart of another implementation of a data processing method according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of an acquisition module according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of a merge sub-module provided in an embodiment of the present application.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be practiced otherwise than as specifically illustrated.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
The data processing method and device provided by the embodiment of the invention can be applied to a distributed computing system Spark to realize interactive processing of a data set.
Referring to fig. 1, fig. 1 is a flowchart of an implementation of a data processing method according to an embodiment of the present application, which may include:
step S11: acquiring a data processing model object instance corresponding to a data processing model description file based on the data processing model description file sent by a client;
The data processing model description file is obtained by converting the data processing model, and it describes the information of the data processing model graph in an agreed encoding. The data processing model is a directed graph; the nodes in the directed graph comprise operation nodes, each having at least one parent node, and data source nodes, which have no parent node; and each node in the directed graph corresponds to a data set.
in the embodiment of the invention, a user establishes a data processing model at a client according to a scene of data analysis, and the client converts the established data processing model into a data processing model description file and sends the data processing model description file to a server.
The data processing model is a directed graph. Fig. 2 is a diagram illustrating an exemplary data processing model according to an embodiment of the present invention. The directed graph is composed of a plurality of nodes, each node represents a data processing unit and comprises functional modules for acquiring input data, processing the input data (executing a section of data analysis logic on the input data), storing processing results and the like. The directed graph has at least one node as a source node (such a node does not depend on data of other nodes as input, but directly reads data from other external systems), and the other nodes use the processing result of the parent node as own input data according to the dependency relationship between the nodes described by the directed edge.
The directed graph comprises two types of nodes, wherein one type is a data source node without any parent node, such as nodes No. 1-3 in FIG. 2, and the other type is an operation node with at least one parent node, such as nodes No. 4-9 in FIG. 2. Moreover, each node in the directed graph corresponds to a data set. The parent node of node 5 is node 4, and node 5 is the parent node of node 6.
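The two node types can be told apart purely by whether a node has parents. The sketch below illustrates this for a nine-node graph like the one described for Fig. 2. Only the facts stated in the text are taken from the source (nodes 1 to 3 are data source nodes, nodes 4 to 9 are operation nodes, node 4 is the parent of node 5, and node 5 is the parent of node 6); all other edges in `PARENTS` are hypothetical, as are the names used.

```python
# Parent lists keyed by node number. Edges other than 4 -> 5 -> 6 are
# hypothetical, since the figure itself is not reproduced in the text.
PARENTS = {
    1: [], 2: [], 3: [],      # data source nodes: no parent
    4: [1, 2],                # hypothetical parents
    5: [4],                   # stated in the text
    6: [5],                   # stated in the text
    7: [3],                   # hypothetical parent
    8: [6, 7],                # hypothetical parents
    9: [8],                   # hypothetical parent
}


def is_data_source(node: int) -> bool:
    """A data source node has no parent; it reads from an external system."""
    return not PARENTS[node]


def children_of(node: int) -> list:
    """Nodes that take the output of `node` as their input."""
    return [n for n, parents in PARENTS.items() if node in parents]


sources = [n for n in PARENTS if is_data_source(n)]
operations = [n for n in PARENTS if not is_data_source(n)]
print(sources)     # [1, 2, 3]
print(operations)  # [4, 5, 6, 7, 8, 9]
```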
The data set corresponding to the node is used to generate a program file corresponding to the node. The data set corresponding to each node comprises: type information of the node, and user-configured node parameters. Wherein,
For an operation node of the set operation class, the node type may include: Map (one-to-one mapping), Filter (filtering), FlatMap (one-to-many mapping), Union (union), sample (sampling), intersection (intersection), distinct (removing duplicate records), reduceByKey (merging by key), join (joining by key), cartesian (Cartesian product), and subtract (difference set).
For an operation node of the import/export operation class, the node type may include: HDFSInput (import an HDFS file) and HDFSOutput (export to HDFS).
For an operation node of the mining algorithm class, the node types cover three major categories of algorithms: classification, clustering, and frequent-item mining, with each algorithm abstracted into one node.
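To make the meaning of several set operation node types concrete, the sketch below approximates them with plain Python lists. It is not Spark code and not the patent's implementation: the `OPERATIONS` table and `reduce_by_key` are hypothetical names, and list order is preserved here only for readability, whereas a distributed system generally gives no ordering guarantee.

```python
from itertools import product
from operator import add

# Plain-Python approximations of some node types listed above.
OPERATIONS = {
    "Map": lambda rows, f: [f(x) for x in rows],                  # one-to-one
    "Filter": lambda rows, pred: [x for x in rows if pred(x)],    # keep matches
    "FlatMap": lambda rows, f: [y for x in rows for y in f(x)],   # one-to-many
    "Union": lambda a, b: a + b,                                  # keeps duplicates
    "distinct": lambda rows: list(dict.fromkeys(rows)),           # drop duplicates
    "intersection": lambda a, b: [x for x in dict.fromkeys(a) if x in b],
    "subtract": lambda a, b: [x for x in a if x not in b],        # difference set
    "cartesian": lambda a, b: list(product(a, b)),                # all pairs
}


def reduce_by_key(pairs, f):
    """Merge the values of (key, value) pairs that share a key using `f`."""
    merged = {}
    for key, value in pairs:
        merged[key] = f(merged[key], value) if key in merged else value
    return list(merged.items())


print(OPERATIONS["FlatMap"](["a b", "c"], str.split))      # ['a', 'b', 'c']
print(reduce_by_key([("a", 1), ("b", 2), ("a", 3)], add))  # [('a', 4), ('b', 2)]
```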
Node parameters differ depending on the node type. For example, for an HDFSInput node, the parameters the user must configure include the path, file format, and file encoding of the input file; for a Filter node, the user must enter a data filtering rule and the like according to a graph.
In addition, in the data processing model, each node includes a status flag. The state of each node changes among Dirty, Running, clear and Error: Dirty indicates that the node has not been processed, Running indicates that the node is being processed, clear indicates that the node was processed successfully, and Error indicates that an error occurred while the node was being processed.
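The four states and the transitions between them can be sketched as a small state holder. The state names come from the text; the `NodeState` enum, the `Node` class, and its `run` method are hypothetical, and the sketch assumes that any exception raised by the node's work counts as a processing error.

```python
from enum import Enum


class NodeState(Enum):
    DIRTY = "Dirty"      # not processed yet
    RUNNING = "Running"  # currently being processed
    CLEAR = "clear"      # processed successfully
    ERROR = "Error"      # an error occurred during processing


class Node:
    def __init__(self, work):
        self.work = work              # callable that does the node's processing
        self.state = NodeState.DIRTY  # every node starts unprocessed
        self.result = None

    def run(self):
        """Execute the node's work and update the status flag accordingly."""
        self.state = NodeState.RUNNING
        try:
            self.result = self.work()
        except Exception:
            self.state = NodeState.ERROR
            return None
        self.state = NodeState.CLEAR
        return self.result
```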
In addition, after each node is executed successfully, the execution result of the node is also recorded into the execution context, so that the child nodes of the node use the output result of the node.
Optionally, after each node is successfully executed, the execution context of the node may be stored in a preset cache, so that the child node of the node reads the input data set from the cache, and the processing efficiency is further improved.
Step S12: when an execution instruction which is sent by the client and carries a node list consisting of a plurality of nodes in the data processing model object instance is received, processing the data sets corresponding to the nodes specified in the node list;
the execution instruction carrying the node list is triggered and generated after the user designates a node in the data processing model instance, the user may designate one node, may designate two or more nodes, and of course, the user may designate all nodes in the data processing model instance. The nodes included in the node list are the nodes designated by the user.
For convenience of description, any node in the node list is denoted as a first node. If the input data of the first node comes from a parent node of the first node and the data set corresponding to the parent node has not been successfully processed, the parent node of the first node is added to the node list and processed first. If the input data of the first node comes from the parent node of the first node and the data set corresponding to the parent node has been successfully processed, the output data set of the parent node is acquired from the execution context as the input data set of the first node, the input data set of the first node is processed based on the data set corresponding to the first node to generate the output data set of the first node, and the output data set of the first node is recorded into the execution context.
The execution instruction sent by the client comprises a node list, and the nodes contained in the node list are part or all of the nodes in the data processing model instance.
For a first node in the node list, if the input of the first node is the output of its parent node, it is first determined whether the parent node has been successfully processed. If the parent node has not been successfully processed (which includes not yet processed, currently being processed, and failed with an error during processing), the parent node is processed first, and the first node is processed after the parent node has been successfully processed. If the parent node has already been successfully processed, the output data set of the parent node is read directly from the execution context, and the parent node does not need to be processed again.
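The scheduling rule described above can be sketched as a worklist loop. This is an illustration under stated assumptions rather than the patented implementation: `run_nodes`, `parents`, `operations`, `states`, and `context` are hypothetical names, the graph is assumed to be acyclic, and a requested node that is already in the successfully processed state is assumed to be skipped.

```python
DIRTY, CLEAR = "Dirty", "clear"  # state names as given in the description


def run_nodes(node_list, parents, operations, states, context):
    """Process the requested nodes, running unprocessed parents first.

    Returns the ids of the nodes that were actually executed, in order.
    """
    worklist = list(node_list)
    executed = []
    while worklist:
        node = worklist[0]
        if states.get(node, DIRTY) == CLEAR:
            worklist.pop(0)  # already processed: nothing to redo
            continue
        pending = [p for p in parents[node] if states.get(p, DIRTY) != CLEAR]
        if pending:
            worklist[0:0] = pending  # put the parents ahead of this node
            continue
        # All parents are done: read their outputs from the execution context.
        inputs = [context[p] for p in parents[node]]
        context[node] = operations[node](*inputs)
        states[node] = CLEAR
        executed.append(node)
        worklist.pop(0)
    return executed


# Hypothetical three-node model: source -> double -> total.
parents = {"src": [], "double": ["src"], "total": ["double"]}
operations = {
    "src": lambda: [1, 2, 3],
    "double": lambda rows: [2 * x for x in rows],
    "total": lambda rows: sum(rows),
}
states, context = {}, {}
print(run_nodes(["total"], parents, operations, states, context))
# ['src', 'double', 'total']
```

If only `total` is later marked unprocessed and requested again, the loop executes `total` alone and reads the cached output of `double` from `context`, which is the saving the method aims for.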
In the data processing method provided in the embodiment of the present invention, a data processing model is represented by a directed graph, and when an instruction carrying a node list sent by a client is received, for any node in the node list, if a data set corresponding to a parent node of the node is not processed, a data set corresponding to the parent node of the node is preferentially processed, if a data set corresponding to the parent node of the node is processed, an output data set of the parent node is directly read from an execution context as an input data set of the node, the input data set of the node is processed based on the data set corresponding to the node, an output data set of the node is generated, and the output data set of the node is recorded in the execution context. Therefore, the data processing method provided by the embodiment of the invention can realize that the data of only part of the nodes is processed without repeatedly processing the data set of the successfully processed nodes, thereby improving the data processing efficiency.
Optionally, an implementation flowchart for obtaining a data processing model object instance corresponding to a data processing model description file based on the data processing model description file sent by a client according to the embodiment of the present invention is shown in fig. 3, and may include:
step S31: converting a data processing model description file sent by a client into a first data processing model object instance;
in the embodiment of the invention, after the data processing model description file sent by the client is received, it is converted into a data processing model object instance (denoted as the first data processing model object instance for convenience of description).
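The conversion in step S31 can be sketched as parsing a description file into objects. The source only says the file uses an agreed encoding, so the JSON layout, the field names (`id`, `nodes`, `type`, `params`, `parents`), the sample path, and the `Model` and `Node` classes below are all assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    node_type: str
    params: dict
    parents: list
    state: str = "Dirty"  # a freshly converted node has not been processed


@dataclass
class Model:
    model_id: str                       # unique identification code of the model
    nodes: dict = field(default_factory=dict)


def parse_description(text: str) -> Model:
    """Convert a (hypothetical) JSON description file into a model object instance."""
    raw = json.loads(text)
    model = Model(raw["id"])
    for item in raw["nodes"]:
        model.nodes[item["id"]] = Node(
            item["id"], item["type"], item.get("params", {}), item.get("parents", [])
        )
    return model


# Hypothetical description of a two-node model.
DESCRIPTION = """
{
  "id": "model-1",
  "nodes": [
    {"id": "n1", "type": "HDFSInput", "params": {"path": "/data/in", "format": "text"}},
    {"id": "n2", "type": "Filter", "params": {"rule": "x > 2"}, "parents": ["n1"]}
  ]
}
"""

first_instance = parse_description(DESCRIPTION)
```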
Step S32: judging, according to the unique identification code of the data processing model, whether a data processing model object instance has already been created for the data processing model description file;
in the embodiment of the present invention, each data processing model has a unique identifier, such as a UUID (Universally Unique Identifier). After the data processing model description file is converted into a data processing model object instance, a correspondence between the unique identifier and the data processing model object instance can be established.
If an already created data processing model object instance has the same unique identification code as the first data processing model object instance, it is determined that a data processing model object instance has already been created for the data processing model description file; otherwise, it is determined that none has been created.
Step S33: if no data processing model object instance has been created for the data processing model description file, determining the first data processing model object instance as the data processing model object instance corresponding to the data processing model description file;
step S34: if a data processing model object instance has already been created for the data processing model description file, merging the first data processing model object instance with the created data processing model object instance corresponding to the data processing model description file (referred to as the second data processing model object instance for convenience of description) to obtain the data processing model object instance corresponding to the data processing model description file.
If a data processing model object instance has already been created for the data processing model description file, this indicates that the user has modified the data processing model, so the data processing model object instance corresponding to the data processing model description file needs to be updated.
Merging the first data processing model object instance with the created second data processing model object instance corresponding to the data processing model description file specifically comprises updating the second data processing model object instance according to the first data processing model object instance.
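As a rough illustration of steps S31 to S34, the following Python sketch keeps created instances in a dictionary keyed by the model's unique identification code. The dictionary layout of the description file, the `obtain_instance` function, and the placeholder `overwrite_nodes` merge are hypothetical and stand in for whatever format and merge procedure an implementation actually uses.

```python
from typing import Callable, Dict


def obtain_instance(description: dict, registry: Dict[str, dict],
                    merge: Callable[[dict, dict], None]) -> dict:
    # S31: convert the description into a first object instance (here: a plain dict copy)
    first = {"uuid": description["uuid"], "nodes": dict(description["nodes"])}
    # S32: look up an existing instance by the model's unique identification code
    second = registry.get(first["uuid"])
    if second is None:
        registry[first["uuid"]] = first   # S33: no instance yet, keep the first one
        return first
    merge(first, second)                  # S34: update the existing instance in place
    return second


def overwrite_nodes(first: dict, second: dict) -> None:
    """Placeholder merge (assumption): copy node entries from first into second."""
    second["nodes"].update(first["nodes"])


registry: Dict[str, dict] = {}
a = obtain_instance({"uuid": "m-1", "nodes": {"n1": {"min": 1}}}, registry, overwrite_nodes)
b = obtain_instance({"uuid": "m-1", "nodes": {"n1": {"min": 5}}}, registry, overwrite_nodes)
```

The second call returns the same stored object as the first, now carrying the modified parameters, which mirrors the idea that the created instance is updated rather than replaced.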
Optionally, an implementation of merging the first data processing model object instance with the created second data processing model object instance corresponding to the data processing model description file according to the embodiment of the present invention may be:
comparing the first data processing model object instance with the second data processing model object instance;
the comparison determines whether any node in the first data processing model object instance differs from the node having the same unique identification code in the second data processing model object instance, and whether the first data processing model object instance has nodes added or removed relative to the second data processing model object instance.
For a second node in the first data processing model object instance and a third node in the second data processing model object instance that has the same unique identification code as the second node, if the parameters in the data set corresponding to the second node are different from the parameters in the data set corresponding to the third node, updating the data set corresponding to the third node with the data set corresponding to the second node, and marking the third node as being in the unprocessed state;
and if the parameters in the data set corresponding to the second node are the same as the parameters in the data set corresponding to the third node, the data set corresponding to the third node is not modified.
If the first data processing model object instance has a fourth node and the second data processing model object instance does not contain the fourth node, inserting the fourth node into the second data processing model object instance and marking the fourth node in the second data processing model object instance as being in the unprocessed state;
a fourth node that is present in the first data processing model object instance but absent from the second data processing model object instance is a node that the user added while modifying the data processing model.
If the second data processing model object instance has a fifth node and the first data processing model object instance does not contain the fifth node, deleting the fifth node from the second data processing model object instance and marking all child nodes of the fifth node as being in the unprocessed state;
a fifth node that is present in the second data processing model object instance but absent from the first data processing model object instance indicates that the user deleted the fifth node while modifying the data processing model.
And marking the nodes of which the states of all the parent nodes are unprocessed states in the second data processing model object instance as unprocessed states.
After the update, insertion and deletion operations have been carried out on the nodes, the second data processing model object instance is traversed starting from the data source nodes, and the nodes of which the states of all parent nodes are unprocessed states are marked as unprocessed. That is, for any node (for convenience of description, referred to as a sixth node) in the second data processing model object instance, if the parent node of the sixth node is in the unprocessed state, the sixth node is also marked as being in the unprocessed state.
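The merge rules above can be sketched in Python as follows. The `Node` class and `merge` function are hypothetical. Two points are assumptions rather than statements from the text: the edges of surviving nodes are taken from the newer instance, and a node is marked unprocessed as soon as any one of its parents is unprocessed, which is a conservative reading of the propagation rule.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    params: dict
    parents: List[str] = field(default_factory=list)
    processed: bool = True


def merge(first: Dict[str, Node], second: Dict[str, Node]) -> None:
    """Update `second` (the stored instance) in place from `first` (the new one)."""
    # Deleted nodes: present in second, absent from first; their children become unprocessed.
    for node_id in [n for n in second if n not in first]:
        del second[node_id]
        for child in second.values():
            if node_id in child.parents:
                child.processed = False
    for node_id, new in first.items():
        old = second.get(node_id)
        if old is None:
            # Inserted node: copy it in and mark it unprocessed.
            second[node_id] = Node(dict(new.params), list(new.parents), processed=False)
            continue
        if old.params != new.params:
            # Changed parameters: take the new data set and mark the node unprocessed.
            old.params = dict(new.params)
            old.processed = False
        old.parents = list(new.parents)  # assumption: edges follow the newer instance
    # Propagate the unprocessed state downstream until nothing changes.
    changed = True
    while changed:
        changed = False
        for node in second.values():
            if node.processed and any(not second[p].processed for p in node.parents):
                node.processed = False
                changed = True


second = {
    "src": Node({}),
    "filt": Node({"min": 1}, ["src"]),
    "agg": Node({}, ["filt"]),
    "side": Node({}, ["src"]),
}
first = {
    "src": Node({}),
    "filt": Node({"min": 5}, ["src"]),   # the user changed this parameter
    "agg": Node({}, ["filt"]),
    "side": Node({}, ["src"]),
}
merge(first, second)
```

After the merge, `filt` and its descendant `agg` are unprocessed, while `src` and `side` keep their processed state and therefore do not need to be executed again.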
Optionally, an implementation flowchart for processing an input data set of a first node based on a data set corresponding to the first node to generate an output data set of the first node is shown in fig. 4, and may include:
step S41: generating an operation function file corresponding to the first node based on the data set corresponding to the first node;
step S42: dynamically compiling the generated operation function file and loading a corresponding function object;
step S43: the function object is executed on the input dataset of the first node, generating an output dataset of the first node.
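Steps S42 and S43 can be illustrated with Python's built-in `compile` and `exec`, used here only as a stand-in for whatever dynamic compilation mechanism the implementation relies on. The `load_function` helper, the function name `op`, and the example source text are hypothetical.

```python
from typing import Callable


def load_function(source: str, name: str) -> Callable:
    """Compile source text at run time and return the named function object."""
    code = compile(source, f"<generated:{name}>", "exec")  # S42: dynamic compilation
    namespace: dict = {}
    exec(code, namespace)                                  # S42: load the definitions
    return namespace[name]


# Hypothetical generated operation: keep rows greater than 2.
source = "def op(rows):\n    return [r for r in rows if r > 2]\n"
op = load_function(source, "op")
output = op([1, 2, 3, 4])  # S43: execute the function object on the input data set
```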
Optionally, an implementation flowchart for generating an operation function file corresponding to the first node based on the data set corresponding to the first node, provided by the embodiment of the present invention, is shown in fig. 5, and may include:
step S51: reading the type and parameters of the first node from a data set corresponding to the first node;
step S52: determining a program file template corresponding to the first node based on the type of the first node;
in the embodiment of the invention, different node types correspond to different program file templates.
Step S53: filling the parameters into the determined program file template to generate a program source file corresponding to the first node;
step S54: compiling the generated program source file to obtain an operation function file corresponding to the first node.
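Steps S51 to S54 amount to template-based code generation. The sketch below uses `string.Template` from the Python standard library; the node types `filter` and `scale`, their parameter names, and the layout of the node dictionary are invented for illustration and are not taken from the source.

```python
from string import Template

# Hypothetical program file templates, one per node type (S52).
TEMPLATES = {
    "filter": Template("def op(rows):\n    return [r for r in rows if r $cmp $value]\n"),
    "scale": Template("def op(rows):\n    return [r * $factor for r in rows]\n"),
}


def generate_source(node: dict) -> str:
    """S51-S53: read type and parameters, pick the template, fill it in."""
    template = TEMPLATES[node["type"]]
    return template.substitute(node["params"])


def build_operation(node: dict):
    """S54: compile the generated source into a code object."""
    return compile(generate_source(node), "<generated>", "exec")


node = {"type": "filter", "params": {"cmp": ">", "value": 2}}
source = generate_source(node)
namespace: dict = {}
exec(build_operation(node), namespace)
result = namespace["op"]([1, 2, 3, 4])
```

Keeping one template per node type means a new operation can be supported by adding a template rather than changing the execution logic.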
Optionally, as shown in fig. 6, another implementation flowchart of the data processing method provided in the embodiment of the present invention may include:
step S61: receiving a data processing model description file sent by a client, and converting the data processing model description file into a data processing model object instance M';
step S62: determining, according to the unique identification code of M', whether a data processing model object instance has already been created; if not, go to step S63; if yes, go to step S64;
step S63: saving the starting storage address of the data processing model object instance M' in a Map data structure, and then executing step S65;
step S64: finding the recorded data processing model object instance M in the Map, executing the merging algorithm on M and M' so that the information carried in M' is updated into M, and then executing step S65;
step S65: receiving an execution instruction sent by a user through a client, wherein the execution instruction comprises a node list consisting of a plurality of nodes in M, and the list specifies all nodes that must be executed this time;
step S66: determining whether Spark resources have already been requested for executing the data processing model; if not, go to step S67; if so, go to step S68;
step S67: requesting computing resources from the Spark cluster, and then executing step S68;
step S68: processing the nodes in the node list; for each node in the node list, if the input of the node is the output of the parent node of the node, determining whether the parent node of the node has been successfully processed; if the parent node of the node has not been successfully processed, processing the parent node of the node first, and processing the node after the parent node of the node has been successfully processed; if the parent node of the node has been successfully processed, reading the output data set of the parent node of the node directly from the execution context without processing the parent node again;
step S69: determining whether the client has sent a termination signal (the termination signal is triggered by a user at the client); if yes, go to step S610; otherwise, return to step S61;
step S610: sending a signal to Spark to release the computing resources.
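The resource handling in steps S66, S67, S69 and S610 follows an acquire-once, release-on-termination pattern, sketched below. `FakeCluster` is a made-up stand-in and does not correspond to any real Spark API; the shape of each round (a dictionary with `node_list` and an optional `terminate` flag) is also an assumption, and the sketch releases resources even if the rounds run out without a termination signal.

```python
from typing import Callable, Iterable, List


class FakeCluster:
    """Hypothetical stand-in for a compute cluster; records calls for inspection."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def acquire(self) -> object:
        self.events.append("acquire")
        return object()

    def release(self, handle: object) -> None:
        self.events.append("release")


def serve(rounds: Iterable[dict], cluster: FakeCluster,
          execute: Callable[[List[str]], None]) -> None:
    handle = None
    for rnd in rounds:
        if handle is None:             # S66: have resources been requested yet?
            handle = cluster.acquire() # S67: request them once
        execute(rnd["node_list"])      # S68: process the listed nodes
        if rnd.get("terminate"):       # S69: termination signal from the client
            break
    if handle is not None:
        cluster.release(handle)        # S610: release the computing resources


cluster = FakeCluster()
executed: List[List[str]] = []
serve(
    [{"node_list": ["a"]}, {"node_list": ["b"], "terminate": True}, {"node_list": ["c"]}],
    cluster,
    executed.append,
)
```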
Corresponding to the method embodiment, an embodiment of the present invention further provides a data processing apparatus, and a schematic structural diagram of the data processing apparatus according to the embodiment of the present invention is shown in fig. 7, and the data processing apparatus may include:
an acquisition module 71 and a processing module 72; wherein,
the acquisition module 71 is configured to obtain a data processing model object instance corresponding to the data processing model description file based on the data processing model description file sent by the client; the data processing model description file is obtained by converting a data processing model, the data processing model is a directed graph, nodes in the directed graph comprise operation nodes comprising at least one parent node and data source nodes not comprising any parent node, and each node in the directed graph corresponds to a data set;
the processing module 72 is configured to, when receiving an execution instruction sent by a client and carrying a node list formed by a plurality of nodes in the data processing model object instance, for a first node in the node list, add the parent node of the first node into the node list and process it preferentially if input data of the first node comes from the parent node of the first node and the data set corresponding to the parent node of the first node has not been successfully processed; if the input data of the first node comes from the parent node of the first node and the data set corresponding to the parent node of the first node has been successfully processed, acquire the output data set of the parent node of the first node from the execution context as the input data set of the first node, process the input data set of the first node based on the data set corresponding to the first node to generate the output data set of the first node, and record the output data set of the first node into the execution context; the first node is any one node in the node list.
The data processing apparatus according to the embodiment of the present invention uses a directed graph to represent the data processing model. When an instruction carrying a node list is received from a client, each node in the node list is handled as follows: if the data set corresponding to a parent node of the node has not been processed, the data set corresponding to that parent node is processed first; if the data set corresponding to the parent node has already been processed, the output data set of the parent node is read directly from the execution context as the input data set of the node. The input data set of the node is then processed based on the data set corresponding to the node to generate the output data set of the node, which is recorded in the execution context. The data processing apparatus provided by the embodiment of the invention can therefore process only some of the nodes, without reprocessing the data sets of nodes that have already been successfully processed, which improves data processing efficiency.
Optionally, a schematic structural diagram of the acquisition module 71 provided in the embodiment of the present invention is shown in fig. 8, and may include:
a conversion submodule 81, a judgment submodule 82, a determination submodule 83 and a merging submodule 84; wherein,
the conversion submodule 81 is configured to convert the data processing model description file sent by the client into a first data processing model object instance;
the judgment submodule 82 is configured to determine, according to the unique identification code of the data processing model, whether a data processing model object instance has already been created for the data processing model description file;
the determining submodule 83 is configured to determine, if the data processing model description file has not created a data processing model object instance, the first data processing model object instance as the data processing model object instance corresponding to the data processing model description file;
the merging submodule 84 is configured to, if the data processing model description file has created a data processing model object instance, merge the first data processing model object instance with a second data processing model object instance that has been created and corresponds to the data processing model description file, so as to obtain a data processing model object instance that corresponds to the data processing model description file.
Optionally, a schematic structural diagram of the merge sub-module 84 provided in the embodiment of the present invention is shown in fig. 9, and may include:
a comparison unit 91, a first processing unit 92, a second processing unit 93, a third processing unit 94, and a fourth processing unit 95; wherein,
the comparison unit 91 is configured to compare the first data processing model object instance with the second data processing model object instance;
the first processing unit 92 is configured to, for a second node in the first data processing model object instance and a third node in the second data processing model object instance that has the same unique identification code as the second node, update the data set corresponding to the third node with the data set corresponding to the second node if a parameter in the data set corresponding to the second node is different from a parameter in the data set corresponding to the third node, and mark the third node as being in an unprocessed state;
the second processing unit 93 is configured to insert the fourth node into the second data processing model object instance and mark the fourth node in the second data processing model object instance as an unprocessed state if the first data processing model object instance has the fourth node and the second data processing model object instance does not contain the fourth node;
the third processing unit 94 is configured to delete the fifth node in the second data processing model object instance and mark all child nodes of the fifth node as an unprocessed state if the second data processing model object instance has the fifth node and the first data processing model object instance does not contain the fifth node;
the fourth processing unit 95 is configured to mark, in the second data processing model object instance, all nodes whose parent nodes are in an unprocessed state as an unprocessed state.
Optionally, in the aspect that the input data set of the first node is processed based on the data set corresponding to the first node to generate the output data set of the first node, the processing module 72 is specifically configured to generate an operation function file corresponding to the first node based on the data set corresponding to the first node; dynamically compiling the operation function file and loading a corresponding function object; the function object is executed on the input dataset of the first node, generating an output dataset of the first node.
Optionally, in the aspect of generating an operation function file corresponding to the first node based on the data set corresponding to the first node, the processing module 72 is specifically configured to read the type and the parameter of the first node from the data set corresponding to the first node; determining a program file template corresponding to the first node based on the type of the first node; filling the parameters into a program file template to generate a program source file corresponding to the first node; and compiling the program source file to obtain an operation function file corresponding to the first node.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and features disclosed herein.
Claims (8)
1. A data processing method, comprising:
converting a data processing model description file sent by a client into a first data processing model object instance;
judging whether a data processing model object instance has been created for the data processing model description file;
if the data processing model description file is not created with a data processing model object instance, determining the first data processing model object instance as a data processing model object instance corresponding to the data processing model description file;
if the data processing model description file has created a data processing model object instance, merging the first data processing model object instance with a created second data processing model object instance corresponding to the data processing model description file to obtain a data processing model object instance corresponding to the data processing model description file; the data processing model description file is obtained by converting a data processing model, the data processing model is a directed graph, nodes in the directed graph comprise an operation node comprising at least one parent node and a data source node not comprising any parent node, and each node in the directed graph corresponds to a data set;
when an execution instruction which is sent by a client and carries a node list formed by a plurality of nodes in the data processing model object instance is received, for a first node in the node list, if input data of the first node comes from a parent node of the first node and a data set corresponding to the parent node of the first node is not successfully processed, adding the parent node of the first node into the node list and preferentially processing the parent node; if the input data of the first node comes from the parent node of the first node and the data set corresponding to the parent node of the first node is successfully processed, acquiring the output data set of the parent node of the first node from an execution context as the input data set of the first node, processing the input data set of the first node based on the data set corresponding to the first node to generate the output data set of the first node, and recording the output data set of the first node into the execution context; the first node is any one node in the node list.
2. The method of claim 1, wherein merging the first data processing model object instance with the created second data processing model object instance corresponding to the data processing model description file comprises:
comparing the first data processing model object instance with the second data processing model object instance;
for a second node in the first data processing model object instance and a third node in the second data processing model object instance, which has the same unique identification code as the second node, if the parameters in the data set corresponding to the second node are different from the parameters in the data set corresponding to the third node, updating the data set corresponding to the second node to the third node, and marking the third node as an unprocessed state;
if the first data processing model object instance has a fourth node and the second data processing model object instance does not contain the fourth node, inserting the fourth node into the second data processing model object instance and marking the fourth node in the second data processing model object instance as an unprocessed state;
if the second data processing model object instance has a fifth node and the first data processing model object instance does not contain the fifth node, deleting the fifth node in the second data processing model object instance and marking all child nodes of the fifth node as unprocessed states;
and marking the nodes of which the states of all the parent nodes are unprocessed states in the second data processing model object instance as unprocessed states.
3. The method of claim 1, wherein processing the input data set of the first node based on the data set corresponding to the first node to generate the output data set of the first node comprises:
generating an operation function file corresponding to the first node based on the data set corresponding to the first node;
dynamically compiling the operation function file and loading a corresponding function object;
executing the function object on the input data set of the first node to generate an output data set of the first node.
4. The method of claim 3, wherein generating the operation function file corresponding to the first node based on the data set corresponding to the first node comprises:
reading the type and the parameters of the first node from the data set corresponding to the first node;
determining a program file template corresponding to the first node based on the type of the first node;
filling the parameters into the program file template to generate a program source file corresponding to the first node;
and compiling the program source file to obtain an operation function file corresponding to the first node.
5. A data processing apparatus, comprising:
an acquisition module, comprising: a conversion submodule, configured to convert the data processing model description file sent by the client into a first data processing model object instance; a judging submodule, configured to judge whether a data processing model object instance has been created for the data processing model description file; a determining submodule, configured to determine the first data processing model object instance as a data processing model object instance corresponding to the data processing model description file if the data processing model description file has not created the data processing model object instance; a merging submodule, configured to merge the first data processing model object instance with a created second data processing model object instance corresponding to the data processing model description file if the data processing model description file has created a data processing model object instance, to obtain a data processing model object instance corresponding to the data processing model description file; the data processing model description file is obtained by converting a data processing model, the data processing model is a directed graph, nodes in the directed graph comprise an operation node comprising at least one parent node and a data source node not comprising any parent node, and each node in the directed graph corresponds to a data set;
a processing module, configured to, when receiving an execution instruction sent by a client and carrying a node list formed by a plurality of nodes in the data processing model object instance, add, for a first node in the node list, a parent node of the first node to the node list and perform preferential processing if input data of the first node is from the parent node of the first node and a data set corresponding to the parent node of the first node is not successfully processed; if the input data of the first node comes from the parent node of the first node and the data set corresponding to the parent node of the first node is successfully processed, acquiring the output data set of the parent node of the first node from an execution context as the input data set of the first node, processing the input data set of the first node based on the data set corresponding to the first node to generate the output data set of the first node, and recording the output data set of the first node into the execution context; the first node is any one node in the node list.
6. The apparatus of claim 5, wherein the merge sub-module comprises:
a comparison unit for comparing the first data processing model object instance with the second data processing model object instance;
a first processing unit, configured to, for a second node in the first data processing model object instance and a third node in the second data processing model object instance that has the same unique identification code as the second node, update a data set corresponding to the second node to the third node if a parameter in the data set corresponding to the second node is different from a parameter in the data set corresponding to the third node, and mark the third node as being in an unprocessed state;
a second processing unit, configured to insert a fourth node into the second data processing model object instance and mark the fourth node in the second data processing model object instance as an unprocessed state if the first data processing model object instance has the fourth node and the second data processing model object instance does not include the fourth node;
a third processing unit, configured to delete a fifth node in the second data processing model object instance and mark all child nodes of the fifth node as an unprocessed state if the second data processing model object instance has the fifth node and the first data processing model object instance does not include the fifth node;
and a fourth processing unit, configured to mark the nodes of which the states of all the parent nodes are unprocessed states in the second data processing model object instance as unprocessed states.
7. The apparatus according to claim 5, wherein in processing the input dataset of the first node based on the dataset corresponding to the first node to generate the output dataset of the first node, the processing module is specifically configured to generate an operation function file corresponding to the first node based on the dataset corresponding to the first node; dynamically compiling the operation function file and loading a corresponding function object; executing the function object on the input data set of the first node to generate an output data set of the first node.
8. The apparatus according to claim 7, wherein in generating the operation function file corresponding to the first node based on the dataset corresponding to the first node, the processing module is specifically configured to read the type and parameters of the first node from the dataset corresponding to the first node; determining a program file template corresponding to the first node based on the type of the first node; filling the parameters into the program file template to generate a program source file corresponding to the first node; and compiling the program source file to obtain an operation function file corresponding to the first node.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610098936.1A CN105573836B (en) | 2016-02-23 | 2016-02-23 | Data processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105573836A (en) | 2016-05-11 |
CN105573836B (en) | 2018-12-28 |
Family
ID=55884006
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610098936.1A Active CN105573836B (en) | 2016-02-23 | 2016-02-23 | Data processing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105573836B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109743202B (en) * | 2018-12-26 | 2022-04-15 | 中国联合网络通信集团有限公司 | Data management method, device and equipment and readable storage medium |
CN112598506A (en) * | 2020-12-25 | 2021-04-02 | 中国农业银行股份有限公司 | Method for determining false mortgage user and related device |
CN113434323A (en) * | 2021-06-28 | 2021-09-24 | 浙江大华技术股份有限公司 | Task flow control method of data center station and related device |
CN113918126B (en) * | 2021-09-14 | 2022-06-10 | 北京柏睿数据技术股份有限公司 | AI modeling flow arrangement method and system based on graph algorithm |
CN114840265A (en) * | 2022-03-23 | 2022-08-02 | 阿里巴巴(中国)有限公司 | Data processing method based on executable graph |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1500228A1 (en) * | 2002-04-30 | 2005-01-26 | Nokia Corporation | Method and device for management of tree data exchange |
CN102819536A (en) * | 2011-09-27 | 2012-12-12 | 金蝶软件(中国)有限公司 | Processing method and device of tree type data |
CN103049580A (en) * | 2013-01-17 | 2013-04-17 | 北京工商大学 | Method and device for visualization of layering data |
CN104281681A (en) * | 2014-10-07 | 2015-01-14 | 北京工商大学 | Tetragonal ordered tree map layout method for hierarchical data |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7418454B2 (en) * | 2004-04-16 | 2008-08-26 | Microsoft Corporation | Data overlay, self-organized metadata overlay, and application level multicasting |
CN104714947A (en) * | 2013-12-11 | 2015-06-17 | 深圳市腾讯计算机系统有限公司 | Preset type number recognition method and device |
US9946809B2 (en) * | 2014-04-09 | 2018-04-17 | Introspective Systems LLC | Executable graph framework for the management of complex systems |
JP6007430B2 (en) * | 2015-05-20 | 2016-10-12 | 大澤 昇平 | Machine learning model design support device, machine learning model design support method, program for machine learning model design support device |
CN104955068B (en) * | 2015-06-18 | 2018-04-13 | 湖南大学 | A kind of data aggregate transmission method based on association mode |
CN105117468B (en) * | 2015-08-28 | 2019-05-28 | 广州酷狗计算机科技有限公司 | A kind of network data processing method and device |
Also Published As
Publication number | Publication date |
---|---|
CN105573836A (en) | 2016-05-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105573836B (en) | Data processing method and device | |
US10579344B2 (en) | Converting visual diagrams into code | |
EP3502896B1 (en) | Generation of an adapters configuration user interface using a data structure | |
US20190129734A1 (en) | Data collection workflow extension | |
US11726802B2 (en) | Robust user interface related robotic process automation | |
CN111488174A (en) | Method and device for generating application program interface document, computer equipment and medium | |
US7640538B2 (en) | Virtual threads in business process programs | |
CN108345691B (en) | Data source general processing framework construction method, data source processing method and device | |
CN112615758B (en) | Application identification method, device, equipment and storage medium | |
WO2023087721A1 (en) | Service processing model generation method and apparatus, and electronic device and storage medium | |
CN113377342B (en) | Project construction method and device, electronic equipment and storage medium | |
Sajnani | Automatic software architecture recovery: A machine learning approach | |
US10169725B2 (en) | Change-request analysis | |
CN108845862A (en) | Multi-container management method and device | |
CN112597105A (en) | Processing method of file associated object, server side equipment and storage medium | |
US9442698B2 (en) | Migration between model elements of different types in a modeling environment | |
CN107122359A (en) | Data real-time tracking visible processing method and device | |
CN105426676A (en) | Drilling data processing method and system | |
CN111277650B (en) | Automatic micro-service identification method combining functional indexes and non-functional indexes | |
Johannsen et al. | Supporting knowledge elicitation and analysis for business process improvement through a modeling tool | |
CN114510419A (en) | Performance analysis programming framework, method and apparatus | |
US8495033B2 (en) | Data processing | |
CN109388398A (en) | Virtualization system median surface generation method and device | |
CN110968566A (en) | Migration tool-based domestic application system migration method | |
CN113391812A (en) | Analysis method and device of application program module and analysis tool |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||