CN105573836B - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
CN105573836B
Authority
CN
China
Prior art keywords
node
data processing
processing model
object instance
model object
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610098936.1A
Other languages
Chinese (zh)
Other versions
CN105573836A (en)
Inventor
刘志丹
王鑫毅
刘龙
曹震
于雪龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Agricultural Bank of China
Original Assignee
Agricultural Bank of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agricultural Bank of China
Priority to CN201610098936.1A
Publication of CN105573836A
Application granted
Publication of CN105573836B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/48 Indexing scheme relating to G06F9/48
    • G06F2209/483 Multiproc

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the invention discloses a data processing method and device in which a data processing model is represented as a directed graph. When an instruction carrying a node list is received from a client, then for any node in the node list: if the data set corresponding to the node's parent node has not been processed, the parent node's data set is processed first; if it has been processed, the parent node's output data set is read directly from the execution context as the node's input data set, the input data set is processed based on the data set corresponding to the node to generate the node's output data set, and that output data set is recorded in the execution context. The method therefore does not reprocess the data sets of nodes that have already been processed successfully and can process the data of only some of the nodes, which improves data processing efficiency.

Description

Data processing method and device
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data processing method and apparatus.
Background
Spark is an efficient distributed computing system that can perform data mining and analysis on terabyte (TB)-scale data. To process data with Spark, an analyst must know one of three languages: Java, Scala, or Python. Typically, the analyst implements a data analysis scenario as a fixed program in one of these languages, compiles the program into a machine-recognizable file, and has the Java virtual machine load, interpret, and execute that file.
However, in a data analysis scenario an analyst often has no clear analysis approach at the outset. The analyst tries various statistical algorithms on the data and, drawing on experience, eventually fixes the most effective or most interpretable analysis process. Along the way the analyst changes the program many times, and every change requires the program file to be compiled and executed again. This is inconvenient in two respects. First, modifying, compiling, and executing each program file takes time. Second, re-executing the program re-executes every node in the data processing flow; in a big-data setting a program run is very time-consuming, so the analyst wastes a great deal of time waiting for the results of the modified program. Overall, data processing efficiency is low.
Therefore, how to improve the data processing efficiency becomes an urgent problem to be solved.
Disclosure of Invention
The invention aims to provide a data processing method and a data processing device so as to improve the data processing efficiency.
In order to achieve the purpose, the invention provides the following technical scheme:
a method of data processing, comprising:
acquiring a data processing model object instance corresponding to a data processing model description file based on the data processing model description file sent by a client; the data processing model description file is obtained by converting a data processing model, the data processing model is a directed graph, nodes in the directed graph comprise an operation node comprising at least one father node and a data source node not comprising any father node, and each node in the directed graph corresponds to a data set;
when an execution instruction which is sent by a client and carries a node list formed by a plurality of nodes in the data processing model object instance is received, for a first node in the node list, if input data of the first node comes from a father node of the first node and a data set corresponding to the father node of the first node is not successfully processed, adding the father node of the first node into the node list and preferentially processing the father node; if the input data of the first node comes from the father node of the first node and the data set corresponding to the father node of the first node is successfully processed, acquiring the output data set of the father node of the first node from an execution context as the input data set of the first node, processing the input data set of the first node based on the data set corresponding to the first node to generate the output data set of the first node, and recording the output data set of the first node into the execution context; the first node is any one node in the node list.
In the above method, preferably, the obtaining, based on the data processing model description file sent by the client, the data processing model object instance corresponding to the data processing model description file includes:
converting the data processing model description file sent by the client into a first data processing model object instance;
determining, according to the unique identification code of the data processing model, whether a data processing model object instance has already been created for the data processing model description file;
if no data processing model object instance has been created for the data processing model description file, determining the first data processing model object instance as the data processing model object instance corresponding to the data processing model description file;
and if a data processing model object instance has already been created for the data processing model description file, merging the first data processing model object instance with the created second data processing model object instance corresponding to the data processing model description file to obtain the data processing model object instance corresponding to the data processing model description file.
In the above method, preferably, the merging the first data processing model object instance with the created second data processing model object instance corresponding to the data processing model description file includes:
comparing the first data processing model object instance with the second data processing model object instance;
for a second node in the first data processing model object instance and a third node in the second data processing model object instance, which has the same unique identification code as the second node, if the parameters in the data set corresponding to the second node are different from the parameters in the data set corresponding to the third node, updating the data set corresponding to the second node to the third node, and marking the third node as an unprocessed state;
if the first data processing model object instance has a fourth node and the second data processing model object instance does not contain the fourth node, inserting the fourth node into the second data processing model object instance and marking the fourth node in the second data processing model object instance as an unprocessed state;
if the second data processing model object instance has a fifth node and the first data processing model object instance does not contain the fifth node, deleting the fifth node in the second data processing model object instance and marking all child nodes of the fifth node as unprocessed states;
and marking the nodes of which the states of all the parent nodes are unprocessed states in the second data processing model object instance as unprocessed states.
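The merge rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Node` class, the `merge` function, and the status strings are hypothetical names, and a model object instance is assumed to be a plain dict from unique identification code to node. Two further assumptions are made where the text is silent or ambiguous: edges of nodes present in both instances are taken from the new instance, and a node is treated as unprocessed as soon as any one of its parents is unprocessed, with the check repeated until nothing changes.

```python
from dataclasses import dataclass, field

DIRTY, CLEAR = "Dirty", "Clear"  # illustrative status values


@dataclass
class Node:
    node_id: str
    node_type: str
    params: dict
    parents: list = field(default_factory=list)
    status: str = DIRTY


def merge(new, old):
    """Update `old` (the second instance) in place from `new` (the first instance)."""
    common = new.keys() & old.keys()
    added = new.keys() - old.keys()
    removed = old.keys() - new.keys()

    # Node only in the old instance: mark its children unprocessed, then delete it.
    for node_id in removed:
        for other in old.values():
            if node_id in other.parents:
                other.status = DIRTY
        del old[node_id]

    # Same identifier, different data set: copy the data set and mark unprocessed.
    for node_id in common:
        n, o = new[node_id], old[node_id]
        if (n.node_type, n.params) != (o.node_type, o.params):
            o.node_type, o.params = n.node_type, dict(n.params)
            o.status = DIRTY
        o.parents = list(n.parents)  # assumption: edges follow the new instance

    # Node only in the new instance: insert it as unprocessed.
    for node_id in added:
        n = new[node_id]
        old[node_id] = Node(n.node_id, n.node_type, dict(n.params), list(n.parents), DIRTY)

    # Propagate the unprocessed state downstream (assumed reading of the last rule).
    changed = True
    while changed:
        changed = False
        for node in old.values():
            if node.status != DIRTY and any(
                old[p].status == DIRTY for p in node.parents if p in old
            ):
                node.status = DIRTY
                changed = True
    return old


old_model = {
    "a": Node("a", "HDFSInput", {"path": "/x"}, [], CLEAR),
    "b": Node("b", "Filter", {"rule": "v > 1"}, ["a"], CLEAR),
    "c": Node("c", "Map", {}, ["b"], CLEAR),
}
new_model = {
    "a": Node("a", "HDFSInput", {"path": "/x"}),
    "b": Node("b", "Filter", {"rule": "v > 2"}, ["a"]),
    "c": Node("c", "Map", {}, ["b"]),
    "d": Node("d", "distinct", {}, ["c"]),
}
merge(new_model, old_model)
```

In this example only the filter rule of node `b` changed and node `d` was added, so node `a` keeps its successful state while `b`, `c`, and `d` become unprocessed.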
In the above method, preferably, the processing the input data set of the first node based on the data set corresponding to the first node, and the generating the output data set of the first node includes:
generating an operation function file corresponding to the first node based on the data set corresponding to the first node;
dynamically compiling the operation function file and loading a corresponding function object;
executing the function object on the input data set of the first node to generate an output data set of the first node.
In the above method, preferably, the generating an operation function file corresponding to the first node based on the data set corresponding to the first node includes:
reading the type and the parameters of the first node from the data set corresponding to the first node;
determining a program file template corresponding to the first node based on the type of the first node;
filling the parameters into the program file template to generate a program source file corresponding to the first node;
and compiling the program source file to obtain an operation function file corresponding to the first node.
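The sequence of reading the node type and parameters, filling a template, compiling, and loading a function object can be illustrated with a small Python sketch. The source targets Spark and does not specify the template format or the compilation mechanism, so everything here is an assumption made for illustration: the `TEMPLATES` table, the `build_function` helper, the function name `operate`, and the use of Python's built-in `string.Template`, `compile`, and `exec` as stand-ins for the template filling and dynamic compilation steps.

```python
from string import Template

# Hypothetical program file templates, one per node type.
# Each template defines a function named `operate`.
TEMPLATES = {
    "Filter": Template("def operate(records):\n    return [r for r in records if $rule]\n"),
    "Map": Template("def operate(records):\n    return [$expr for r in records]\n"),
}


def build_function(node_type, params):
    """Generate, compile, and load the operation function for one node."""
    source = TEMPLATES[node_type].substitute(params)       # fill parameters into the template
    code = compile(source, f"<{node_type} node>", "exec")  # dynamic compilation
    namespace = {}
    exec(code, namespace)                                  # load the function object
    return namespace["operate"]


keep_large = build_function("Filter", {"rule": "r > 10"})
double = build_function("Map", {"expr": "r * 2"})
result = double(keep_large([5, 12, 30]))
```

Because `exec` runs whatever source the template produces, a real system would need to validate or restrict the user-supplied parameters before filling them in.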
A data processing apparatus comprising:
the acquisition module is used for acquiring a data processing model object instance corresponding to the data processing model description file based on the data processing model description file sent by the client; the data processing model description file is obtained by converting a data processing model, the data processing model is a directed graph, nodes in the directed graph comprise an operation node comprising at least one father node and a data source node not comprising any father node, and each node in the directed graph corresponds to a data set;
a processing module, configured to, when receiving an execution instruction sent by a client and carrying a node list formed by a plurality of nodes in the data processing model object instance, add, for a first node in the node list, a parent node of the first node to the node list and perform preferential processing if input data of the first node is from the parent node of the first node and a data set corresponding to the parent node of the first node is not successfully processed; if the input data of the first node comes from the father node of the first node and the data set corresponding to the father node of the first node is successfully processed, acquiring the output data set of the father node of the first node from an execution context as the input data set of the first node, processing the input data set of the first node based on the data set corresponding to the first node to generate the output data set of the first node, and recording the output data set of the first node into the execution context; the first node is any one node in the node list.
In the above apparatus, preferably, the acquisition module includes:
the conversion submodule is used for converting the data processing model description file sent by the client into a first data processing model object instance;
the judging submodule is used for determining, according to the unique identification code of the data processing model, whether a data processing model object instance has already been created for the data processing model description file;
the determining submodule is used for determining the first data processing model object instance as a data processing model object instance corresponding to the data processing model description file if the data processing model description file has not created the data processing model object instance;
and the merging submodule is used for merging the first data processing model object instance and the created second data processing model object instance corresponding to the data processing model description file to obtain the data processing model object instance corresponding to the data processing model description file if the data processing model description file has created the data processing model object instance.
In the foregoing apparatus, preferably, the merging submodule includes:
a comparison unit for comparing the first data processing model object instance with the second data processing model object instance;
a first processing unit, configured to, for a second node in the first data processing model object instance and a third node in the second data processing model object instance that has a same unique identification code as the second node, update a data set corresponding to the second node to the third node if a parameter in the data set corresponding to the second node is different from a parameter in the data set corresponding to the third node, and mark the third node in an unprocessed state;
a second processing unit, configured to insert a fourth node into the second data processing model object instance and mark the fourth node in the second data processing model object instance as an unprocessed state if the first data processing model object instance has the fourth node and the second data processing model object instance does not include the fourth node;
a third processing unit, configured to delete a fifth node in the second data processing model object instance and mark all child nodes of the fifth node as an unprocessed state if the second data processing model object instance has the fifth node and the first data processing model object instance does not include the fifth node;
and the fourth processing unit is used for marking the nodes of which the states of all the father nodes are unprocessed states in the second data processing model object instance as unprocessed states.
In the apparatus, preferably, in terms of processing the input dataset of the first node based on the dataset corresponding to the first node to generate the output dataset of the first node, the processing module is specifically configured to generate an operation function file corresponding to the first node based on the dataset corresponding to the first node; dynamically compiling the operation function file and loading a corresponding function object; executing the function object on the input data set of the first node to generate an output data set of the first node.
In the foregoing apparatus, preferably, in terms of generating an operation function file corresponding to the first node based on the dataset corresponding to the first node, the processing module is specifically configured to read the type and the parameter of the first node from the dataset corresponding to the first node; determining a program file template corresponding to the first node based on the type of the first node; filling the parameters into the program file template to generate a program source file corresponding to the first node; and compiling the program source file to obtain an operation function file corresponding to the first node.
According to the scheme, the data processing method and the data processing device provided by the application represent a data processing model by using a directed graph, when an instruction which is sent by a client and carries a node list is received, for any node in the node list, if a data set corresponding to a parent node of the node is not processed, a data set corresponding to the parent node of the node is preferentially processed, if the data set corresponding to the parent node of the node is processed, an output data set of the parent node is directly read from an execution context to serve as an input data set of the node, the input data set of the node is processed based on the data set corresponding to the node, the output data set of the node is generated, and the output data set of the node is recorded into the execution context. Therefore, the data processing method provided by the embodiment of the invention can realize that the data of only part of the nodes is processed without repeatedly processing the data set of the successfully processed nodes, thereby improving the data processing efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart of an implementation of a data processing method according to an embodiment of the present application;
Fig. 2 is an exemplary diagram of a data processing model provided by an embodiment of the present application;
Fig. 3 is a flowchart of an implementation of obtaining a data processing model object instance corresponding to a data processing model description file based on the data processing model description file sent by a client according to the embodiment of the present application;
Fig. 4 is a flowchart illustrating an implementation of processing an input data set of a first node based on a data set corresponding to the first node to generate an output data set of the first node according to the embodiment of the present application;
Fig. 5 is a flowchart of an implementation of generating an operation function file corresponding to a first node based on a data set corresponding to the first node according to the embodiment of the present application;
Fig. 6 is a flowchart of another implementation of a data processing method according to an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
Fig. 8 is a schematic structural diagram of an acquisition module according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of a merge sub-module provided in an embodiment of the present application.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be practiced otherwise than as specifically illustrated.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
The data processing method and device provided by the embodiment of the invention can be applied to a distributed computing system Spark to realize interactive processing of a data set.
Referring to fig. 1, fig. 1 is a flowchart of an implementation of a data processing method according to an embodiment of the present application, which may include:
step S11: acquiring a data processing model object instance corresponding to a data processing model description file based on the data processing model description file sent by a client;
The data processing model description file is obtained by converting the data processing model, and describes the information of the data processing model graph in an agreed encoding format. The data processing model is a directed graph, nodes in the directed graph comprise an operation node containing at least one parent node and a data source node not containing any parent node, and each node in the directed graph corresponds to a data set.
In the embodiment of the invention, a user builds a data processing model at a client according to a data analysis scenario, and the client converts the built data processing model into a data processing model description file and sends the data processing model description file to a server.
The data processing model is a directed graph. Fig. 2 is a diagram illustrating an exemplary data processing model according to an embodiment of the present invention. The directed graph is composed of a plurality of nodes, each node represents a data processing unit and comprises functional modules for acquiring input data, processing the input data (executing a section of data analysis logic on the input data), storing processing results and the like. The directed graph has at least one node as a source node (such a node does not depend on data of other nodes as input, but directly reads data from other external systems), and the other nodes use the processing result of the parent node as own input data according to the dependency relationship between the nodes described by the directed edge.
The directed graph comprises two types of nodes, wherein one type is a data source node without any parent node, such as nodes No. 1-3 in FIG. 2, and the other type is an operation node with at least one parent node, such as nodes No. 4-9 in FIG. 2. Moreover, each node in the directed graph corresponds to a data set. The parent node of node 5 is node 4, and node 5 is the parent node of node 6.
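As a rough illustration of the two node types, the directed graph can be held as a mapping from each node to its parent nodes. The sketch below is hypothetical: only the edges from node 4 to node 5 and from node 5 to node 6 are stated in the text, and the remaining edges, the `parents` mapping, and the `children` helper are invented for the example rather than taken from Fig. 2.

```python
# Hypothetical graph: each node number maps to the list of its parent nodes.
# Only the edges 4 -> 5 and 5 -> 6 come from the text; the rest are made up.
parents = {
    1: [], 2: [], 3: [],
    4: [1, 2],
    5: [4],
    6: [5, 3],
}

# A data source node has no parent; an operation node has at least one.
source_nodes = sorted(n for n, ps in parents.items() if not ps)
operation_nodes = sorted(n for n, ps in parents.items() if ps)


def children(node):
    """Return the nodes that take `node` as a parent."""
    return sorted(n for n, ps in parents.items() if node in ps)
```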
The data set corresponding to a node is used to generate the program file corresponding to the node. The data set corresponding to each node comprises the type information of the node and the node parameters configured by the user. Specifically:
For an operation node of the set operation class, the node type may include: Map (one-to-one mapping), Filter (filtering), FlatMap (one-to-many mapping), Union (union), sample (sampling), intersection (intersection), distinct (removing duplicate records), reduceByKey (merging by primary key), join (joining by primary key), cartesian (Cartesian product), and subtract (difference set).
For an operation node of the import/export operation class, the node type may include: HDFSInput (import an HDFS file) and HDFSOutput (export to HDFS).
For an operation node of the mining algorithm class, the node type may include three major classes of algorithm: classification, clustering, and frequent items, where each algorithm is abstracted into one node.
Node parameters may differ depending on the node type. For example, for an HDFSInput node, the node parameters that the user needs to configure include the path, file format, and file encoding of the input file; for a Filter node, the user needs to enter a data filtering rule and the like according to a graph.
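A node's data set (type plus user-configured parameters) might be represented as below. The parameter key names (`path`, `file_format`, `encoding`, `rule`), the `REQUIRED_PARAMS` table, and the `missing_params` helper are assumptions chosen for illustration; the source does not define a concrete schema.

```python
# Hypothetical table of the parameters each node type expects from the user.
REQUIRED_PARAMS = {
    "HDFSInput": {"path", "file_format", "encoding"},
    "Filter": {"rule"},
}


def missing_params(node):
    """Return the required parameters that the user has not configured yet."""
    return sorted(REQUIRED_PARAMS[node["type"]] - set(node["params"]))


input_node = {
    "type": "HDFSInput",
    "params": {"path": "/data/in.csv", "file_format": "csv", "encoding": "utf-8"},
}
filter_node = {"type": "Filter", "params": {}}
```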
In addition, in the data processing model, each node includes a status flag. The status of each node changes among Dirty, Running, Clear, and Error: Dirty indicates that the node has not been processed, Running indicates that the node is being processed, Clear indicates that the node has been processed successfully, and Error indicates that an error occurred while the node was being processed.
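The status flag and its transitions can be sketched as a small enumeration. The `Status` enum, the dict-based node, and the `run_with_status` helper are hypothetical; the sketch only shows one plausible way the four states could be set around a node's execution.

```python
from enum import Enum


class Status(Enum):
    DIRTY = "Dirty"      # not processed
    RUNNING = "Running"  # being processed
    CLEAR = "Clear"      # processed successfully
    ERROR = "Error"      # an error occurred during processing


def run_with_status(node, operation):
    """Run `operation` for `node`, updating the node's status flag as it goes."""
    node["status"] = Status.RUNNING
    try:
        node["output"] = operation()
    except Exception:
        node["status"] = Status.ERROR
    else:
        node["status"] = Status.CLEAR
    return node["status"]


good = {"status": Status.DIRTY}
bad = {"status": Status.DIRTY}
good_result = run_with_status(good, lambda: [1, 2])
bad_result = run_with_status(bad, lambda: 1 / 0)
```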
In addition, after each node is executed successfully, the execution result of the node is also recorded into the execution context, so that the child nodes of the node use the output result of the node.
Optionally, after each node is successfully executed, the execution context of the node may be stored in a preset cache, so that the child node of the node reads the input data set from the cache, and the processing efficiency is further improved.
Step S12: when an execution instruction which is sent by a client and carries a node list consisting of a plurality of nodes in the data processing model object instance is received, processing the data set corresponding to each specified node in the node list;
The execution instruction carrying the node list is generated when the user designates nodes in the data processing model instance. The user may designate one node, two or more nodes, or all nodes in the data processing model instance. The nodes included in the node list are the nodes designated by the user.
For convenience of description, any node in the node list is marked as a first node, and if the input data of the first node is from a father node of the first node and a data set corresponding to the father node of the first node is not processed, the father node of the first node is added into the node list and is preferentially processed; if the input data of the first node comes from a father node of the first node and the data set corresponding to the father node of the first node is successfully processed, acquiring an output data set of the father node of the first node from the execution context as the input data set of the first node, processing the input data set of the first node based on the data set corresponding to the first node to generate the output data set of the first node, and recording the output data set of the first node into the execution context; the first node is any one node in the node list.
The execution instruction sent by the client comprises a node list, and the nodes contained in the node list are part or all of the nodes in the data processing model instance.
For a first node in the node list, if the input of the first node is the output of the parent node of the first node, it is first determined whether the parent node of the first node has been successfully processed. If the parent node has not been successfully processed (that is, it has not been processed, is being processed, or encountered an error during processing), the parent node is processed first, and the first node is processed after the parent node has been successfully processed. If the parent node has been successfully processed, the output data set of the parent node is read directly from the execution context, and the parent node does not need to be processed again.
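The parent-first processing with reuse of results from the execution context can be sketched as follows. This is an illustration under assumptions, not the method as implemented in the patent: the `execute` function, the dict-based `parents`, `operations`, `status`, and `context` arguments are hypothetical, the graph is assumed to be acyclic, and recursion is used in place of explicitly adding the parent to the node list.

```python
def execute(node_list, parents, operations, status, context):
    """Process the requested nodes, running unprocessed parents first.

    Returns the nodes that were actually executed, in execution order.
    """
    executed = []

    def process(node):
        if status.get(node) == "Clear":
            return  # already processed: its output is reused from the context
        for parent in parents[node]:
            process(parent)  # an unprocessed parent is handled first
        inputs = [context[p] for p in parents[node]]  # read parents' outputs
        context[node] = operations[node](*inputs)     # record this node's output
        status[node] = "Clear"
        executed.append(node)

    for node in node_list:
        process(node)
    return executed


parents = {"src": [], "filt": ["src"], "dbl": ["filt"]}
operations = {
    "src": lambda: [1, 2, 3, 4],
    "filt": lambda xs: [x for x in xs if x % 2 == 0],
    "dbl": lambda xs: [x * 2 for x in xs],
}
status, context = {}, {}
first_run = execute(["filt"], parents, operations, status, context)
second_run = execute(["dbl"], parents, operations, status, context)
```

In the second call only `dbl` is executed, because `src` and `filt` were already processed successfully and their outputs are read back from `context`.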
In the data processing method provided in the embodiment of the present invention, a data processing model is represented by a directed graph, and when an instruction carrying a node list sent by a client is received, for any node in the node list, if a data set corresponding to a parent node of the node is not processed, a data set corresponding to the parent node of the node is preferentially processed, if a data set corresponding to the parent node of the node is processed, an output data set of the parent node is directly read from an execution context as an input data set of the node, the input data set of the node is processed based on the data set corresponding to the node, an output data set of the node is generated, and the output data set of the node is recorded in the execution context. Therefore, the data processing method provided by the embodiment of the invention can realize that the data of only part of the nodes is processed without repeatedly processing the data set of the successfully processed nodes, thereby improving the data processing efficiency.
Optionally, an implementation flowchart for obtaining a data processing model object instance corresponding to a data processing model description file based on the data processing model description file sent by a client according to the embodiment of the present invention is shown in fig. 3, and may include:
Step S31: converting a data processing model description file sent by a client into a first data processing model object instance;
in the embodiment of the invention, after the data processing model description file sent by the client is received, the data processing model description file sent by the client is converted into the data processing model object instance (which is recorded as the first data processing model object instance for convenience in description).
Step S32: determining, according to the unique identification code of the data processing model, whether a data processing model object instance has already been created for the data processing model description file;
In the embodiment of the present invention, each data processing model has a unique identifier, such as a UUID (Universally Unique Identifier). After the data processing model description file is converted into a data processing model object instance, a correspondence between the unique identifier and the data processing model object instance can be established.
If the unique identification code corresponding to an existing data processing model object instance is consistent with the unique identification code corresponding to the first data processing model object instance, a data processing model object instance has already been created for the data processing model description file; otherwise, no data processing model object instance has been created for it.
Step S33: if no data processing model object instance has been created for the data processing model description file, determining the first data processing model object instance as the data processing model object instance corresponding to the data processing model description file;
Step S34: if a data processing model object instance has already been created for the data processing model description file, merging the first data processing model object instance with the created data processing model object instance corresponding to the data processing model description file (referred to as the second data processing model object instance for ease of description) to obtain the data processing model object instance corresponding to the data processing model description file.
If a data processing model object instance has already been created for the data processing model description file, the user has modified the data processing model, and the data processing model object instance corresponding to the data processing model description file needs to be updated.
Merging the first data processing model object instance with the created second data processing model object instance corresponding to the data processing model description file specifically comprises updating the second data processing model object instance according to the first data processing model object instance.
Optionally, an implementation of merging the first data processing model object instance with the created second data processing model object instance corresponding to the data processing model description file according to the embodiment of the present invention may be:
comparing the first data processing model object instance with the second data processing model object instance;
The comparison determines whether any node in the first data processing model object instance differs from the node having the same unique identification code in the second data processing model object instance, and whether nodes have been added to or removed from the first data processing model object instance relative to the second data processing model object instance.
For a second node in the first data processing model object instance and a third node in the second data processing model object instance that has the same unique identification code as the second node, if the parameters in the data set corresponding to the second node are different from the parameters in the data set corresponding to the third node, updating the third node with the data set corresponding to the second node, and marking the third node as an unprocessed state;
and if the parameters in the data set corresponding to the second node are the same as the parameters in the data set corresponding to the third node, the data set corresponding to the third node is not modified.
If the first data processing model object instance has a fourth node and the second data processing model object instance does not contain the fourth node, inserting the fourth node into the second data processing model object instance and marking the fourth node in the second data processing model object instance as an unprocessed state;
If the first data processing model object instance has a fourth node and the second data processing model object instance does not contain the fourth node, the fourth node was added by the user while modifying the data processing model.
If the second data processing model object instance has a fifth node and the first data processing model object instance does not contain the fifth node, deleting the fifth node in the second data processing model object instance and marking all child nodes of the fifth node as unprocessed states;
the second data processing model object instance has the fifth node therein, and the first data processing model object instance does not contain the fifth node, indicating that the user deleted the fifth node during the modification of the data processing model.
And marking the nodes of which the states of all the parent nodes are unprocessed states in the second data processing model object instance as unprocessed states.
After nodes have been updated, added or deleted, the second data processing model object instance is traversed starting from the data source nodes, and the nodes of which the states of all parent nodes are unprocessed states are marked as unprocessed states. That is, for any node (for convenience of description, referred to as a sixth node) in the second data processing model object instance, if the parent node of the sixth node is in the unprocessed state, the sixth node is also marked as the unprocessed state.
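The merge described above can be sketched in Python as follows. The `Node` class and the `merge` function are hypothetical, and each object instance is assumed to be a dictionary from unique identification code to node. Two further assumptions are made where the text leaves room: a node is marked unprocessed as soon as any one of its parents is unprocessed (following the sixth-node explanation), and the propagation is run as a repeat-until-stable loop rather than an explicit traversal from the data source nodes.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    params: dict
    parents: list = field(default_factory=list)   # ids of parent nodes
    processed: bool = False


def merge(first, second):
    """Update `second` (existing instance) from `first` (new instance) in place."""
    for nid, new in first.items():
        old = second.get(nid)
        if old is None:                           # fourth node: added by the user
            second[nid] = Node(dict(new.params), list(new.parents))
        elif old.params != new.params:            # second/third node: parameters changed
            old.params = dict(new.params)
            old.processed = False
    for nid in list(second):
        if nid not in first:                      # fifth node: deleted by the user
            for child in second.values():
                if nid in child.parents:
                    child.parents.remove(nid)
                    child.processed = False       # children must be reprocessed
            del second[nid]
    # Propagate the unprocessed state downstream until nothing changes.
    changed = True
    while changed:
        changed = False
        for node in second.values():
            stale = any(not second[p].processed
                        for p in node.parents if p in second)
            if node.processed and stale:
                node.processed = False
                changed = True
```

Nodes whose parameters are unchanged and whose ancestors are all still processed keep their processed state, so their output data sets in the execution context remain usable.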
Optionally, an implementation flowchart for processing an input data set of a first node based on a data set corresponding to the first node to generate an output data set of the first node is shown in fig. 4, and may include:
step S41: generating an operation function file corresponding to the first node based on the data set corresponding to the first node;
step S42: dynamically compiling the generated operation function file and loading a corresponding function object;
step S43: the function object is executed on the input dataset of the first node, generating an output dataset of the first node.
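Steps S42 and S43 can be sketched as follows. The patent does not state the programming language of the operation function file; here Python's import machinery stands in for dynamic compilation and loading, and the function name `operate`, the module name `node_op` and the sample source text are all assumptions made for illustration.

```python
import importlib.util
import tempfile
from pathlib import Path


def load_function(source, func_name="operate"):
    """Write `source` to a file, compile and load it, and return the named function."""
    path = Path(tempfile.mkdtemp()) / "node_op.py"
    path.write_text(source)
    spec = importlib.util.spec_from_file_location("node_op", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)    # S42: compile the file and load its objects
    return getattr(module, func_name)


# Hypothetical generated operation function file for a filter node.
SOURCE = "def operate(rows):\n    return [r for r in rows if r > 10]\n"

operate = load_function(SOURCE)
output = operate([5, 12, 30])          # S43: run the function object on the input data set
```

Because the function object is loaded at run time, a node whose parameters change only needs its own file regenerated and reloaded; the rest of the model is untouched.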
Optionally, an implementation flowchart for generating an operation function file corresponding to the first node based on the data set corresponding to the first node, provided by the embodiment of the present invention, is shown in fig. 5, and may include:
step S51: reading the type and parameters of the first node from a data set corresponding to the first node;
step S52: determining a program file template corresponding to the first node based on the type of the first node;
in the embodiment of the invention, different node types correspond to different program file templates.
Step S53: filling the parameters into the determined program file template to generate a program source file corresponding to the first node;
step S54: and compiling the generated program source file to obtain an operation function file corresponding to the first node.
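Steps S51 to S54 can be sketched in Python under stated assumptions: the node types `filter` and `project`, their templates, and the parameter names `column`, `threshold` and `columns` are hypothetical, and Python's built-in `compile` stands in for whatever compiler the actual system uses.

```python
from string import Template

# Hypothetical program file templates, one per node type (S52).
TEMPLATES = {
    "filter": Template(
        "def operate(rows):\n"
        "    return [r for r in rows if r[$column] > $threshold]\n"
    ),
    "project": Template(
        "def operate(rows):\n"
        "    return [{k: r[k] for k in $columns} for r in rows]\n"
    ),
}


def generate_source(node):
    """Fill the node's parameters into the template chosen by its type."""
    template = TEMPLATES[node["type"]]                          # S51, S52
    literals = {k: repr(v) for k, v in node["params"].items()}  # values become literals
    return template.substitute(literals)                        # S53


def build_operation(node):
    """Compile the generated program source and return its operation function."""
    source = generate_source(node)
    code = compile(source, "<node %s>" % node["type"], "exec")  # S54
    namespace = {}
    exec(code, namespace)
    return namespace["operate"]
```

Inserting each parameter through `repr` keeps values as literals in the generated source; a real system would still need to validate parameters before generating and running code from them.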
Optionally, as shown in fig. 6, another implementation flowchart of the data processing method provided in the embodiment of the present invention may include:
step S61: receiving a data processing model description file sent by a client, and converting the data processing model description file into an instance M' of a data processing model object;
step S62: judging whether the instance of the data processing model object is created according to the unique identification code of M'; if not, go to step S63; if yes, go to step S64;
step S63: saving the storage start address of the data processing model object instance M' in a Map data structure, and then performing step S65;
step S64: finding the recorded data processing model object instance M in the Map, executing the merge algorithm on M and M' so that the information carried in M' is updated into M, and then performing step S65;
step S65: receiving an execution instruction sent by a user through the client, wherein the execution instruction comprises a node list consisting of a plurality of nodes in M, and the node list specifies all the nodes that must be executed this time;
step S66: judging whether Spark resources have already been applied for to execute the data processing model; if not, going to step S67; if so, going to step S68;
step S67: applying for computing resources from the Spark cluster, and then executing step S68;
step S68: processing the nodes in the node list; for each node in the node list, if the input of the node is the output of the parent node of the node, judging whether the parent node of the node has been successfully processed; if the parent node of the node has not been successfully processed, processing the parent node first, and processing the node after the parent node has been successfully processed; if the parent node of the node has been successfully processed, reading the output data set of the parent node directly from the execution context without processing the parent node again;
step S69: judging whether the client sends a termination signal (the termination signal is triggered and generated by a user at the client); if yes, go to step S610, otherwise, return to step S61;
step S610: a release computation resource signal is sent to Spark.
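The node processing of step S68 can be sketched as follows. This is a simplified illustration, not the patented implementation: nodes are plain dictionaries, the execution context is a dictionary from node id to output data set, the operations are ordinary Python callables instead of Spark jobs, and recursion replaces the act of adding a parent node to the node list. The names `run` and `process` are hypothetical.

```python
def run(node_list, nodes, context):
    """Process the listed nodes, processing unprocessed parents first.

    nodes:   node id -> {"parents": [...], "op": callable, "processed": bool}
    context: node id -> output data set of a successfully processed node
    Returns the ids of the nodes actually executed, in execution order.
    """
    executed = []

    def process(nid):
        if nid in executed:                   # already executed during this call
            return
        node = nodes[nid]
        inputs = []
        for pid in node["parents"]:
            if not nodes[pid]["processed"]:   # parent not yet successfully processed
                process(pid)                  # process the parent first
            inputs.append(context[pid])       # read the parent's output from the context
        context[nid] = node["op"](*inputs)    # record this node's output data set
        node["processed"] = True
        executed.append(nid)

    for nid in node_list:
        process(nid)
    return executed


def load():
    return [1, 2, 3]


def double(rows):
    return [2 * r for r in rows]


nodes = {
    "src": {"parents": [], "op": load, "processed": False},
    "dbl": {"parents": ["src"], "op": double, "processed": False},
    "tot": {"parents": ["dbl"], "op": sum, "processed": False},
}
context = {}
first_run = run(["tot"], nodes, context)      # parents are processed on demand
second_run = run(["tot"], nodes, context)     # parents are read from the context
```

On the second call only the requested node runs, since its parent's output data set is already recorded in the execution context.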
Corresponding to the method embodiment, an embodiment of the present invention further provides a data processing apparatus, and a schematic structural diagram of the data processing apparatus according to the embodiment of the present invention is shown in fig. 7, and the data processing apparatus may include:
an obtaining module 71 and a processing module 72; wherein,
the obtaining module 71 is configured to obtain a data processing model object instance corresponding to the data processing model description file based on the data processing model description file sent by the client; the data processing model description file is obtained by converting a data processing model, the data processing model is a directed graph, nodes in the directed graph comprise operation nodes having at least one parent node and data source nodes having no parent node, and each node in the directed graph corresponds to a data set;
the processing module 72 is configured to, when receiving an execution instruction sent by a client and carrying a node list formed by a plurality of nodes in the data processing model object instance, for a first node in the node list, add the parent node of the first node into the node list and process it preferentially if the input data of the first node comes from the parent node of the first node and the data set corresponding to the parent node of the first node has not been successfully processed; if the input data of the first node comes from the parent node of the first node and the data set corresponding to the parent node of the first node has been successfully processed, acquire the output data set of the parent node of the first node from the execution context as the input data set of the first node, process the input data set of the first node based on the data set corresponding to the first node to generate the output data set of the first node, and record the output data set of the first node into the execution context; the first node is any one node in the node list.
The data processing apparatus according to the embodiment of the present invention uses a directed graph to represent a data processing model, and when receiving an instruction carrying a node list sent by a client, for any node in the node list, if a data set corresponding to a parent node of the node is not processed, processing a data set corresponding to the parent node of the node preferentially, if a data set corresponding to the parent node of the node is processed, directly reading an output data set of the parent node from an execution context as an input data set of the node, processing the input data set of the node based on the data set corresponding to the node, generating an output data set of the node, and recording the output data set of the node in the execution context. Therefore, the data processing device provided by the embodiment of the invention can process the data of partial nodes without repeatedly processing the data set of the successfully processed nodes, thereby improving the data processing efficiency.
Optionally, a schematic structural diagram of the obtaining module 71 provided in the embodiment of the present invention is shown in fig. 8, and may include:
a conversion submodule 81, a judging submodule 82, a determining submodule 83 and a merging submodule 84; wherein,
the conversion submodule 81 is configured to convert the data processing model description file sent by the client into a first data processing model object instance;
the judging submodule 82 is configured to judge, according to the unique identification code of the data processing model, whether a data processing model object instance has already been created for the data processing model description file;
the determining submodule 83 is configured to determine, if the data processing model description file has not created a data processing model object instance, the first data processing model object instance as the data processing model object instance corresponding to the data processing model description file;
the merging submodule 84 is configured to, if the data processing model description file has created a data processing model object instance, merge the first data processing model object instance with a second data processing model object instance that has been created and corresponds to the data processing model description file, so as to obtain a data processing model object instance that corresponds to the data processing model description file.
Optionally, a schematic structural diagram of the merge sub-module 84 provided in the embodiment of the present invention is shown in fig. 9, and may include:
a comparison unit 91, a first processing unit 92, a second processing unit 93, a third processing unit 94, and a fourth processing unit 95; wherein,
the comparison unit 91 is configured to compare the first data processing model object instance with the second data processing model object instance;
the first processing unit 92 is configured to, for a second node in the first data processing model object instance and a third node in the second data processing model object instance that has the same unique identification code as the second node, update the third node with the data set corresponding to the second node if a parameter in the data set corresponding to the second node is different from a parameter in the data set corresponding to the third node, and mark the third node as an unprocessed state;
the second processing unit 93 is configured to insert the fourth node into the second data processing model object instance and mark the fourth node in the second data processing model object instance as an unprocessed state if the first data processing model object instance has the fourth node and the second data processing model object instance does not contain the fourth node;
the third processing unit 94 is configured to delete the fifth node in the second data processing model object instance and mark all child nodes of the fifth node as an unprocessed state if the second data processing model object instance has the fifth node and the first data processing model object instance does not contain the fifth node;
the fourth processing unit 95 is configured to mark, in the second data processing model object instance, all nodes whose parent nodes are in an unprocessed state as an unprocessed state.
Optionally, in the aspect that the input data set of the first node is processed based on the data set corresponding to the first node to generate the output data set of the first node, the processing module 72 is specifically configured to generate an operation function file corresponding to the first node based on the data set corresponding to the first node; dynamically compiling the operation function file and loading a corresponding function object; the function object is executed on the input dataset of the first node, generating an output dataset of the first node.
Optionally, in the aspect of generating an operation function file corresponding to the first node based on the data set corresponding to the first node, the processing module 72 is specifically configured to read the type and the parameter of the first node from the data set corresponding to the first node; determining a program file template corresponding to the first node based on the type of the first node; filling the parameters into a program file template to generate a program source file corresponding to the first node; and compiling the program source file to obtain an operation function file corresponding to the first node.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and features disclosed herein.

Claims (8)

1. A data processing method, comprising:
converting a data processing model description file sent by a client into a first data processing model object instance;
judging whether the data processing model description file is created with a data processing model object instance;
if the data processing model description file is not created with a data processing model object instance, determining the first data processing model object instance as a data processing model object instance corresponding to the data processing model description file;
if the data processing model description file has created a data processing model object instance, merging the first data processing model object instance with a created second data processing model object instance corresponding to the data processing model description file to obtain a data processing model object instance corresponding to the data processing model description file; the data processing model description file is obtained by converting a data processing model, the data processing model is a directed graph, nodes in the directed graph comprise an operation node comprising at least one father node and a data source node not comprising any father node, and each node in the directed graph corresponds to a data set;
when an execution instruction which is sent by a client and carries a node list formed by a plurality of nodes in the data processing model object instance is received, for a first node in the node list, if input data of the first node comes from a father node of the first node and a data set corresponding to the father node of the first node is not successfully processed, adding the father node of the first node into the node list and preferentially processing the father node; if the input data of the first node comes from the father node of the first node and the data set corresponding to the father node of the first node is successfully processed, acquiring the output data set of the father node of the first node from an execution context as the input data set of the first node, processing the input data set of the first node based on the data set corresponding to the first node to generate the output data set of the first node, and recording the output data set of the first node into the execution context; the first node is any one node in the node list.
2. The method of claim 1, wherein merging the first data processing model object instance with the created second data processing model object instance corresponding to the data processing model description file comprises:
comparing the first data processing model object instance with the second data processing model object instance;
for a second node in the first data processing model object instance and a third node in the second data processing model object instance, which has the same unique identification code as the second node, if the parameters in the data set corresponding to the second node are different from the parameters in the data set corresponding to the third node, updating the data set corresponding to the second node to the third node, and marking the third node as an unprocessed state;
if the first data processing model object instance has a fourth node and the second data processing model object instance does not contain the fourth node, inserting the fourth node into the second data processing model object instance and marking the fourth node in the second data processing model object instance as an unprocessed state;
if the second data processing model object instance has a fifth node and the first data processing model object instance does not contain the fifth node, deleting the fifth node in the second data processing model object instance and marking all child nodes of the fifth node as unprocessed states;
and marking the nodes of which the states of all the parent nodes are unprocessed states in the second data processing model object instance as unprocessed states.
3. The method of claim 1, wherein the processing the input dataset of the first node based on the dataset corresponding to the first node, and wherein generating the output dataset of the first node comprises:
generating an operation function file corresponding to the first node based on the data set corresponding to the first node;
dynamically compiling the operation function file and loading a corresponding function object;
executing the function object on the input data set of the first node to generate an output data set of the first node.
4. The method of claim 3, wherein generating the operation function file corresponding to the first node based on the data set corresponding to the first node comprises:
reading the type and the parameters of the first node from the data set corresponding to the first node;
determining a program file template corresponding to the first node based on the type of the first node;
filling the parameters into the program file template to generate a program source file corresponding to the first node;
and compiling the program source file to obtain an operation function file corresponding to the first node.
5. A data processing apparatus, comprising:
an acquisition module, comprising: the conversion submodule is used for converting the data processing model description file sent by the client into a first data processing model object instance; the judging submodule is used for judging whether the data processing model description file is created with a data processing model object instance; the determining submodule is used for determining the first data processing model object instance as a data processing model object instance corresponding to the data processing model description file if the data processing model description file has not created the data processing model object instance; a merging submodule, configured to merge the first data processing model object instance with a created second data processing model object instance corresponding to the data processing model description file if the data processing model description file has created a data processing model object instance, to obtain a data processing model object instance corresponding to the data processing model description file; the data processing model description file is obtained by converting a data processing model, the data processing model is a directed graph, nodes in the directed graph comprise an operation node comprising at least one father node and a data source node not comprising any father node, and each node in the directed graph corresponds to a data set;
a processing module, configured to, when receiving an execution instruction sent by a client and carrying a node list formed by a plurality of nodes in the data processing model object instance, add, for a first node in the node list, a parent node of the first node to the node list and perform preferential processing if input data of the first node is from the parent node of the first node and a data set corresponding to the parent node of the first node is not successfully processed; if the input data of the first node comes from the father node of the first node and the data set corresponding to the father node of the first node is successfully processed, acquiring the output data set of the father node of the first node from an execution context as the input data set of the first node, processing the input data set of the first node based on the data set corresponding to the first node to generate the output data set of the first node, and recording the output data set of the first node into the execution context; the first node is any one node in the node list.
6. The apparatus of claim 5, wherein the merge sub-module comprises:
a comparison unit for comparing the first data processing model object instance with the second data processing model object instance;
a first processing unit, configured to, for a second node in the first data processing model object instance and a third node in the second data processing model object instance that has a same unique identification code as the second node, update a data set corresponding to the second node to the third node if a parameter in the data set corresponding to the second node is different from a parameter in the data set corresponding to the third node, and mark the third node in an unprocessed state;
a second processing unit, configured to insert a fourth node into the second data processing model object instance and mark the fourth node in the second data processing model object instance as an unprocessed state if the first data processing model object instance has the fourth node and the second data processing model object instance does not include the fourth node;
a third processing unit, configured to delete a fifth node in the second data processing model object instance and mark all child nodes of the fifth node as an unprocessed state if the second data processing model object instance has the fifth node and the first data processing model object instance does not include the fifth node;
and the fourth processing unit is used for marking the nodes of which the states of all the father nodes are unprocessed states in the second data processing model object instance as unprocessed states.
7. The apparatus according to claim 5, wherein in processing the input dataset of the first node based on the dataset corresponding to the first node to generate the output dataset of the first node, the processing module is specifically configured to generate an operation function file corresponding to the first node based on the dataset corresponding to the first node; dynamically compiling the operation function file and loading a corresponding function object; executing the function object on the input data set of the first node to generate an output data set of the first node.
8. The apparatus according to claim 7, wherein in generating the operation function file corresponding to the first node based on the dataset corresponding to the first node, the processing module is specifically configured to read the type and parameters of the first node from the dataset corresponding to the first node; determining a program file template corresponding to the first node based on the type of the first node; filling the parameters into the program file template to generate a program source file corresponding to the first node; and compiling the program source file to obtain an operation function file corresponding to the first node.
CN201610098936.1A 2016-02-23 2016-02-23 Data processing method and device Active CN105573836B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610098936.1A CN105573836B (en) 2016-02-23 2016-02-23 Data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610098936.1A CN105573836B (en) 2016-02-23 2016-02-23 Data processing method and device

Publications (2)

Publication Number Publication Date
CN105573836A CN105573836A (en) 2016-05-11
CN105573836B true CN105573836B (en) 2018-12-28

Family

ID=55884006

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610098936.1A Active CN105573836B (en) 2016-02-23 2016-02-23 Data processing method and device

Country Status (1)

Country Link
CN (1) CN105573836B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109743202B (en) * 2018-12-26 2022-04-15 中国联合网络通信集团有限公司 Data management method, device and equipment and readable storage medium
CN112598506A (en) * 2020-12-25 2021-04-02 中国农业银行股份有限公司 Method for determining false mortgage user and related device
CN113434323A (en) * 2021-06-28 2021-09-24 浙江大华技术股份有限公司 Task flow control method of data center station and related device
CN113918126B (en) * 2021-09-14 2022-06-10 北京柏睿数据技术股份有限公司 AI modeling flow arrangement method and system based on graph algorithm
CN114840265A (en) * 2022-03-23 2022-08-02 阿里巴巴(中国)有限公司 Data processing method based on executable graph

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1500228A1 (en) * 2002-04-30 2005-01-26 Nokia Corporation Method and device for management of tree data exchange
CN102819536A (en) * 2011-09-27 2012-12-12 金蝶软件(中国)有限公司 Processing method and device of tree type data
CN103049580A (en) * 2013-01-17 2013-04-17 北京工商大学 Method and device for visualization of layering data
CN104281681A (en) * 2014-10-07 2015-01-14 北京工商大学 Tetragonal ordered tree map layout method for hierarchical data

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7418454B2 (en) * 2004-04-16 2008-08-26 Microsoft Corporation Data overlay, self-organized metadata overlay, and application level multicasting
CN104714947A (en) * 2013-12-11 2015-06-17 深圳市腾讯计算机系统有限公司 Preset type number recognition method and device
US9946809B2 (en) * 2014-04-09 2018-04-17 Introspective Systems LLC Executable graph framework for the management of complex systems
JP6007430B2 (en) * 2015-05-20 2016-10-12 大澤 昇平 Machine learning model design support device, machine learning model design support method, program for machine learning model design support device
CN104955068B (en) * 2015-06-18 2018-04-13 湖南大学 A kind of data aggregate transmission method based on association mode
CN105117468B (en) * 2015-08-28 2019-05-28 广州酷狗计算机科技有限公司 A kind of network data processing method and device

Also Published As

Publication number Publication date
CN105573836A (en) 2016-05-11

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant