CN115237573B - Data processing method, device, electronic equipment and readable storage medium

Data processing method, device, electronic equipment and readable storage medium

Info

Publication number
CN115237573B
CN115237573B
Authority
CN
China
Prior art keywords
scheduling engine
sub
execution result
flow
data processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210943920.1A
Other languages
Chinese (zh)
Other versions
CN115237573A (en)
Inventor
张子浪
叶臻
郝慧俊
李小言
刘海滨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Tower Co Ltd
Original Assignee
China Tower Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Tower Co Ltd filed Critical China Tower Co Ltd
Priority to CN202210943920.1A priority Critical patent/CN115237573B/en
Publication of CN115237573A publication Critical patent/CN115237573A/en
Application granted granted Critical
Publication of CN115237573B publication Critical patent/CN115237573B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/48 - Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 - Task transfer initiation or dispatching
    • G06F9/4843 - Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 - Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 - Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application provides a data processing method, an apparatus, an electronic device and a readable storage medium. The method is applied to a big data platform that includes a first scheduling engine and a second scheduling engine, where a first flow and a first virtual flow are configured in the first scheduling engine, and a second flow and a second virtual flow are configured in the second scheduling engine. The method comprises the following steps: receiving a data processing request; and responding to the data processing request according to the first scheduling engine and the second scheduling engine to generate a target execution result. The first virtual flow maps the second flow, and the second virtual flow maps the first flow, so that possible conflicts between the first scheduling engine and the second scheduling engine are avoided. The big data platform can thus be compatible with different scheduling engines, which improves the reliability of the business operations executed by the big data platform during flow migration.

Description

Data processing method, device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data processing method, a data processing device, an electronic device, and a readable storage medium.
Background
To mine the value of data, the data needs to be collected, processed, analyzed, and otherwise handled. For a big data platform, the many data processing flows on the platform are generally coordinated by a configured scheduling engine to ensure that each flow executes reliably.
In practical applications, program transformation, upgrades, and similar reasons require multiple flows in a big data platform to be migrated from an old scheduling engine to a new scheduling engine. The related art generally performs this flow migration with a migration tool or through manual development, that is, by replacing the old scheduling engine with the new one. However, because of functional differences between scheduling engines, unpredictable errors may occur during the migration (for example, the new scheduling engine may not support all functions of the old one), which can seriously interfere with the data processing operations of the big data platform. That is, in the related art, the business operations executed during flow migration have low reliability.
Disclosure of Invention
The embodiments of the application aim to provide a data processing method, a data processing apparatus, an electronic device, and a readable storage medium, so as to solve the problem that business operations executed during flow migration in the related art have low reliability.
In a first aspect, an embodiment of the present application provides a data processing method, which is applied to a big data platform, where the big data platform includes a first scheduling engine and a second scheduling engine, a first flow and a first virtual flow are configured in the first scheduling engine, and a second flow and a second virtual flow are configured in the second scheduling engine;
the data processing method comprises the following steps:
receiving a data processing request;
responding to the data processing request according to the first scheduling engine and the second scheduling engine to generate a target execution result;
the first virtual flow is used for forwarding a first sub-request to be executed by the first scheduling engine to the second scheduling engine, the first sub-request is a sub-request corresponding to the second flow in the data processing request, the second virtual flow is used for forwarding a second sub-request to be executed by the second scheduling engine to the first scheduling engine, and the second sub-request is a sub-request corresponding to the first flow in the data processing request.
Optionally, the second process includes a plurality of sub-processes;
after the first scheduling engine and the second scheduling engine respond to the data processing request and generate the target execution result, the method further comprises:
clustering the multiple sub-processes according to the dependency information corresponding to each sub-process to obtain at least two first sets, wherein the dependency information is used for representing the association between the corresponding sub-process and other sub-processes;
determining a second set in the at least two first sets, wherein the second set is the first set with the largest number of corresponding elements in the at least two first sets;
and migrating the second set into the first scheduling engine.
Optionally, the dependency information includes at least one of:
directed acyclic graph (DAG) information;
table information;
and lineage information, which is obtained by metadata lineage analysis of the corresponding sub-flow.
Optionally, after the migration of the second set into the first scheduling engine, the method further includes:
acquiring first data and second data, wherein the first data is data obtained by executing a target sub-process by the first scheduling engine, the second data is data obtained by executing the target sub-process by the second scheduling engine, and the target sub-process is any sub-process in the second set;
and removing the target sub-process in the second scheduling engine under the condition that the first data and the second data are consistent.
Optionally, the responding, according to the first scheduling engine and the second scheduling engine, to the data processing request, generates a target execution result, including:
acquiring a first execution result of the first scheduling engine responding to the data processing request and a second execution result of the second scheduling engine responding to the data processing request;
and integrating the first execution result and the second execution result to obtain a target execution result.
Optionally, the integrating the first execution result and the second execution result to obtain a target execution result includes at least one of the following:
remotely synchronizing the second execution result into a first database comprising the first execution result, and obtaining the target execution result from the first database;
copying the first execution result and the second execution result into a third database based on a preset script, and obtaining the target execution result from the third database;
and integrating the first execution result and the second execution result based on an application program to obtain the target execution result.
In a second aspect, an embodiment of the present application further provides a data processing apparatus, which is applied to a big data platform, where the big data platform includes a first scheduling engine and a second scheduling engine, a first flow and a first virtual flow are configured in the first scheduling engine, and a second flow and a second virtual flow are configured in the second scheduling engine;
the data processing apparatus includes:
the receiving module is used for receiving the data processing request;
the response module is used for responding to the data processing request according to the first scheduling engine and the second scheduling engine and generating a target execution result;
the first virtual flow is used for forwarding a first sub-request to be executed by the first scheduling engine to the second scheduling engine, the first sub-request is a sub-request corresponding to the second flow in the data processing request, the second virtual flow is used for forwarding a second sub-request to be executed by the second scheduling engine to the first scheduling engine, and the second sub-request is a sub-request corresponding to the first flow in the data processing request.
Optionally, the second process includes a plurality of sub-processes;
the device also comprises a flow migration module;
the flow migration module is used for:
clustering the multiple sub-processes according to the dependency information corresponding to each sub-process to obtain at least two first sets, wherein the dependency information is used for representing the association between the corresponding sub-process and other sub-processes;
determining a second set in the at least two first sets, wherein the second set is the first set with the largest number of corresponding elements in the at least two first sets;
and migrating the second set into the first scheduling engine.
In a third aspect, an embodiment of the present application further provides an electronic device, where the electronic device includes a memory, a processor, and a program stored on the memory and executable on the processor; the processor is configured to read a program in a memory to implement the steps in the data processing method according to the first aspect.
In a fourth aspect, an embodiment of the present application further provides a readable storage medium storing a program, which when executed by a processor, implements the steps of the data processing method according to the first aspect.
In the embodiment of the application, the first scheduling engine and the second scheduling engine are configured on the big data platform, and the second flow is mapped by utilizing the first virtual flow, so that the possible conflict between the first scheduling engine and the second scheduling engine is avoided by using the way that the first virtual flow is mapped by the second virtual flow, the big data platform can have the capability of being compatible with different scheduling engines, the smooth execution of the data processing service of the big data platform is ensured, the service operation risk of the big data platform in the process of flow migration is reduced, and the reliability of the service operation executed by the big data platform in the process of flow migration is improved.
Drawings
FIG. 1 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a big data platform according to an embodiment of the present application;
fig. 3 is a schematic flow chart of a process for confirming the flows to be migrated according to an embodiment of the present application;
FIG. 4 is a flowchart of a method for obtaining a target execution result according to an embodiment of the present application;
FIG. 5 is a flowchart of another method for obtaining a target execution result according to an embodiment of the present application;
FIG. 6 is a flowchart of still another method for obtaining a target execution result according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms "first," "second," and the like in embodiments of the present application are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Furthermore, the use of "and/or" in the present application means at least one of the connected objects, such as a and/or B and/or C, means 7 cases including a alone a, B alone, C alone, and both a and B, both B and C, both a and C, and both A, B and C.
Referring to fig. 1, fig. 1 is a flow diagram of a data processing method provided by an embodiment of the present application, where the data processing method is applied to a big data platform, the big data platform includes a first scheduling engine and a second scheduling engine, a first flow and a first virtual flow are configured in the first scheduling engine, and a second flow and a second virtual flow are configured in the second scheduling engine; the first virtual flow is used for forwarding a first sub-request to be executed by the first scheduling engine to the second scheduling engine, the first sub-request is a sub-request corresponding to the second flow in the data processing request, the second virtual flow is used for forwarding a second sub-request to be executed by the second scheduling engine to the first scheduling engine, and the second sub-request is a sub-request corresponding to the first flow in the data processing request.
As shown in fig. 1, the data processing method includes the following steps:
step 101, a data processing request is received.
The data processing request may be a request initiated by an external device (such as a server, a terminal, etc.) to the big data platform (for obtaining and applying the big data analysis result), or may be a request actively initiated by the big data platform itself through a preset program or script (for obtaining and displaying the big data analysis result).
And 102, responding to the data processing request according to the first scheduling engine and the second scheduling engine, and generating a target execution result.
In the embodiment of the application, the first scheduling engine can be understood as the new scheduling engine used by the big data platform, and the second scheduling engine as the old scheduling engine it used previously; the first flow represents the several flows configured in the first scheduling engine, the second flow represents the several flows configured in the second scheduling engine, the first virtual flow is the mapping of the second flow in the first scheduling engine, and the second virtual flow is the mapping of the first flow in the second scheduling engine.
As shown in fig. 2, when the big data platform receives the data processing request, it processes the request through the cooperation of the first scheduling engine and the second scheduling engine. That is, each engine executes the flows actually configured in it (the first flow and the second flow, respectively) and feeds its execution result back to the other engine: the first scheduling engine feeds the result of executing the first flow back to the second scheduling engine, and the second scheduling engine feeds the result of executing the second flow back to the first scheduling engine. Finally, the result of executing the first flow and the result of executing the second flow are integrated to generate the target execution result used to respond to the data processing request.
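To make this cooperation concrete, the following is a minimal Python sketch of the dispatch logic described above; all class and method names (SchedulingEngine, handle, respond) are illustrative assumptions rather than part of the patent, and a real implementation would run the two engines as separate services.

```python
# Minimal sketch of the dual-engine dispatch described above.
# All names are illustrative assumptions, not part of the patent.

class SchedulingEngine:
    def __init__(self, name, real_flows):
        self.name = name
        self.real_flows = set(real_flows)  # flows actually configured here
        self.peer = None                   # the other scheduling engine

    def handle(self, sub_request):
        flow = sub_request["flow"]
        if flow in self.real_flows:
            # The flow is really configured in this engine: execute it.
            return {"engine": self.name, "flow": flow, "result": "ok"}
        if self.peer is None or flow not in self.peer.real_flows:
            raise KeyError(f"unknown flow: {flow}")
        # Otherwise the flow exists here only as a virtual flow, whose
        # job is to forward the sub-request to the engine hosting it.
        return self.peer.handle(sub_request)


def respond(request, first_engine, second_engine):
    # Each sub-request lands on either engine; virtual flows forward it
    # to the right one, and the per-flow results are integrated into
    # the target execution result.
    results = [first_engine.handle(sr) for sr in request["sub_requests"]]
    return {"target_execution_result": results}


new = SchedulingEngine("first", ["flow_a"])    # new engine hosts the first flow
old = SchedulingEngine("second", ["flow_b"])   # old engine hosts the second flow
new.peer, old.peer = old, new
print(respond({"sub_requests": [{"flow": "flow_a"}, {"flow": "flow_b"}]}, new, old))
```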
As described above, by configuring the first scheduling engine and the second scheduling engine on the big data platform, mapping the second flow with the first virtual flow, and mapping the first flow with the second virtual flow, possible conflicts between the two scheduling engines are avoided, so the big data platform can be compatible with different scheduling engines. This ensures the smooth execution of the platform's data processing services (by coordinating the different scheduling engines, the platform can still respond to data processing requests in time while the new scheduling engine gradually replaces the old one), reduces the business operation risk during flow migration, and improves the reliability of the business operations executed by the big data platform during flow migration.
It should be noted that the degree of association between the flows represented by the first flow is greater than the degree of association between the flows represented by the second flow. The degree of association between different flows can be understood as the degree of overlap of the table information corresponding to those flows. For example, if flow 1 corresponds to tables 1, 2 and 3, flow 2 corresponds to tables 1, 2 and 4, flow 3 corresponds to tables 1, 5 and 6, and flow 4 corresponds to tables 1, 7 and 8, then the degree of association between flow 1 and flow 2 is greater than that between flow 3 and flow 4. A flow corresponding to a table means that executing the flow may involve operating on the data in that table.
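As a small illustration of this table-overlap notion of association (a sketch only; the table names repeat the example above):

```python
# Degree of association measured as overlap of the tables each flow uses.
flow_tables = {
    "flow1": {"table1", "table2", "table3"},
    "flow2": {"table1", "table2", "table4"},
    "flow3": {"table1", "table5", "table6"},
    "flow4": {"table1", "table7", "table8"},
}

def association(a: str, b: str) -> int:
    # Number of tables the two flows have in common.
    return len(flow_tables[a] & flow_tables[b])

assert association("flow1", "flow2") == 2   # tables 1 and 2 overlap
assert association("flow3", "flow4") == 1   # only table 1 overlaps
```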
When feeding execution results back to each other, the first scheduling engine and the second scheduling engine can communicate via proxy-stub, web service, RESTful API, Thrift, and similar mechanisms. The big data platform can coordinate the two engines with ZooKeeper, and while the two engines coexist, YARN can be used to manage the platform's underlying resources uniformly, isolating and allocating CPU and memory resources to avoid disorderly resource contention between the first scheduling engine and the second scheduling engine.
Optionally, the second process includes a plurality of sub-processes;
after the first scheduling engine and the second scheduling engine respond to the data processing request and generate the target execution result, the method further comprises:
clustering the multiple sub-processes according to the dependency information corresponding to each sub-process to obtain at least two first sets, wherein the dependency information is used for representing the association between the corresponding sub-process and other sub-processes;
determining a second set in the at least two first sets, wherein the second set is the first set with the largest number of corresponding elements in the at least two first sets;
and migrating the second set into the first scheduling engine.
As described above, during flow migration, in addition to responding in time to the data processing requests received by the big data platform, the flows to be migrated in the next migration action need to be determined in the old scheduling engine (i.e., the second scheduling engine). Compared with migrating flows one by one, migrating several flows in a batch improves flow migration efficiency.
The multiple sub-processes are clustered according to the dependency information corresponding to each sub-process to obtain at least two first sets; the first set with the largest number of elements among the at least two first sets is determined as the second set; and the determined second set is migrated to the first scheduling engine in the next migration action. Once the current second set is confirmed to have been migrated successfully, the process is repeated (a new second set is determined in the second scheduling engine and migrated into the first scheduling engine) until all flows in the second scheduling engine have been migrated into the first scheduling engine.
In other words, the application clusters the multiple sub-processes in the second scheduling engine based on the dependency information corresponding to each sub-process, determines the cluster with the largest number of elements among the resulting clusters (i.e., the first sets) as the target cluster (i.e., the second set), and migrates the second set, thereby migrating several flows in a single flow migration action.
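A minimal sketch of this clustering step, treating the dependency information as undirected edges and the first sets as connected components (the component view is an assumption; the patent leaves the exact clustering method open):

```python
from collections import defaultdict

def cluster_sub_flows(sub_flows, dependency_edges):
    """Cluster sub-flows into first sets: sub-flows linked by dependency
    information end up in the same set (modelled here as connected
    components over undirected dependency edges)."""
    graph = defaultdict(set)
    for a, b in dependency_edges:
        graph[a].add(b)
        graph[b].add(a)
    seen, first_sets = set(), []
    for start in sub_flows:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        first_sets.append(component)
    return first_sets

def pick_second_set(first_sets):
    # The second set is the first set with the most elements.
    return max(first_sets, key=len)

sets_ = cluster_sub_flows(["s1", "s2", "s3", "s4"], [("s1", "s2"), ("s1", "s3")])
print(pick_second_set(sets_))   # {'s1', 's2', 's3'} (element order may vary)
```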
It should be noted that the data assets in the big data platform generally have a three-layer structure: an interface layer, a detail layer, and a mart layer. The interface layer collects the original data table information of the business system; the detail layer preprocesses the interface layer data, or associates several interface tables to form a wide table; and the mart layer associates the wide tables to form theme-oriented data assets for report statistics.
This three-layer data asset design divides the multiple sub-processes into three types: sub-processes associated with the interface layer, sub-processes associated with the detail layer, and sub-processes associated with the mart layer. In this case, the aforementioned dependency information can be understood both as the dependency relationship between different sub-processes in the same layer (for example, a dependency between sub-process A of the mart layer and sub-process B of the mart layer) and as the dependency relationship between sub-processes in different layers (for example, a dependency between sub-process C of the mart layer and sub-process D of the detail layer).
As described above, before flow migration, the dependency relationships and data dependencies between different sub-processes (in the same layer or in different layers) are sorted out, and at least one sub-process to be migrated is selected based on the result (that is, the process of determining the second set and migrating it to the first scheduling engine).
In the flow migration operation, a layer-by-layer migration manner is preferably applied (for example, first migrating the sub-processes associated with the mart layer, then those associated with the detail layer, and finally those associated with the interface layer) to ensure that the sub-processes in the second scheduling engine are migrated into the first scheduling engine in an orderly fashion.
In an example, the second set may be understood as a set formed by several sub-processes associated with the mart layer. Before the second set is migrated into the first scheduling engine, a number threshold may be set; if the number of elements in the second set is greater than or equal to the threshold, the second set can be split into at least two subsets, which are then migrated one by one according to their dependency relationships, avoiding an excessively large migration batch. It should be noted that each subset is verified after it is migrated, and the next subset is migrated only after the verification passes, which ensures the reliability of the flow migration operation.
For example, if the number threshold is 10 and the number of elements corresponding to the second set is 25, the second set may be split into a first subset (corresponding to 10 elements), a second subset (corresponding to 10 elements), and a third subset (corresponding to 5 elements), and then the first subset, the second subset, and the third subset may be migrated to the first scheduling engine one by one.
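A sketch of this threshold-based split (a hypothetical helper; the input is assumed to be ordered so that migrating the subsets one by one respects the dependency relationships):

```python
def split_second_set(ordered_sub_flows, threshold=10):
    # Split the second set into subsets of at most `threshold` elements.
    return [ordered_sub_flows[i:i + threshold]
            for i in range(0, len(ordered_sub_flows), threshold)]

chunks = split_second_set(list(range(25)), threshold=10)
print([len(c) for c in chunks])   # [10, 10, 5], as in the example above
```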
It should be noted that, in some embodiments, completion of the flow migration operation can be understood as all flows in the second scheduling engine having been migrated to the first scheduling engine. In other embodiments, because of differences in the underlying implementations of different scheduling engines, the new scheduling engine cannot completely replace the old one; in that case, completion of the flow migration operation is understood as follows: the flows in the second scheduling engine that the first scheduling engine can execute in its place are migrated into the first scheduling engine, while the flows that the first scheduling engine cannot execute in its place are retained in the second scheduling engine.
Further, the dependency information includes at least one of:
directed acyclic graph (DAG) information;
table information;
and lineage information, which is obtained by metadata lineage analysis of the corresponding sub-flow.
In some examples, the directed acyclic graph (DAG) explicitly specified when a sub-flow is configured may be analyzed to complete the clustering process between the multiple sub-flows.
In some examples, table information (such as table names, field names, etc.) corresponding to each sub-process may also be obtained by analyzing the data processing task configured in the sub-process, and then the clustering process between the plurality of sub-processes may be completed based on the table information corresponding to each sub-process.
In some examples, metadata lineage analysis (e.g., using Atlas) may also be applied to obtain the lineage information of each sub-flow, and the clustering process between the sub-flows is then completed based on that lineage information.
In some examples, the table information in the big data platform may be analyzed first, with table data stored in a table list and field data stored in a field list; the table data and field data referenced by each sub-flow are then obtained based on information such as the DAG graph, and the clustering process between the multiple sub-flows is finally completed based on the table data and field data each sub-flow references. When extracting the table data and field data referenced by a sub-flow, regular expressions may be used; for example, a table name must be preceded by a space and may be followed by a space or by ".field name", while a field name may be preceded by a space and must be followed by a space.
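By way of illustration only, loose patterns along these lines might look as follows (the exact patterns and the helper name are assumptions; production rules would have to match the platform's actual SQL dialect):

```python
import re

# Loose, illustrative patterns for the spacing rules sketched above:
# a table name is preceded by whitespace and followed by whitespace or
# ".field"; a field name follows a "." and is followed by whitespace.
TABLE_RE = re.compile(r"(?<=\s)([A-Za-z_]\w*)(?=\s|\.)")
FIELD_RE = re.compile(r"\.([A-Za-z_]\w*)(?=\s)")

def referenced_names(task_text: str):
    """Return the table names and field names referenced by one
    sub-flow's data processing task (hypothetical helper)."""
    tables = set(TABLE_RE.findall(task_text))
    fields = set(FIELD_RE.findall(task_text))
    return tables, fields

print(referenced_names(" orders.amount "))   # ({'orders'}, {'amount'})
```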
It should be noted that, in the actual clustering process, any one of the foregoing examples may be used on its own, or the manners provided by several of the foregoing examples may be combined (as shown in fig. 3); the embodiment of the present application does not limit this.
Optionally, after the migration of the second set into the first scheduling engine, the method further includes:
acquiring first data and second data, wherein the first data is data obtained by executing a target sub-process by the first scheduling engine, the second data is data obtained by executing the target sub-process by the second scheduling engine, and the target sub-process is any sub-process in the second set;
and removing the target sub-process in the second scheduling engine under the condition that the first data and the second data are consistent.
With the above arrangement, after the second set is migrated to the first scheduling engine, the execution of the target sub-process in the first scheduling engine is compared with its execution in the second scheduling engine to judge whether the migration of the target sub-process succeeded. This avoids the errors and inefficiency of manual judgment and further improves flow migration efficiency.
For example, for the data tables and fields involved in the target sub-flow, a data backup may be made before the target sub-flow is executed and recorded as data1; the target sub-flow is executed by the second scheduling engine, and the execution result (i.e., the second data) is recorded as data2; the database is then restored from data1, and finally the target sub-flow is executed by the first scheduling engine, with the execution result (i.e., the first data) recorded as data3.
data2 and data3 are then compared. If they are consistent, the target sub-flow has been migrated to the first scheduling engine correctly, and the target sub-flow in the second scheduling engine can be deleted, completing the migration of the target sub-flow. If they are inconsistent, the target sub-flow has not been migrated correctly; error information needs to be emitted so that the relevant personnel can locate the data difference between data2 and data3 and, based on that difference, resolve the problem that prevented the target sub-flow from being migrated to the first scheduling engine correctly.
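A sketch of this backup/replay verification (db, the engines, and all their methods are illustrative assumptions):

```python
def verify_migration(target_sub_flow, db, old_engine, new_engine):
    """Verify one migrated sub-flow as described above: snapshot, run the
    old engine, rewind, run the new engine, and compare the results."""
    data1 = db.backup(target_sub_flow.tables)   # backup before execution
    data2 = old_engine.run(target_sub_flow)     # second data
    db.restore(data1)                           # roll the tables back
    data3 = new_engine.run(target_sub_flow)     # first data
    if data2 == data3:
        old_engine.remove(target_sub_flow)      # migration verified: clean up
        return True
    # Inconsistent results: surface the difference for manual diagnosis.
    raise RuntimeError(f"migration mismatch for {target_sub_flow}: "
                       f"old={data2!r} new={data3!r}")
```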
Optionally, the responding, according to the first scheduling engine and the second scheduling engine, to the data processing request, generates a target execution result, including:
acquiring a first execution result of the first scheduling engine responding to the data processing request and a second execution result of the second scheduling engine responding to the data processing request;
and integrating the first execution result and the second execution result to obtain a target execution result.
Further, the integrating the first execution result and the second execution result to obtain a target execution result includes at least one of the following:
remotely synchronizing the second execution result into a first database comprising the first execution result, and obtaining the target execution result from the first database;
copying the first execution result and the second execution result into a third database based on a preset script, and obtaining the target execution result from the third database;
and integrating the first execution result and the second execution result based on an application program to obtain the target execution result.
In some embodiments, a view over the data table corresponding to the first scheduling engine and the data table corresponding to the second scheduling engine may be formed with remote tables (i.e., a remote synchronization approach; for example, Oracle databases can use dblink, and MySQL databases can use a federated-table mechanism), and the monitoring program then accesses the remote table directly (i.e., accesses the table information corresponding to the first database) to obtain the target execution result.
In some embodiments, as shown in fig. 4, the first execution result and the second execution result may be copied into a third database by data synchronization (e.g., implemented with a preset script), and the aforementioned monitoring program accesses the third database to obtain the target execution result. The third database may be the first database (corresponding to the first scheduling engine), the second database (corresponding to the second scheduling engine), or another database besides these two.
In some embodiments, the first execution result and the second execution result may be obtained and integrated by an application program to generate the target execution result. The two execution results may be obtained by calling the API interfaces of the first scheduling engine and the second scheduling engine (fig. 5), or by directly accessing the first database and the second database (fig. 6).
In the process of obtaining the target execution result, any one of the above three embodiments may be selected and implemented, or two or three of the above three embodiments may be combined and implemented, which is not limited by the embodiment of the present application.
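For the application-program variant, the integration can be as simple as the following sketch (the engine API shape is an assumption for illustration):

```python
def integrate_results(first_engine_api, second_engine_api):
    """Fetch both execution results through the engines' APIs (or,
    alternatively, by querying the first and second databases directly)
    and merge them into the target execution result."""
    first_result = first_engine_api.get_execution_result()    # first flow
    second_result = second_engine_api.get_execution_result()  # second flow
    return {"target_execution_result": [first_result, second_result]}
```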
Referring to fig. 7, fig. 7 is a schematic structural diagram of a data processing apparatus 200 according to an embodiment of the present application, where the data processing apparatus 200 is applied to a big data platform, and the big data platform includes a first scheduling engine and a second scheduling engine, and the first scheduling engine is configured with a first flow and a first virtual flow, and the second scheduling engine is configured with a second flow and a second virtual flow; the first virtual flow is used for forwarding a first sub-request to be executed by the first scheduling engine to the second scheduling engine, the first sub-request is a sub-request corresponding to the second flow in the data processing request, the second virtual flow is used for forwarding a second sub-request to be executed by the second scheduling engine to the first scheduling engine, and the second sub-request is a sub-request corresponding to the first flow in the data processing request.
As shown in fig. 7, the data processing apparatus 200 includes:
the receiving module 201 is configured to receive a data processing request;
and the response module 202 is configured to generate a target execution result according to the response of the first scheduling engine and the second scheduling engine to the data processing request.
Optionally, the second process includes a plurality of sub-processes;
the apparatus 200 further includes a flow migration module;
the flow migration module is used for:
clustering the multiple sub-processes according to the dependency information corresponding to each sub-process to obtain at least two first sets, wherein the dependency information is used for representing the association between the corresponding sub-process and other sub-processes;
determining a second set in the at least two first sets, wherein the second set is the first set with the largest number of corresponding elements in the at least two first sets;
and migrating the second set into the first scheduling engine.
Optionally, the dependency information includes at least one of:
directed acyclic graph (DAG) information;
table information;
and lineage information, which is obtained by metadata lineage analysis of the corresponding sub-flow.
Optionally, the apparatus 200 further comprises a verification module;
the verification module is used for:
acquiring first data and second data, wherein the first data is data obtained by executing a target sub-process by the first scheduling engine, the second data is data obtained by executing the target sub-process by the second scheduling engine, and the target sub-process is any sub-process in the second set;
and removing the target sub-process in the second scheduling engine under the condition that the first data and the second data are consistent.
Optionally, the response module 202 includes:
the response sub-module is used for acquiring a first execution result of the first scheduling engine responding to the data processing request and a second execution result of the second scheduling engine responding to the data processing request;
and the integration sub-module is used for integrating the first execution result and the second execution result to obtain a target execution result.
Optionally, the integration sub-module includes at least one of:
the first integration unit is used for remotely synchronizing the second execution result into a first database comprising the first execution result and obtaining the target execution result from the first database;
the second integration unit is used for copying the first execution result and the second execution result into a third database based on a preset script, and obtaining the target execution result from the third database;
and the third integration unit is used for integrating the first execution result and the second execution result based on an application program to obtain the target execution result.
The data processing apparatus 200 can implement the processes of the method embodiment of fig. 1 in the embodiment of the present application, and achieve the same beneficial effects, and in order to avoid repetition, a detailed description is omitted here.
The embodiment of the application also provides electronic equipment. Referring to fig. 8, an electronic device may include a processor 301, a memory 302, and a program 3021 stored on the memory 302 and executable on the processor 301.
The program 3021, when executed by the processor 301, may implement any steps and achieve the same advantageous effects in the method embodiment corresponding to fig. 1, which will not be described herein.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of implementing the methods of the embodiments described above may be implemented by hardware associated with program instructions, where the program may be stored on a readable medium.
The embodiment of the present application further provides a readable storage medium, where a computer program is stored, where the computer program when executed by a processor may implement any step in the method embodiment corresponding to fig. 1, and may achieve the same technical effect, so that repetition is avoided, and no further description is given here.
The computer-readable storage media of embodiments of the present application may take the form of any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium may be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or terminal. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
While the foregoing is directed to the preferred embodiments of the present application, it will be appreciated by those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present application, and such modifications and adaptations are intended to be comprehended within the scope of the present application.

Claims (8)

1. The data processing method is applied to a big data platform and is characterized in that the big data platform comprises a first scheduling engine and a second scheduling engine, wherein a first flow and a first virtual flow are configured in the first scheduling engine, and a second flow and a second virtual flow are configured in the second scheduling engine;
the data processing method comprises the following steps:
receiving a data processing request;
responding to the data processing request according to the first scheduling engine and the second scheduling engine to generate a target execution result;
the first virtual flow is used for forwarding a first sub-request to be executed by the first scheduling engine to the second scheduling engine, the first sub-request is a sub-request corresponding to the second flow in the data processing request, the second virtual flow is used for forwarding a second sub-request to be executed by the second scheduling engine to the first scheduling engine, and the second sub-request is a sub-request corresponding to the first flow in the data processing request;
the second process includes a plurality of sub-processes;
after the first scheduling engine and the second scheduling engine respond to the data processing request and generate the target execution result, the method further comprises:
clustering the multiple sub-processes according to the dependency information corresponding to each sub-process to obtain at least two first sets, wherein the dependency information is used for representing the association between the corresponding sub-process and other sub-processes;
determining a second set in the at least two first sets, wherein the second set is the first set with the largest number of corresponding elements in the at least two first sets;
and migrating the second set into the first scheduling engine.
2. The method of claim 1, wherein the dependency information comprises at least one of:
directed acyclic graph (DAG) information;
table information;
and lineage information, which is obtained by metadata lineage analysis of the corresponding sub-flow.
3. The method of claim 1, wherein after the migrating the second set into the first scheduling engine, the method further comprises:
acquiring first data and second data, wherein the first data is data obtained by executing a target sub-process by the first scheduling engine, the second data is data obtained by executing the target sub-process by the second scheduling engine, and the target sub-process is any sub-process in the second set;
and removing the target sub-process in the second scheduling engine under the condition that the first data and the second data are consistent.
4. The method of claim 1, wherein generating a target execution result in response to the data processing request according to the first scheduling engine and the second scheduling engine comprises:
acquiring a first execution result of the first scheduling engine responding to the data processing request and a second execution result of the second scheduling engine responding to the data processing request;
and integrating the first execution result and the second execution result to obtain a target execution result.
5. The method of claim 4, wherein integrating the first execution result and the second execution result to obtain a target execution result comprises at least one of:
remotely synchronizing the second execution result into a first database comprising the first execution result, and obtaining the target execution result from the first database;
copying the first execution result and the second execution result into a third database based on a preset script, and obtaining the target execution result from the third database;
and integrating the first execution result and the second execution result based on an application program to obtain the target execution result.
6. The data processing device is applied to a big data platform and is characterized by comprising a first scheduling engine and a second scheduling engine, wherein a first flow and a first virtual flow are configured in the first scheduling engine, and a second flow and a second virtual flow are configured in the second scheduling engine;
the data processing apparatus includes:
the receiving module is used for receiving the data processing request;
the response module is used for responding to the data processing request according to the first scheduling engine and the second scheduling engine and generating a target execution result;
the first virtual flow is used for forwarding a first sub-request to be executed by the first scheduling engine to the second scheduling engine, the first sub-request is a sub-request corresponding to the second flow in the data processing request, the second virtual flow is used for forwarding a second sub-request to be executed by the second scheduling engine to the first scheduling engine, and the second sub-request is a sub-request corresponding to the first flow in the data processing request;
the second process includes a plurality of sub-processes;
the device also comprises a flow migration module;
the flow migration module is used for:
clustering the multiple sub-processes according to the dependency information corresponding to each sub-process to obtain at least two first sets, wherein the dependency information is used for representing the association between the corresponding sub-process and other sub-processes;
determining a second set in the at least two first sets, wherein the second set is the first set with the largest number of corresponding elements in the at least two first sets;
and migrating the second set into the first scheduling engine.
7. An electronic device comprising a memory, a processor, and a program stored on the memory and executable on the processor; the processor being arranged to read a program in a memory for implementing the steps of the data processing method according to any one of claims 1 to 5.
8. A readable storage medium for storing a program which when executed by a processor implements the steps in the data processing method according to any one of claims 1 to 5.
CN202210943920.1A 2022-08-05 2022-08-05 Data processing method, device, electronic equipment and readable storage medium Active CN115237573B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210943920.1A CN115237573B (en) 2022-08-05 2022-08-05 Data processing method, device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210943920.1A CN115237573B (en) 2022-08-05 2022-08-05 Data processing method, device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN115237573A CN115237573A (en) 2022-10-25
CN115237573B true CN115237573B (en) 2023-08-18

Family

ID=83679045

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210943920.1A Active CN115237573B (en) 2022-08-05 2022-08-05 Data processing method, device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN115237573B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101470635A (en) * 2007-12-24 2009-07-01 联想(北京)有限公司 Method for multi-virtual processor synchronous scheduling and computer thereof
CN101719082A (en) * 2009-12-24 2010-06-02 中国科学院计算技术研究所 Method and system for dispatching application requests in virtual calculation platform
CN103444141A (en) * 2011-04-05 2013-12-11 瑞典爱立信有限公司 Packet scheduling method and apparatus
CN109861850A (en) * 2019-01-11 2019-06-07 中山大学 A method of the stateless cloud workflow load balance scheduling based on SLA
CN111611221A (en) * 2019-02-26 2020-09-01 北京京东尚科信息技术有限公司 Hybrid computing system, data processing method and device
CN114647491A (en) * 2020-12-17 2022-06-21 中移(苏州)软件技术有限公司 Task scheduling method, device, equipment and storage medium


Also Published As

Publication number Publication date
CN115237573A (en) 2022-10-25

Similar Documents

Publication Publication Date Title
US10540335B2 (en) Solution to generate a scriptset for an automated database migration
US10248671B2 (en) Dynamic migration script management
US10353913B2 (en) Automating extract, transform, and load job testing
CN110222036B (en) Method and system for automated database migration
CN110275861B (en) Data storage method and device, storage medium and electronic device
US9892121B2 (en) Methods and systems to identify and use event patterns of application workflows for data management
CN111125444A (en) Big data task scheduling management method, device, equipment and storage medium
US9892122B2 (en) Method and apparatus for determining a range of files to be migrated
CN108491254A (en) A kind of dispatching method and device of data warehouse
CN105677465B (en) The data processing method and device of batch processing are run applied to bank
CN111258881B (en) Intelligent test system for workflow test
CN112559525B (en) Data checking system, method, device and server
US9847941B2 (en) Selectively suppress or throttle migration of data across WAN connections
CN115237573B (en) Data processing method, device, electronic equipment and readable storage medium
CN111638920B (en) Method, device, electronic equipment and medium for processing computer program synchronous task
CN115878386A (en) Disaster recovery method and device, electronic equipment and storage medium
CN110928860B (en) Data migration method and device
WO2019062087A1 (en) Attendance check data testing method, terminal and device, and computer readable storage medium
US8321844B2 (en) Providing registration of a communication
CN104731697A (en) Running control method, control system and sound monitor of test case
CN110019448A (en) A kind of data interactive method and device
JP3547691B2 (en) Job inspection apparatus, job inspection method, and recording medium recording job inspection program
CN114968748B (en) Database testing method, system and device
US11762875B2 (en) Machine assisted data aggregation
CN114817393A (en) Data extraction and cleaning method and device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant