CN115237573A - Data processing method and device, electronic equipment and readable storage medium - Google Patents


Info

Publication number
CN115237573A
CN115237573A
Authority
CN
China
Prior art keywords
sub
scheduling engine
execution result
data processing
request
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210943920.1A
Other languages
Chinese (zh)
Other versions
CN115237573B (en)
Inventor
张子浪
叶臻
郝慧俊
李小言
刘海滨
Current Assignee
China Tower Co Ltd
Original Assignee
China Tower Co Ltd
Priority date
Filing date
Publication date
Application filed by China Tower Co Ltd filed Critical China Tower Co Ltd
Priority to CN202210943920.1A priority Critical patent/CN115237573B/en
Publication of CN115237573A publication Critical patent/CN115237573A/en
Application granted granted Critical
Publication of CN115237573B publication Critical patent/CN115237573B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • G06F 9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

Embodiments of the present application provide a data processing method and apparatus, an electronic device, and a readable storage medium. The method is applied to a big data platform that includes a first scheduling engine and a second scheduling engine, where a first flow and a first virtual flow are configured in the first scheduling engine, and a second flow and a second virtual flow are configured in the second scheduling engine. The method includes: receiving a data processing request; and generating a target execution result according to the responses of the first scheduling engine and the second scheduling engine to the data processing request. By having the first virtual flow map the second flow and the second virtual flow map the first flow, possible conflicts between the first scheduling engine and the second scheduling engine are avoided, so that the big data platform is compatible with different scheduling engines, thereby improving the reliability of the business operations executed by the big data platform during flow migration.

Description

Data processing method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data processing method and apparatus, an electronic device, and a readable storage medium.
Background
To mine the value of data, the data needs to be collected, processed, and analyzed. A big data platform is generally configured with a scheduling engine to coordinate its many data processing flows and ensure the reliable execution of each flow.
In practical applications, owing to program modification, upgrades, and similar reasons, the flows in a big data platform may need to be migrated from an old scheduling engine to a new scheduling engine. Related technologies generally perform this flow migration by applying a migration tool or through manual development, that is, by replacing the old scheduling engine with the new one. However, because of functional differences between scheduling engines, unpredictable errors may occur during migration (for example, the new scheduling engine may not support all the functions of the old one), which can severely disrupt the data processing operations of the big data platform. In other words, in the related art, the reliability of business operations performed during flow migration is low.
Disclosure of Invention
An object of the embodiments of the present application is to provide a data processing method and apparatus, an electronic device, and a readable storage medium, so as to solve the problem in the related art of the low reliability of business operations executed during flow migration.
In a first aspect, an embodiment of the present application provides a data processing method, which is applied to a big data platform, where the big data platform includes a first scheduling engine and a second scheduling engine, a first flow and a first virtual flow are configured in the first scheduling engine, and a second flow and a second virtual flow are configured in the second scheduling engine;
the data processing method comprises the following steps:
receiving a data processing request;
responding to the data processing request according to the first scheduling engine and the second scheduling engine, and generating a target execution result;
the first virtual process is configured to forward a first sub-request to be executed by the first scheduling engine to the second scheduling engine, where the first sub-request is a sub-request corresponding to the second process in the data processing request, the second virtual process is configured to forward a second sub-request to be executed by the second scheduling engine to the first scheduling engine, and the second sub-request is a sub-request corresponding to the first process in the data processing request.
Optionally, the second process includes a plurality of sub-processes;
after the generating a target execution result according to the response of the first scheduling engine and the second scheduling engine to the data processing request, the method further includes:
clustering the plurality of sub-processes according to the dependency information corresponding to each sub-process to obtain at least two first sets, wherein the dependency information is used for representing the association between the corresponding sub-process and other sub-processes;
determining a second set in the at least two first sets, wherein the second set is the first set with the largest number of corresponding elements in the at least two first sets;
migrating the second set into the first scheduling engine.
Optionally, the dependency information includes at least one of:
directed acyclic graph DAG information;
table information;
and lineage information, wherein the lineage information is obtained by analyzing the lineage of the metadata of the corresponding sub-process.
Optionally, after migrating the second set into the first scheduling engine, the method further includes:
acquiring first data and second data, wherein the first data is acquired by the first scheduling engine executing a target sub-process, the second data is acquired by the second scheduling engine executing the target sub-process, and the target sub-process is any one sub-process in the second set;
removing the target sub-flow within the second scheduling engine if the first data and the second data are consistent.
Optionally, the generating a target execution result according to the response of the first scheduling engine and the second scheduling engine to the data processing request includes:
acquiring a first execution result of the first scheduling engine responding to the data processing request and a second execution result of the second scheduling engine responding to the data processing request;
and integrating the first execution result and the second execution result to obtain a target execution result.
Optionally, the integrating the first execution result and the second execution result to obtain a target execution result includes at least one of the following:
remotely synchronizing the second execution result to a first database comprising the first execution result, and obtaining the target execution result from the first database;
copying the first execution result and the second execution result into a third database based on a preset script, and obtaining the target execution result from the third database;
and integrating the first execution result and the second execution result based on an application program to obtain the target execution result.
In a second aspect, an embodiment of the present application further provides a data processing apparatus, which is applied to a big data platform, where the big data platform includes a first scheduling engine and a second scheduling engine, a first flow and a first virtual flow are configured in the first scheduling engine, and a second flow and a second virtual flow are configured in the second scheduling engine;
the data processing apparatus includes:
the receiving module is used for receiving a data processing request;
the response module is used for responding to the data processing request according to the first scheduling engine and the second scheduling engine and generating a target execution result;
the first virtual process is configured to forward a first sub-request to be executed by the first scheduling engine to the second scheduling engine, where the first sub-request is a sub-request corresponding to the second process in the data processing request, the second virtual process is configured to forward a second sub-request to be executed by the second scheduling engine to the first scheduling engine, and the second sub-request is a sub-request corresponding to the first process in the data processing request.
Optionally, the second process includes a plurality of sub-processes;
the apparatus also includes a process migration module;
the flow migration module is used for:
clustering the plurality of sub-processes according to the dependency information corresponding to each sub-process to obtain at least two first sets, wherein the dependency information is used for representing the association between the corresponding sub-process and other sub-processes;
determining a second set in the at least two first sets, wherein the second set is the first set with the largest number of corresponding elements in the at least two first sets;
migrating the second set into the first scheduling engine.
In a third aspect, an embodiment of the present application further provides an electronic device, where the electronic device includes a memory, a processor, and a program stored in the memory and executable on the processor; the processor is configured to read the program in the memory to implement the steps in the data processing method according to the first aspect.
In a fourth aspect, embodiments of the present application further provide a readable storage medium, where the readable storage medium is used to store a program, and the program, when executed by a processor, implements the steps in the data processing method according to the first aspect.
In the embodiment of the application, a first scheduling engine and a second scheduling engine are configured on a big data platform, and a mode that a first virtual process maps a second process and a second virtual process maps a first process is utilized to avoid possible conflicts between the first scheduling engine and the second scheduling engine, so that the big data platform can have the capability of being compatible with different scheduling engines, thereby ensuring the smooth execution of data processing services of the big data platform, reducing the service operation risk of the big data platform in the process migration process, and improving the reliability of the service operation executed by the big data platform in the process migration process.
Drawings
Fig. 1 is a schematic flowchart of a data processing method provided in an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a big data platform provided in an embodiment of the present application;
fig. 3 is a schematic flow chart illustrating a process of confirming a to-be-migrated flow according to an embodiment of the present application;
fig. 4 is a schematic flowchart of obtaining a target execution result according to an embodiment of the present application;
fig. 5 is a schematic flowchart of another method for obtaining a target execution result according to an embodiment of the present disclosure;
FIG. 6 is a schematic flowchart of another method for obtaining a target execution result according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," and the like in the embodiments of the present application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus. Further, the use of "and/or" in this application means that at least one of the connected objects, e.g., a and/or B and/or C, means that 7 cases are included where a alone, B alone, C alone, and both a and B are present, B and C are present, a and C are present, and a, B, and C are present.
Referring to fig. 1, fig. 1 is a schematic flow diagram of a data processing method provided in an embodiment of the present application, where the data processing method is applied to a big data platform, where the big data platform includes a first scheduling engine and a second scheduling engine, a first flow and a first virtual flow are configured in the first scheduling engine, and a second flow and a second virtual flow are configured in the second scheduling engine; the first virtual process is configured to forward a first sub-request to be executed by the first scheduling engine to the second scheduling engine, where the first sub-request is a sub-request corresponding to the second process in the data processing request, the second virtual process is configured to forward a second sub-request to be executed by the second scheduling engine to the first scheduling engine, and the second sub-request is a sub-request corresponding to the first process in the data processing request.
As shown in fig. 1, the data processing method includes the steps of:
step 101, receiving a data processing request.
The data processing request may be a request (for obtaining and applying a big data analysis result) initiated by an external device (such as a server, a terminal, and the like) to the big data platform, or may be a request (for obtaining and displaying a big data analysis result) actively initiated by the big data platform itself through a preset program or script.
And 102, responding to the data processing request according to the first scheduling engine and the second scheduling engine, and generating a target execution result.
In the embodiment of the application, the first scheduling engine can be understood as a new scheduling engine applied by a big data platform, and the second scheduling engine can be understood as an old scheduling engine applied by the big data platform; the first flow is used for representing a plurality of flows configured in a first scheduling engine, the second flow is used for representing a plurality of flows configured in a second scheduling engine, the first virtual flow is the mapping of the second flow in the first scheduling engine, and the second virtual flow is the mapping of the first flow in the second scheduling engine.
As shown in fig. 2, when the big data platform receives a data processing request, it processes the request through the cooperation of the first scheduling engine and the second scheduling engine: each engine executes the flows actually configured in it (the first flow and the second flow, respectively), the engines feed the execution results back to each other (the first scheduling engine feeds the result of executing the first flow back to the second scheduling engine, and the second scheduling engine feeds the result of executing the second flow back to the first scheduling engine), and finally the result of executing the first flow and the result of executing the second flow are integrated to generate a target execution result that responds to the data processing request.
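The dual-engine dispatch and virtual-flow forwarding described above can be sketched as follows. This is a minimal illustration, not the patent's implementation; the class, method, and flow names are hypothetical, and it assumes every flow is actually hosted by exactly one of the two engines.

```python
class SchedulingEngine:
    """Hypothetical stand-in for a scheduling engine with real and virtual flows."""

    def __init__(self, name, real_flows):
        self.name = name
        self.real_flows = set(real_flows)  # flows actually configured in this engine

    def handle(self, sub_requests, peer):
        """Execute sub-requests for locally hosted flows; forward the rest
        to the peer engine, as the virtual flows do in the text."""
        results = {}
        for flow, payload in sub_requests.items():
            if flow in self.real_flows:
                results[flow] = f"{self.name}:{payload}"  # stand-in for real work
            else:
                # virtual flow: forward the sub-request to the engine that owns it
                results.update(peer.handle({flow: payload}, self))
        return results


def process_request(request, entry_engine, peer_engine):
    """Hand the whole request to one engine; forwarding plus the merged
    dict models the integrated target execution result."""
    return entry_engine.handle(request, peer_engine)
```

Either engine may serve as the entry point; sub-requests aimed at the other engine's flows are routed through the virtual flow, so no sub-request is executed twice.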
As described above, by configuring the first scheduling engine and the second scheduling engine on the big data platform, and using the manner that the first virtual process maps the second process, and the second virtual process maps the first process, a possible conflict between the first scheduling engine and the second scheduling engine is avoided, so that the big data platform can have the capability of being compatible with different scheduling engines, thereby ensuring smooth execution of data processing services of the big data platform (by coordinating measures of different scheduling engines, the big data platform can still respond to data processing requests in time in the process of gradually replacing old scheduling engines with new scheduling engines), reducing the risk of service operation of the big data platform in the process of process migration, and improving the reliability of service operation executed by the big data platform in the process of process migration.
It should be noted that the degree of association among the flows represented by the first flow is greater than the degree of association among the flows represented by the second flow. The degree of association between different flows can be understood as the degree to which the table information corresponding to each flow overlaps (for example, if process 1 corresponds to tables 1, 2, and 3; process 2 corresponds to tables 1, 2, and 4; process 3 corresponds to tables 1, 5, and 6; and process 4 corresponds to tables 1, 7, and 8, then the degree of association between process 1 and process 2 is greater than that between process 3 and process 4). A flow "corresponds" to a table when executing the flow involves operating on data in that table.
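The table-overlap notion of association degree can be expressed directly. A minimal sketch (the function name and the set representation of table information are assumptions, not from the patent):

```python
def association_degree(tables_a, tables_b):
    """Number of tables two flows have in common (table-overlap measure)."""
    return len(set(tables_a) & set(tables_b))


# The text's example: processes 1 and 2 share tables 1 and 2 (overlap 2),
# while processes 3 and 4 share only table 1 (overlap 1).
p1, p2 = {"table1", "table2", "table3"}, {"table1", "table2", "table4"}
p3, p4 = {"table1", "table5", "table6"}, {"table1", "table7", "table8"}
```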
Illustratively, when the first scheduling engine and the second scheduling engine feed execution results back to each other, they may communicate via mechanisms such as proxy-stub, web services, RESTful APIs, or the triple protocol. The big data platform may coordinate the first scheduling engine and the second scheduling engine using ZooKeeper, and while the two engines coexist, YARN may be used to manage the platform's underlying resources uniformly, isolating and allocating CPU and memory resources so that the two engines do not contend for resources in a disorderly manner.
Optionally, the second process includes a plurality of sub-processes;
after the generating a target execution result according to the response of the first scheduling engine and the second scheduling engine to the data processing request, the method further includes:
clustering the plurality of sub-processes according to the dependency information corresponding to each sub-process to obtain at least two first sets, wherein the dependency information is used for representing the association between the corresponding sub-process and other sub-processes;
determining a second set in the at least two first sets, wherein the second set is the first set with the largest number of corresponding elements in the at least two first sets;
migrating the second set into the first scheduler engine.
As described above, in the process of process migration, in addition to timely responding to the data processing request received by the big data platform, it is also necessary to determine a plurality of processes to be migrated in the next migration action in the old scheduling engine (i.e., the second scheduling engine).
Clustering the plurality of sub-processes according to the dependency information corresponding to each sub-process to obtain at least two first sets, determining the first set with the maximum number of corresponding elements in the at least two first sets as a second set, and then migrating the determined second set to the first scheduling engine in the next migration action. And under the condition that the current second set is successfully migrated, repeating the process (namely determining a new second set in the second scheduling engine and migrating the newly determined second set into the first scheduling engine) until all the processes in the second scheduling engine are migrated into the first scheduling engine.
In the present application, the sub-processes in the second scheduling engine are clustered based on the dependency information corresponding to each sub-process, the cluster with the largest number of elements among the resulting clusters (i.e., the first sets) is determined as the target cluster (i.e., the second set), and migrating the second set as a whole allows a batch of sub-processes to be migrated in a single flow migration action.
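The cluster-then-pick-largest step can be sketched as a connected-components computation over the dependency information. A minimal illustration, assuming dependencies arrive as pairs of sub-process identifiers (all names hypothetical):

```python
from collections import defaultdict, deque


def cluster_subflows(dependencies, subflows):
    """Group sub-processes into connected components ("first sets") using
    their dependency pairs, and return the largest one (the "second set")."""
    graph = defaultdict(set)
    for a, b in dependencies:          # treat dependencies as undirected links
        graph[a].add(b)
        graph[b].add(a)
    seen, first_sets = set(), []
    for node in subflows:
        if node in seen:
            continue
        component, queue = set(), deque([node])
        while queue:                   # breadth-first walk of one component
            cur = queue.popleft()
            if cur in seen:
                continue
            seen.add(cur)
            component.add(cur)
            queue.extend(graph[cur] - seen)
        first_sets.append(component)
    second_set = max(first_sets, key=len)  # the set with the most elements
    return second_set, first_sets
```

Sub-processes with no dependencies form singleton first sets, so every sub-process ends up in exactly one cluster.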
It should be noted that the data assets in the big data platform generally have a three-layer structure consisting of, in order, an interface layer, a detail layer, and a mart layer. The interface layer collects the original data table information of the business system; the detail layer preprocesses interface-layer data or joins multiple interface tables to form wide tables; and the mart layer associates the wide tables to form theme-oriented data assets for report statistics.
Based on this three-layer data asset design, the sub-processes can be divided into three types: sub-processes associated with the interface layer, sub-processes associated with the detail layer, and sub-processes associated with the mart layer. In this case, the dependency information can be understood either as a dependency between different sub-processes on the same layer (for example, between sub-process A and sub-process B, both associated with the mart layer) or as a dependency between sub-processes on different layers (for example, between sub-process C, associated with the mart layer, and sub-process D, associated with the detail layer).
As described above, before flow migration, the dependency relationships and data dependencies between different sub-processes (which may be on the same layer or on different layers) are sorted out, and at least one sub-process to be migrated is selected based on the result (i.e., the aforementioned process of determining the second set and migrating it to the first scheduling engine).
In the flow migration operation, a layer-by-layer migration is preferably applied (for example, first migrating the sub-processes associated with the mart layer, then those associated with the detail layer, and finally those associated with the interface layer) to ensure that the sub-processes in the second scheduling engine are migrated to the first scheduling engine in an orderly fashion.
In an example, the second set may be understood as a set formed by a plurality of sub-processes associated with the mart layer. Before the second set is migrated into the first scheduling engine, a quantity threshold may be set; if the number of elements in the second set is greater than or equal to the threshold, the second set may be split into at least two subsets, which are then migrated one by one according to the dependency relationships, avoiding migrating too many flows at once. It should be noted that after each subset is migrated it needs to be verified, and the migration of the next subset proceeds only after verification passes, ensuring the reliability of the flow migration operation.
For example, if the quantity threshold is 10 and the second set has 25 elements, the second set may be divided into a first subset (10 elements), a second subset (10 elements), and a third subset (5 elements), and these subsets are migrated to the first scheduling engine one by one.
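The threshold-based splitting in this example amounts to simple batching (function name assumed, not from the patent):

```python
def split_for_migration(second_set, threshold):
    """Split a to-be-migrated set into batches of at most `threshold` flows."""
    ordered = list(second_set)  # in practice, ordered by dependency
    return [ordered[i:i + threshold]
            for i in range(0, len(ordered), threshold)]
```

With a threshold of 10 and 25 elements this yields batches of 10, 10, and 5, matching the example above.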
It should be noted that, in some embodiments, completing the flow migration operation means that all flows in the second scheduling engine are migrated to the first scheduling engine. In other embodiments, because of underlying implementation differences between scheduling engines, a new scheduling engine may not be able to fully replace an old one; in that case, completing the flow migration operation means that the flows in the second scheduling engine that the first scheduling engine can execute in its stead are migrated to the first scheduling engine, while the flows that the first scheduling engine cannot execute remain in the second scheduling engine.
Further, the dependency information includes at least one of:
directed acyclic graph DAG information;
table information;
and lineage information, wherein the lineage information is obtained by analyzing the lineage of the metadata of the corresponding sub-process.
In some examples, a Directed Acyclic Graph (DAG) explicitly specified at the time of sub-flow configuration may be analyzed to complete the clustering process between multiple sub-flows.
In some examples, table information (e.g., table name, field name, etc.) corresponding to each sub-process may also be obtained by analyzing the data processing tasks configured in the sub-processes, and then the clustering process between multiple sub-processes may be completed based on the table information corresponding to each sub-process.
In some examples, metadata lineage analysis may also be applied (for example, using Apache Atlas) to obtain the lineage information of each sub-process, and the clustering of the sub-processes is completed based on that lineage information.
In some examples, the table information in the big data platform may be analyzed first, with table data stored in a table list and field data stored in a field list; the table data and field data referenced by each sub-process are then obtained based on information such as the DAG diagram, and finally the clustering of the sub-processes is completed based on the table data and field data each sub-process references. Regular expressions may be used to extract the table data and field data referenced by a sub-process; for example, a table name may be required to be preceded by a space and followed by a space or by "." plus a field name, while a field name may be required to be preceded by a table name plus "." and followed by a space.
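A hedged sketch of such extraction, assuming the sub-process's data processing task is available as plain SQL text; the patterns and names below are illustrative assumptions, not the patent's actual expressions:

```python
import re

# Table names appear after FROM/JOIN keywords; field references take the
# form "table.field".  Both patterns are simplifications for illustration.
TABLE_RE = re.compile(r"\b(?:from|join)\s+([A-Za-z_]\w*)", re.IGNORECASE)
FIELD_RE = re.compile(r"\b([A-Za-z_]\w*)\.([A-Za-z_]\w*)\b")


def referenced_objects(sql):
    """Return the table names and (table, field) pairs a SQL text references."""
    tables = set(TABLE_RE.findall(sql))
    fields = set(FIELD_RE.findall(sql))
    return tables, fields
```

Real SQL (subqueries, aliases, quoted identifiers) would need a proper parser; the regex approach only approximates the reference sets.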
It should be noted that, in the actual clustering process, any one of the above examples may be selected to be implemented, and may also be implemented in combination with multiple ways provided by the above examples (as shown in fig. 3), which is not limited in this application.
Optionally, after migrating the second set into the first scheduling engine, the method further includes:
acquiring first data and second data, wherein the first data is acquired by the first scheduling engine executing a target sub-process, the second data is acquired by the second scheduling engine executing the target sub-process, and the target sub-process is any one sub-process in the second set;
removing the target sub-flow within the second scheduling engine if the first data and the second data are consistent.
With this arrangement, after the second set is migrated to the first scheduling engine, the execution of the target sub-process in the first scheduling engine is compared with its execution in the second scheduling engine to judge whether the migration of the target sub-process succeeded. This avoids the errors and inefficiency of manual judgment and further improves flow migration efficiency.
For example, for the data tables and fields related to the target sub-process, a data backup may be performed before the target sub-process is executed, and the backup is recorded as data1; the target sub-process is then executed based on the second scheduling engine, and the execution result (namely, the second data) is recorded as data2; the database is then restored based on data1, the target sub-process is finally executed based on the first scheduling engine, and the execution result (namely, the first data) is recorded as data3.
Data2 and data3 are then compared. If they are consistent, the target sub-process has been correctly migrated to the first scheduling engine, and at this time the target sub-process in the second scheduling engine may be deleted, thereby completing the migration operation of the target sub-process. If they are inconsistent, the target sub-process has not been correctly migrated to the first scheduling engine, and error information needs to be sent out, so that related personnel can locate the data difference between data2 and data3 based on the error information and, according to the data difference, solve the problem that the target sub-process was not correctly migrated to the first scheduling engine.
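The backup-execute-restore-compare flow described above can be sketched as follows. The four callables stand in for real engine and database operations and are purely illustrative placeholders, not an actual scheduling-engine API.

```python
def verify_migration(backup, run_on_second, run_on_first, restore):
    """Verify a migrated target sub-process as described above.

    backup/restore snapshot and reset the affected tables; the two run_*
    callables execute the target sub-process on each engine. Returns True
    when both engines produce the same result from the same starting state.
    """
    data1 = backup()          # snapshot the affected tables (data1)
    data2 = run_on_second()   # execution result on the old (second) engine
    restore(data1)            # reset the database to the snapshot
    data3 = run_on_first()    # execution result on the new (first) engine
    return data2 == data3     # consistent => safe to delete from the second engine

# Toy usage with an in-memory "table":
table = {"rows": [1, 2, 3]}
ok = verify_migration(
    backup=lambda: dict(table),
    run_on_second=lambda: sum(table["rows"]),
    run_on_first=lambda: sum(table["rows"]),
    restore=lambda snap: table.update(snap),
)
```

When the comparison fails, a real implementation would raise or report the data difference instead of silently returning, so that personnel can locate the discrepancy as described above.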
Optionally, the generating a target execution result according to the response of the first scheduling engine and the second scheduling engine to the data processing request includes:
acquiring a first execution result of the first scheduling engine responding to the data processing request and a second execution result of the second scheduling engine responding to the data processing request;
and integrating the first execution result and the second execution result to obtain a target execution result.
Further, the integrating the first execution result and the second execution result to obtain a target execution result includes at least one of the following:
remotely synchronizing the second execution result into a first database comprising the first execution result, and obtaining the target execution result from the first database;
copying the first execution result and the second execution result into a third database based on a preset script, and obtaining the target execution result from the third database;
and integrating the first execution result and the second execution result based on an application program to obtain the target execution result.
In some embodiments, a view over the data table corresponding to the first scheduling engine and the data table corresponding to the second scheduling engine may be formed by means of a remote table (i.e., remote synchronization; an Oracle database may use the dblink tool, and a MySQL database may use the FEDERATED engine), and a monitoring program may then directly access the remote table (i.e., access the table information corresponding to the first database) to obtain the target execution result.
In some embodiments, as shown in fig. 4, the first execution result and the second execution result may be copied to a third database by means of database synchronization (e.g., implemented by a preset script), and the third database may then be accessed by the aforementioned monitoring program to obtain the target execution result. The third database may be the first database (corresponding to the first scheduling engine) or the second database (corresponding to the second scheduling engine), or may be another database other than the first database and the second database.
In some embodiments, the first execution result and the second execution result may be obtained and integrated by an application program to generate the target execution result. The first execution result and the second execution result may be obtained by calling the API interfaces of the first scheduling engine and the second scheduling engine (as in fig. 5), or by directly accessing the first database and the second database (as in fig. 6).
In the process of obtaining the target execution result, any one of the above three embodiments may be selected, or two or three of them may be combined; this is not limited in the embodiments of the present application.
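A minimal sketch of the application-program embodiment above: fetch the two execution results (here via hypothetical REST endpoints of the two scheduling engines) and merge them into one target execution result. The URLs and JSON response shape are assumptions for illustration, not a real engine API.

```python
import json
from urllib.request import urlopen

def fetch_result(url):
    """Fetch one engine's execution result from a (hypothetical) JSON API."""
    with urlopen(url) as resp:
        return json.load(resp)

def integrate(first_result, second_result):
    """Merge the two execution results, keyed by flow/sub-process name.

    The two engines execute disjoint flows, so keys should not collide;
    if they ever did, the second engine's entry would win here.
    """
    target = dict(first_result)
    target.update(second_result)
    return target

# Usage sketch (endpoints are placeholders):
# first = fetch_result("http://first-engine/api/result")
# second = fetch_result("http://second-engine/api/result")
# target = integrate(first, second)
```

The same `integrate` step applies when the results are read directly from the first and second databases instead of via API calls.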
Referring to fig. 7, fig. 7 is a schematic structural diagram of a data processing apparatus 200 according to an embodiment of the present application, where the data processing apparatus 200 is applied to a big data platform, the big data platform includes a first scheduling engine and a second scheduling engine, the first scheduling engine is configured with a first flow and a first virtual flow, and the second scheduling engine is configured with a second flow and a second virtual flow; the first virtual process is configured to forward a first sub-request to be executed by the first scheduling engine to the second scheduling engine, where the first sub-request is a sub-request corresponding to the second process in the data processing request, the second virtual process is configured to forward a second sub-request to be executed by the second scheduling engine to the first scheduling engine, and the second sub-request is a sub-request corresponding to the first process in the data processing request.
As shown in fig. 7, the data processing apparatus 200 includes:
the receiving module 201 is configured to receive a data processing request;
a response module 202, configured to generate a target execution result according to a response of the first scheduling engine and the second scheduling engine to the data processing request.
Optionally, the second process includes a plurality of sub-processes;
the apparatus 200 further comprises a process migration module;
the flow migration module is used for:
clustering the plurality of sub-processes according to the dependency information corresponding to each sub-process to obtain at least two first sets, wherein the dependency information is used for representing the association between the corresponding sub-process and other sub-processes;
determining a second set in the at least two first sets, wherein the second set is the first set with the largest number of corresponding elements in the at least two first sets;
migrating the second set into the first scheduling engine.
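The clustering step above can be sketched as follows: treat sub-processes as graph nodes, connect two nodes when their dependency information overlaps (here, when they reference a common table), take the connected components as the first sets, and pick the largest component as the second set to migrate. The union-find approach and the table-overlap criterion are one possible realization, assumed for illustration.

```python
from collections import defaultdict

def cluster_subflows(deps):
    """deps maps each sub-process name to the set of tables it references."""
    # Union-find over sub-processes.
    parent = {s: s for s in deps}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Union sub-processes that share a table.
    by_table = defaultdict(list)
    for sub, tables in deps.items():
        for t in tables:
            by_table[t].append(sub)
    for subs in by_table.values():
        for other in subs[1:]:
            parent[find(other)] = find(subs[0])

    # Gather the first sets (connected components).
    components = defaultdict(set)
    for sub in deps:
        components[find(sub)].add(sub)
    first_sets = list(components.values())
    # The second set is the first set with the most elements.
    second_set = max(first_sets, key=len)
    return first_sets, second_set

first_sets, second_set = cluster_subflows({
    "A": {"t1"}, "B": {"t1", "t2"}, "C": {"t3"},
})
```

The same grouping could equally be driven by DAG edges or metadata lineage (blood relationship) information instead of shared tables.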
Optionally, the dependency information includes at least one of:
directed acyclic graph DAG information;
table information;
and blood relationship information, wherein the blood relationship information is information obtained by analyzing the blood relationship of the metadata of the corresponding sub-process.
Optionally, the apparatus 200 further comprises a verification module;
the verification module is to:
acquiring first data and second data, wherein the first data is acquired by the first scheduling engine executing a target sub-process, the second data is acquired by the second scheduling engine executing the target sub-process, and the target sub-process is any one sub-process in the second set;
removing the target sub-flow within the second scheduling engine if the first data and the second data are consistent.
Optionally, the response module 202 includes:
the response submodule is used for acquiring a first execution result of the first scheduling engine responding to the data processing request and a second execution result of the second scheduling engine responding to the data processing request;
and the integration submodule is used for integrating the first execution result and the second execution result to obtain a target execution result.
Optionally, the integrating sub-module includes at least one of the following:
the first integration unit is used for remotely synchronizing the second execution result into a first database comprising the first execution result and obtaining the target execution result from the first database;
the second integration unit is used for copying the first execution result and the second execution result into a third database based on a preset script and obtaining the target execution result from the third database;
and the third integration unit is used for integrating the first execution result and the second execution result based on an application program to obtain the target execution result.
The data processing apparatus 200 can implement each process of the method embodiment of fig. 1 in the embodiments of the present application and achieve the same beneficial effects; to avoid repetition, details are not described here again.
The embodiment of the application also provides the electronic equipment. Referring to fig. 8, an electronic device may include a processor 301, a memory 302, and a program 3021 stored on the memory 302 and operable on the processor 301.
When executed by the processor 301, the program 3021 can implement any step of the method embodiment shown in fig. 1 and achieve the same beneficial effects; to avoid repetition, details are not described here again.
Those skilled in the art will appreciate that all or part of the steps of the method according to the above embodiments may be implemented by hardware associated with program instructions, and the program may be stored in a readable medium.
An embodiment of the present application further provides a readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program may implement any step in the method embodiment corresponding to fig. 1, and may achieve the same technical effect, and in order to avoid repetition, details are not repeated here.
The computer-readable storage media of the embodiments of the present application may take any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium may be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a storage medium may be transmitted over any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, or C++, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or terminal. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
While the foregoing is directed to the preferred embodiment of the present application, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the principles of the disclosure, and it is intended that such changes and modifications be considered as within the scope of the disclosure.

Claims (10)

1. A data processing method is applied to a big data platform and is characterized in that the big data platform comprises a first scheduling engine and a second scheduling engine, a first flow and a first virtual flow are configured in the first scheduling engine, and a second flow and a second virtual flow are configured in the second scheduling engine;
the data processing method comprises the following steps:
receiving a data processing request;
responding to the data processing request according to the first scheduling engine and the second scheduling engine, and generating a target execution result;
the first virtual process is configured to forward a first sub-request to be executed by the first scheduling engine to the second scheduling engine, where the first sub-request is a sub-request corresponding to the second process in the data processing request, the second virtual process is configured to forward a second sub-request to be executed by the second scheduling engine to the first scheduling engine, and the second sub-request is a sub-request corresponding to the first process in the data processing request.
2. The method of claim 1, wherein the second process comprises a plurality of sub-processes;
after the generating a target execution result according to the response of the first scheduling engine and the second scheduling engine to the data processing request, the method further includes:
clustering the plurality of sub-processes according to the dependency information corresponding to each sub-process to obtain at least two first sets, wherein the dependency information is used for representing the association between the corresponding sub-process and other sub-processes;
determining a second set in the at least two first sets, wherein the second set is the first set with the largest number of corresponding elements in the at least two first sets;
migrating the second set into the first scheduling engine.
3. The method of claim 2, wherein the dependency information comprises at least one of:
directed acyclic graph DAG information;
table information;
and blood relationship information, wherein the blood relationship information is information obtained by analyzing the blood relationship of the metadata of the corresponding sub-process.
4. The method of claim 2, wherein after migrating the second set into the first scheduling engine, the method further comprises:
acquiring first data and second data, wherein the first data is acquired by the first scheduling engine executing a target sub-process, the second data is acquired by the second scheduling engine executing the target sub-process, and the target sub-process is any one sub-process in the second set;
removing the target sub-flow within the second scheduling engine if the first data and the second data are consistent.
5. The method of claim 1, wherein generating a target execution result in response to the data processing request by the first scheduling engine and the second scheduling engine comprises:
acquiring a first execution result of the first scheduling engine responding to the data processing request and a second execution result of the second scheduling engine responding to the data processing request;
and integrating the first execution result and the second execution result to obtain a target execution result.
6. The method of claim 5, wherein the integrating the first execution result and the second execution result to obtain a target execution result comprises at least one of:
remotely synchronizing the second execution result to a first database comprising the first execution result, and obtaining the target execution result from the first database;
copying the first execution result and the second execution result into a third database based on a preset script, and obtaining the target execution result from the third database;
and integrating the first execution result and the second execution result based on an application program to obtain the target execution result.
7. A data processing device is applied to a big data platform and is characterized in that the big data platform comprises a first scheduling engine and a second scheduling engine, a first flow and a first virtual flow are configured in the first scheduling engine, and a second flow and a second virtual flow are configured in the second scheduling engine;
the data processing apparatus includes:
the receiving module is used for receiving a data processing request;
the response module is used for responding to the data processing request according to the first scheduling engine and the second scheduling engine and generating a target execution result;
the first virtual process is configured to forward a first sub-request to be executed by the first scheduling engine to the second scheduling engine, where the first sub-request is a sub-request corresponding to the second process in the data processing request, the second virtual process is configured to forward a second sub-request to be executed by the second scheduling engine to the first scheduling engine, and the second sub-request is a sub-request corresponding to the first process in the data processing request.
8. The apparatus of claim 7, wherein the second process comprises a plurality of sub-processes;
the apparatus also includes a process migration module;
the flow migration module is used for:
clustering the plurality of sub-processes according to the dependency information corresponding to each sub-process to obtain at least two first sets, wherein the dependency information is used for representing the association between the corresponding sub-process and other sub-processes;
determining a second set in the at least two first sets, wherein the second set is the first set with the largest number of corresponding elements in the at least two first sets;
migrating the second set into the first scheduling engine.
9. An electronic device comprising a memory, a processor, and a program stored on the memory and executable on the processor; the processor, which is used for reading the program in the memory to realize the steps in the data processing method according to any one of claims 1 to 6.
10. A readable storage medium for storing a program which, when executed by a processor, implements the steps in the data processing method of any one of claims 1 to 6.
CN202210943920.1A 2022-08-05 2022-08-05 Data processing method, device, electronic equipment and readable storage medium Active CN115237573B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210943920.1A CN115237573B (en) 2022-08-05 2022-08-05 Data processing method, device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210943920.1A CN115237573B (en) 2022-08-05 2022-08-05 Data processing method, device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN115237573A true CN115237573A (en) 2022-10-25
CN115237573B (en) 2023-08-18

Family

ID=83679045

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210943920.1A Active CN115237573B (en) 2022-08-05 2022-08-05 Data processing method, device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN115237573B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101470635A (en) * 2007-12-24 2009-07-01 联想(北京)有限公司 Method for multi-virtual processor synchronous scheduling and computer thereof
CN101719082A (en) * 2009-12-24 2010-06-02 中国科学院计算技术研究所 Method and system for dispatching application requests in virtual calculation platform
CN103444141A (en) * 2011-04-05 2013-12-11 瑞典爱立信有限公司 Packet scheduling method and apparatus
CN109861850A (en) * 2019-01-11 2019-06-07 中山大学 A method of the stateless cloud workflow load balance scheduling based on SLA
CN111611221A (en) * 2019-02-26 2020-09-01 北京京东尚科信息技术有限公司 Hybrid computing system, data processing method and device
CN114647491A (en) * 2020-12-17 2022-06-21 中移(苏州)软件技术有限公司 Task scheduling method, device, equipment and storage medium


Also Published As

Publication number Publication date
CN115237573B (en) 2023-08-18

Similar Documents

Publication Publication Date Title
US10353913B2 (en) Automating extract, transform, and load job testing
US8151248B1 (en) Method and system for software defect management
US20210311858A1 (en) System and method for providing a test manager for use with a mainframe rehosting platform
JP5970617B2 (en) Development support system
US8918783B2 (en) Managing virtual computers simultaneously with static and dynamic dependencies
EP2572294B1 (en) System and method for sql performance assurance services
CN111125444A (en) Big data task scheduling management method, device, equipment and storage medium
CN109062780A (en) The development approach and terminal device of automatic test cases
US20160019206A1 (en) Methods and systems to identify and use event patterns of application workflows for data management
CN108446326B (en) A kind of isomeric data management method and system based on container
CN111488109A (en) Method, device, terminal and storage medium for acquiring control information of user interface
CN109344053A (en) Interface coverage test method, system, computer equipment and storage medium
CN110427258A (en) Scheduling of resource control method and device based on cloud platform
CN111258881B (en) Intelligent test system for workflow test
CN112559525B (en) Data checking system, method, device and server
US9847941B2 (en) Selectively suppress or throttle migration of data across WAN connections
CN113760499A (en) Method, device, computing equipment and medium for scheduling computing unit
CN115237573B (en) Data processing method, device, electronic equipment and readable storage medium
CN115309558A (en) Resource scheduling management system, method, computer equipment and storage medium
WO2019062087A1 (en) Attendance check data testing method, terminal and device, and computer readable storage medium
CN110928860B (en) Data migration method and device
US20210357419A1 (en) Preventing dbms deadlock by eliminating shared locking
CN113641628A (en) Data quality detection method, device, equipment and storage medium
CN113220592A (en) Processing method and device for automated testing resources, server and storage medium
CN113434382A (en) Database performance monitoring method and device, electronic equipment and computer readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant