CN113296913A

CN113296913A - Data processing method, device and equipment based on single cluster and storage medium

Info

Publication number: CN113296913A
Application number: CN202110575939.0A
Authority: CN
Inventors: 吴辰侣; 刘明鑫
Original assignee: Weikun Shanghai Technology Service Co Ltd
Current assignee: Weikun Shanghai Technology Service Co Ltd
Priority date: 2021-05-25
Filing date: 2021-05-25
Publication date: 2021-08-24

Abstract

The application discloses a data processing method, a device, equipment and a storage medium based on a single cluster, wherein the method comprises the following steps: the method comprises the steps of obtaining at least two data processing tasks and determining a scheduling system for scheduling the data processing tasks based on the data processing tasks, wherein the at least two data processing tasks comprise data processing tasks of at least two task types, and the data processing tasks of one task type correspond to one scheduling system. When a processing request of a target data processing task in at least two data processing tasks is received, a target scheduling system for scheduling the target data processing task is determined according to the task type of the target data processing task. And scheduling the target data processing task to the target cluster resource of the target cluster based on the target scheduling system so as to execute the target data processing task through the target cluster resource. By adopting the method and the device, the development and maintenance cost of the big data cluster can be reduced, the operation is simple, and the applicability is high.

Description

Data processing method, device and equipment based on single cluster and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a data processing method, apparatus, device, and storage medium based on a single cluster.

Background

With the development and application of database technology, the amount of data stored in a database has grown from the megabyte scale to the gigabyte scale. Meanwhile, user queries have become increasingly complex. Data analysis and information synthesis over tens of millions of records across multiple tables, namely Online Analytical Processing (OLAP), is the most important application of a data warehouse system. ETL (Extract, Transform, Load) big data tasks, namely data extraction (Extract), cleaning (Cleaning), transformation (Transform), and loading (Load), are an important part of building a data warehouse. When an ETL big data task is released to a big data cluster, a bug in the ETL task can have a great impact on the online data.

During research and practice, the inventors of the present application found that, in the prior art, in order not to corrupt the original data of an ETL big data task, a pre-release cluster with the same configuration as the production cluster is generally prepared, so that algorithm debugging and optimization can be performed on the pre-release cluster. However, an enterprise generally has only one set of big data production cluster resources (including computing resources and storage resources); configuring a second set of big data cluster resources is very expensive in both setup cost and operation and maintenance cost, and is complex to operate.

Disclosure of Invention

The embodiments of the present application provide a data processing method, apparatus, device, and storage medium based on a single cluster, which can reduce the development and maintenance cost of a big data cluster, are simple to operate, and have high applicability.

A first aspect of the embodiments of the present application provides a data processing method based on a single cluster, including:

acquiring at least two data processing tasks, and determining a scheduling system for scheduling the data processing tasks based on the data processing tasks, wherein the at least two data processing tasks comprise data processing tasks of at least two task types, the data processing tasks of one task type correspond to one scheduling system, one scheduling system is associated with one cluster resource in a target cluster, and the cluster resource is used for executing the data processing tasks;

when a processing request of a target data processing task in the at least two data processing tasks is received, determining a task type of the target data processing task based on the processing request, and determining a target scheduling system for scheduling the target data processing task according to the task type of the target data processing task;

scheduling the target data processing task to a target cluster resource of the target cluster based on the target scheduling system, so as to execute the target data processing task through the target cluster resource;

and storing response data obtained by executing the target data processing task to a target database through the target cluster resources.

With reference to the first aspect, in a possible implementation manner, before the obtaining at least two data processing tasks, the method further includes:

at least two cluster resources are determined from the target cluster, and the association between each cluster resource of the at least two cluster resources and at least two scheduling systems is established, wherein one cluster resource is associated with one scheduling system.

With reference to the first aspect, in a possible implementation manner, the establishing association between each of the at least two cluster resources and at least two scheduling systems includes:

associating, based on the scheduling system identifier of each of the at least two scheduling systems, the scheduling system identifier of each scheduling system with a cluster resource in the target cluster, so as to establish the association between each cluster resource and each scheduling system;

wherein the scheduling system identifier of one scheduling system is associated with one of the cluster resources in the target cluster.
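By way of non-limiting illustration, the one-to-one association between scheduling system identifiers and cluster resources described above might be sketched in Python as follows. The function name `associate`, the identifiers, and the resource names are all hypothetical and do not appear in the application; the sketch only shows one way the one-to-one constraint could be checked.

```python
def associate(scheduler_ids, cluster_resources):
    """Associate each scheduling system identifier with exactly one cluster resource.

    Both arguments are lists of strings; all names used here are illustrative.
    """
    if len(scheduler_ids) < 2 or len(cluster_resources) < 2:
        raise ValueError("at least two scheduling systems and two cluster resources are required")
    if len(set(scheduler_ids)) != len(scheduler_ids):
        raise ValueError("scheduling system identifiers must be unique")
    if len(set(cluster_resources)) != len(cluster_resources):
        raise ValueError("cluster resources must be distinct")
    if len(scheduler_ids) != len(cluster_resources):
        raise ValueError("each scheduling system needs exactly one cluster resource")
    # Pair the identifiers and resources position by position.
    return dict(zip(scheduler_ids, cluster_resources))


association = associate(
    ["sched-debug", "sched-work"],        # hypothetical scheduling system identifiers
    ["debug-resource", "work-resource"],  # hypothetical cluster resources
)
```

Rejecting duplicate resources up front keeps two scheduling systems from silently sharing one cluster resource, which would defeat the isolation between task types.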

With reference to the first aspect, in a possible implementation manner, the processing request carries a target task identifier of the target data processing task; the determining the task type of the target data processing task based on the processing request includes:

and determining the task type of the target data processing task based on the target task identifier carried in the processing request, wherein the task type of the target data processing task comprises one of debugging, working or testing.
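The application does not specify the format of the target task identifier. As a minimal sketch, assuming a hypothetical convention in which the identifier begins with the task type followed by a hyphen (for example `debug-...`), the task type could be derived from the processing request as follows. The `TaskType` enum, the `task_id` key, and the prefix convention are assumptions made for illustration only.

```python
from enum import Enum


class TaskType(Enum):
    DEBUG = "debug"  # debugging task
    WORK = "work"    # working task
    TEST = "test"    # testing task


def task_type_from_request(request):
    """Derive the task type from the task identifier carried in the request.

    Assumes (hypothetically) that the identifier looks like '<type>-<name>'.
    """
    task_id = request["task_id"]
    prefix = task_id.split("-", 1)[0]
    try:
        return TaskType(prefix)
    except ValueError:
        raise ValueError(f"unknown task type in task identifier: {task_id!r}") from None


example_type = task_type_from_request({"task_id": "debug-user-portrait-cleaning"})
```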

With reference to the first aspect, in a possible implementation manner, the scheduling the target data processing task to the target cluster resource of the target cluster based on the target scheduling system includes:

determining a resource parameter corresponding to a target cluster resource in the target cluster based on the scheduling system identifier of the target scheduling system;

and configuring the resource parameters corresponding to the target cluster resources into the target scheduling system so as to schedule the target data processing tasks into the target cluster resources of the target cluster based on the target scheduling system.

With reference to the first aspect, in a possible implementation manner, the determining, based on the scheduling system identifier of the target scheduling system, a resource parameter corresponding to a target cluster resource in the target cluster includes:

determining a target cluster resource in a target cluster associated with the target scheduling system based on the scheduling system identifier of the target scheduling system, and determining a resource parameter corresponding to the target cluster resource based on the target cluster resource in the target cluster;

the target cluster resources comprise storage resources and/or computing resources.

In a second aspect, the present application provides a data processing apparatus, comprising:

an acquisition module, configured to acquire at least two data processing tasks and determine, based on each data processing task, a scheduling system for scheduling that data processing task, wherein the at least two data processing tasks comprise data processing tasks of at least two task types, the data processing tasks of one task type correspond to one scheduling system, one scheduling system is associated with one cluster resource in a target cluster, and the cluster resource is used for executing the data processing tasks;

a first determining module, configured to, when a processing request of a target data processing task of the at least two data processing tasks is received, determine a task type of the target data processing task based on the processing request, and determine a target scheduling system for scheduling the target data processing task according to the task type of the target data processing task;

a first scheduling module, configured to schedule the target data processing task to a target cluster resource of the target cluster based on the target scheduling system, so as to execute the target data processing task through the target cluster resource;

and the first storage module is used for storing the response data obtained by executing the target data processing task to a target database through the target cluster resource.

With reference to the second aspect, in a possible implementation manner, the apparatus further includes:

the second determining module is configured to determine at least two cluster resources from the target cluster, and establish association between each of the at least two cluster resources and at least two scheduling systems, where one cluster resource is associated with one scheduling system.

the association module is used for associating the scheduling system identifier of each scheduling system with each cluster resource in the target cluster based on the scheduling system identifier of each scheduling system in at least two scheduling systems so as to establish the association between each cluster resource and each scheduling system;

With reference to the second aspect, in a possible implementation manner, the processing request carries a target task identifier of the target data processing task; the apparatus further includes:

and a third determining module, configured to determine a task type of the target data processing task based on a target task identifier carried in the processing request, where the task type of the target data processing task includes one of debugging, working, and testing.

a fourth determining module, configured to determine, based on the scheduling system identifier of the target scheduling system, a resource parameter corresponding to a target cluster resource in the target cluster;

the first scheduling module is further configured to:

With reference to the second aspect, in a possible implementation manner, the fourth determining module is further configured to:

In a third aspect, the present application provides a computer device comprising: a processor, a memory, and a network interface;

the memory is configured to store program code, and the processor is configured to call the program code to perform the method of the first aspect or of any one of the possible implementation manners of the first aspect of the present application.

In a fourth aspect, the present application provides a computer-readable storage medium storing a computer program, the computer program comprising program instructions that, when executed by a processor, perform the method of the first aspect or of any one of the possible implementation manners of the first aspect of the present application.

In the present application, at least two data processing tasks are acquired, and a scheduling system for scheduling each data processing task is determined based on that data processing task, where the at least two data processing tasks include data processing tasks of at least two task types, and one scheduling system is associated with one cluster resource in a target cluster. When a processing request for a target data processing task among the at least two data processing tasks is received, a target scheduling system for scheduling the target data processing task is determined according to the task type of the target data processing task. The target data processing task is then scheduled to a target cluster resource of the target cluster based on the target scheduling system, so that the target data processing task is executed by the target cluster resource. Finally, response data obtained by executing the target data processing task is stored in a target database through the target cluster resource. With this scheme, cluster resources can be partitioned out of one big data cluster, so that, for different data processing tasks, the corresponding target scheduling system schedules the target data processing task to the target cluster resource for execution. In this way, one set of code can execute data processing tasks of different types (debugging, working, testing, and the like) within a single big data cluster, which reduces the development and maintenance cost of the big data cluster, is simple to operate, and has high applicability.

Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from these drawings without creative effort.

Fig. 1 is a schematic diagram of a network architecture provided in an embodiment of the present application;

Fig. 2 is a schematic diagram of a scenario of a data processing method based on a single cluster according to an embodiment of the present application;

Fig. 3 is a schematic flowchart of a data processing method based on a single cluster according to an embodiment of the present application;

Fig. 4 is another schematic flowchart of a data processing method based on a single cluster according to an embodiment of the present application;

Fig. 5 is a schematic structural diagram of a data processing apparatus based on a single cluster provided in an embodiment of the present application;

Fig. 6 is another schematic structural diagram of a data processing apparatus based on a single cluster provided in the present application;

Fig. 7 is a schematic structural diagram of a computer device provided in an embodiment of the present application.

Detailed Description

The technical solutions in the present application will be described clearly and completely with reference to the accompanying drawings in the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The data processing method based on a single cluster provided by the embodiments of the present application belongs to Cloud Technology (CT) within the field of computer technologies. Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology, and the like applied under the cloud computing business model, and refers to a hosting technology that unifies resources such as hardware, software, and networks within a wide area network or a local area network to realize the computation, storage, processing, and sharing of big data. These resources can form a resource pool that is used on demand, which is flexible and convenient. Cloud computing technology will become an important support. The background services of technical network systems, such as video websites, picture websites, and web portals, require a large amount of computing and storage resources. With the rapid development and application of the internet industry, each item may have its own identifier that needs to be transmitted to a background system for logical processing; data of different levels is processed separately, and all kinds of industry data require strong back-end system support, which can only be realized through cloud computing.

Big Data (BD) refers to a data set that cannot be captured, managed, and processed with conventional software tools within a certain time range; it is a massive, fast-growing, and diversified information asset that requires new processing modes in order to provide stronger decision-making power, insight discovery, and process optimization capability. With the advent of the cloud era, big data has attracted more and more attention, and special technologies are needed to process large amounts of data effectively within a tolerable elapsed time. Technologies suitable for big data include massively parallel processing databases, data mining, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems. A Database (DB) can, in short, be regarded as an electronic file cabinet, that is, a place for storing electronic files, in which a user can add, query, update, and delete data. A "database" is a collection of data that is stored together in a certain manner, can be shared by multiple users, has as little redundancy as possible, and is independent of applications.

Fig. 1 is a network architecture diagram provided in an embodiment of the present application. As shown in fig. 1, the network architecture may include a service server 1000 and a background server cluster, where the background server cluster may include a plurality of background servers, specifically a background server 100a, a background server 100b, a background server 100c, …, and a background server 100n. As shown in fig. 1, the background server 100a, the background server 100b, the background server 100c, …, and the background server 100n may each be connected to the service server 1000 through a network, so that each background server can exchange data with the service server 1000 over the network connection and can receive service data from the service server 1000.

The service server 1000 shown in fig. 1 may correspond to a plurality of user terminals and may be configured to store service data of the corresponding user terminals. A target application may be installed on each user terminal; when the target application runs on a user terminal, the service server corresponding to that user terminal may store service data of the application and perform data interaction with the background servers shown in fig. 1 (the background server 100a, the background server 100b, the background server 100c, …, and the background server 100n). Optionally, the target application may include an application having a function of displaying data information such as text, images, and videos. For example, the target application may be an inventory management application, which a user may use to upload initial inventory data, have the inventory data processed, acquire the processed inventory data from the target database, and perform subsequent operations. Alternatively, the target application may be a user portrait management application, which a manager may use to upload initial user portrait data, have that data processed, acquire user portrait data carrying user tags from the target database, and carry out subsequent marketing planning. The service server 1000 in the present application may collect service data such as images or text uploaded through these applications and transmit the service data to each background server through the network connection for data processing.

Optionally, the background server may be any one selected from the background server cluster shown in fig. 1. For example, the background server may be the background server 100a; the background server 100a may then acquire at least two data processing tasks and determine, based on each data processing task, a scheduling system for scheduling that data processing task. The at least two data processing tasks include data processing tasks of at least two task types, the data processing tasks of one task type correspond to one scheduling system, and one scheduling system is associated with one cluster resource in the target cluster. The target cluster may be a big data cluster built in the background server cluster, where the background server cluster may include a plurality of background servers, as shown in fig. 1, specifically a background server 100a, a background server 100b, a background server 100c, …, and a background server 100n. When the background server 100a receives a processing request for a target data processing task among the at least two data processing tasks, a target scheduling system for scheduling the target data processing task is determined according to the task type of the target data processing task. The target data processing task is then scheduled to a target cluster resource of the target cluster based on the target scheduling system, so that the target data processing task is executed by the target cluster resource. The cluster resources may include a cluster resource for executing debugging tasks, a cluster resource for executing working tasks, a cluster resource for executing testing tasks, and the like, each of which may include computing resources, storage resources, and the like. Finally, the response data obtained by executing the target data processing task is stored in a target database through the target cluster resource, so that a user can obtain the processed data from the target database at the user terminal and perform subsequent operations based on the processed data.

It is understood that the method provided by the embodiment of the present application may be executed by a computer device, where the computer device includes, but is not limited to, a terminal or a server, and the service server 1000, the backend server 100a, the backend server 100b, the backend servers 100c, …, and the backend server 100n in the embodiment of the present application may be computer devices, and are not limited herein. The service server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, network service, cloud communication, middleware service, domain name service, security service, CDN, big data and artificial intelligence platform. The terminal may include: smart terminals such as smart phones, tablet computers, notebook computers, desktop computers, smart watches, but not limited thereto. The user terminal and the service server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.

Referring to fig. 2, fig. 2 is a schematic diagram of a scenario of a data processing method based on a single cluster according to an embodiment of the present application. As shown in fig. 2, when a user A uses a target application (e.g., a user portrait management application) on a user terminal, the background server 100a acquires at least two data processing tasks (e.g., a user portrait data cleaning debugging task and a user portrait data cleaning working task) and determines, based on the data processing tasks, the scheduling systems for scheduling them (e.g., a debugging task scheduling system and a working task scheduling system). The at least two data processing tasks include data processing tasks of at least two task types (e.g., a user portrait data cleaning debugging task, a user portrait data cleaning working task, a user portrait data extraction debugging task, a user portrait data reflow working task, and a user portrait data cleaning testing task); the data processing tasks of one task type correspond to one scheduling system, and one scheduling system is associated with one cluster resource in the target cluster. The cluster resources here may include a cluster resource for executing debugging tasks, a cluster resource for executing working tasks, a cluster resource for executing testing tasks, and the like, each of which may also include computing resources, storage resources, and the like.
When the user A initiates a data processing task request through a user terminal 10b, the background server 100a determines a target scheduling system (e.g., the working task scheduling system) for scheduling the target data processing task according to the task type of the target data processing task (e.g., a user portrait data cleaning working task), and schedules the target data processing task to a target cluster resource of the target cluster (e.g., the cluster resource for executing working tasks) based on the target scheduling system, so that the target data processing task is executed by the target cluster resource. The background server 100a then stores the response data obtained by executing the target data processing task in the target database 20 through the target cluster resource. The user terminal 10b may view, from the target database 20, the response data obtained after the target data processing task is executed (e.g., the cleaned user portrait data).

Further, for ease of understanding, please refer to fig. 3, which is a schematic flowchart of a data processing method based on a single cluster according to an embodiment of the present application. The method may be executed by a service server (e.g., the service server 1000 shown in fig. 1), or may be executed jointly by a background server and a service server (e.g., the service server 1000 and the background server 100a in the embodiments corresponding to fig. 1 or fig. 2). For ease of understanding, this embodiment is described taking execution of the method by the service server as an example. The data processing method includes at least the following steps S101 to S104:

S101, at least two data processing tasks are obtained, and a scheduling system for scheduling the data processing tasks is determined based on the data processing tasks.

In some possible embodiments, at least two data processing tasks are acquired, and a scheduling system for scheduling each data processing task is determined based on that data processing task. It can be understood that the at least two data processing tasks include data processing tasks of at least two task types, the data processing tasks of one task type correspond to one scheduling system, one scheduling system is associated with one cluster resource in the target cluster, and the cluster resource associated with a scheduling system is used for executing data processing tasks of one task type. In an optional embodiment of the present application, a corresponding scheduling system is built for the data processing tasks of each task type, where a scheduling system is a system capable of submitting a data processing task to a target cluster resource for execution either periodically or once. That one scheduling system is associated with one cluster resource in the target cluster means that a data processing task is scheduled to the target cluster resource by its corresponding target scheduling system for execution. Optionally, the scheduling system may be the open-source scheduling system Azkaban or the open-source scheduling system Oozie. Compared with Azkaban, Oozie is a heavyweight task scheduling system with more comprehensive functions but more complex configuration and use, so the lightweight open-source scheduling system Azkaban may be the more suitable candidate when those additional functions are not needed. In an optional embodiment of the present application, when the types of the acquired data processing tasks are a working task and a debugging task, the target cluster resource executing the working task may be called the green environment, and the target cluster resource executing the debugging task may be called the blue environment.
When the work task and the debugging task are executed in the target cluster at the same time, the task execution mode can be called Blue Green Deployment (BGD), that is, an online Deployment mode that can ensure that the target cluster continuously provides service for a data user, and when one data processing task (for example, a user portrait data cleaning task) needs to perform task debugging in a Blue environment, the Green environment can also perform task execution on the data processing task of a work type (for example, a user portrait data cleaning work task). In an optional embodiment of the present application, a Hadoop big data cluster may be used as a target cluster of the data processing tasks, and before the at least two data processing tasks are obtained, at least two cluster resources may be determined from the target cluster, and an association between each cluster resource of the at least two cluster resources and at least two scheduling systems is established, where one cluster resource is associated with one scheduling system. Optionally, the cluster resources herein may include cluster resources for executing a debugging task, cluster resources for executing a working task, cluster resources for executing a testing task, and the like, where the cluster resources for executing the debugging task, the cluster resources for executing the working task, and the cluster resources for executing the testing task may further include computing resources, storage resources, and the like, respectively. 
The storage resources may include the Hadoop Distributed File System (HDFS) of the Hadoop big data cluster. It can be understood that HDFS is a distributed file system suitable for running on commodity hardware, in which files can be stored in a distributed manner across different servers. The computing resources include queues, each capable of executing multiple data processing tasks of the same type at the same time (a queue mainly executes the task content of the data processing tasks one by one in instruction order and stores the related data of the data processing tasks to be processed), the Central Processing Units (CPUs) required to execute the tasks in the queues, and resources such as the memory occupied by the queues. When data processing tasks are executed, if the number of currently executing data processing tasks exceeds the maximum number of CPU cores allocated to a queue (that is, the number of data processing tasks that can be processed simultaneously), the excess data processing tasks must wait in the queue for execution.
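The queuing behaviour described above can be illustrated with a simplified in-memory model. The following Python sketch is not the cluster's actual resource manager; the class `ResourceQueue`, its `max_concurrent` parameter, and the task names are hypothetical and serve only to show that tasks beyond the allocated limit wait until a running task finishes.

```python
from collections import deque


class ResourceQueue:
    """Simplified model of a compute queue with a fixed concurrency limit."""

    def __init__(self, max_concurrent):
        self.max_concurrent = max_concurrent  # e.g. CPU cores allocated to the queue
        self.running = set()                  # tasks currently executing
        self.waiting = deque()                # tasks waiting for a free slot, in order

    def submit(self, task):
        """Start the task if a slot is free; otherwise make it wait."""
        if len(self.running) < self.max_concurrent:
            self.running.add(task)
            return "running"
        self.waiting.append(task)
        return "queued"

    def finish(self, task):
        """Mark a task as done and start the oldest waiting task, if any."""
        self.running.remove(task)
        if self.waiting:
            promoted = self.waiting.popleft()
            self.running.add(promoted)
            return promoted
        return None


queue = ResourceQueue(max_concurrent=2)
states = [queue.submit(name) for name in ("task-1", "task-2", "task-3")]
promoted = queue.finish("task-1")
```

With a limit of two, the third submission is queued and is started only when one of the first two tasks finishes.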

S102, when a processing request of a target data processing task in at least two data processing tasks is received, determining a target scheduling system for scheduling the target data processing task according to the task type of the target data processing task.

In some possible embodiments, when a processing request of a target data processing task of the at least two data processing tasks is received, a target scheduling system for scheduling the target data processing task is determined according to the task type of the target data processing task. Optionally, the at least two task types may include task types such as a debugging task, a work task, and a test task. It can thus be understood that the at least two data processing tasks of at least two task types obtained here may be, for example, one debugging task and one work task; one debugging task and one test task; or two work tasks and one test task. Determining the scheduling system for scheduling each data processing task may mean determining it based on the task type of each data processing task. For example, if the task type of the target data processing task is a debugging task, the debugging task scheduling system corresponding to the debugging task is determined as the scheduling system of the target data processing task.
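As an illustration of the one-to-one correspondence between task types and scheduling systems, the following sketch uses a plain lookup table. The enum members and the scheduling-system names are assumptions made for the example; the application does not prescribe them.

```python
from enum import Enum


class TaskType(Enum):
    DEBUG = "debug"
    WORK = "work"
    TEST = "test"


# Hypothetical scheduling-system names; exactly one per task type.
SCHEDULER_BY_TYPE = {
    TaskType.DEBUG: "debug_scheduler",
    TaskType.WORK: "work_scheduler",
    TaskType.TEST: "test_scheduler",
}


def select_scheduler(task_type: TaskType) -> str:
    """Return the scheduling system that handles tasks of the given type."""
    return SCHEDULER_BY_TYPE[task_type]


target_scheduler = select_scheduler(TaskType.DEBUG)
```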

S103, scheduling the target data processing task to a target cluster resource of the target cluster based on the target scheduling system so as to execute the target data processing task through the target cluster resource.

In some possible embodiments, the target data processing task is scheduled to a target cluster resource of the target cluster based on the target scheduling system, so as to execute the target data processing task through the target cluster resource. In an optional embodiment of the present application, when at least two scheduling systems are built for the data processing tasks of different task types, the scheduling system identifier of each of the at least two scheduling systems is associated with a cluster resource in the target cluster, where the scheduling system identifier is used to mark the cluster resource in the target cluster associated with that scheduling system, and the scheduling system identifier of one scheduling system is associated with one cluster resource in the target cluster. In an optional embodiment of the present application, when corresponding scheduling systems are built for different types of data processing tasks, relevant information of a scheduling system (for example, its scheduling system identifier) may be associated with a target cluster resource in the big data cluster in the form of parameters (for example, in code deployment, dev.dev_mr may represent a queue resource parameter, ADS_blind may represent an HDFS file resource parameter, and so on), so that when a target data processing task request is subsequently received, the target cluster resource in the target cluster can be determined according to the scheduling system identifier of the target scheduling system, and the target data processing task can be executed using the target cluster resource. 
Optionally, if the target data processing task is a work task, the work task scheduling system, by configuring the resource parameters of the target cluster resource used for executing the work task, schedules the work task to the target cluster resource corresponding to the work task in the target cluster (for example, an HDFS file for storing data related to the work task and a queue for executing the work task), and the target data processing task is executed in that target cluster resource.

S104, storing response data obtained by executing the target data processing task to a target database through the target cluster resource.

In some possible embodiments, the response data obtained by executing the target data processing task is stored to the target database through the target cluster resource. In an optional embodiment of the present application, the data processing tasks may include different types of data reflow tasks (for example, a work task for user portrait data reflow), and by executing the reflow task, the response data obtained by the target data processing task executed on the target cluster resource is stored in a target database (for example, a MySQL database). After the target data processing task is executed, the response data obtained by executing it is stored, through a queue resource in the target cluster resource, to the target database in the target cluster resource, so that a subsequent data user can obtain the response data from the corresponding target database. It can be understood that the target databases in the target cluster resources may include a target database corresponding to debugging tasks, a target database corresponding to work tasks, and a target database corresponding to test tasks, and different data users may obtain the corresponding response data from different target databases. Optionally, if the data user is merely an application scenario such as report analysis, the data user (for example, a report) may directly use the response data produced by executing the target data processing task in the different queue resources, and the response data does not need to flow back to the target database; this may be determined according to the actual scenario and is not limited here. If the data user is an Online Transaction Processing (OLTP) process, the response data obtained by executing the target data processing task may be stored in the corresponding target database for the OLTP process to query. 
This is because the relevant data in the Hadoop big data cluster is stored in Hive tables in HDFS files, which are not suitable for queries by an OLTP system; therefore, the response data obtained by executing the data processing task needs to be stored in the corresponding target database. Hive is a data warehouse tool based on the Hadoop big data cluster; it can map structured data files to database tables and provides functions such as storage, query, and analysis.
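The reflow decision described above can be sketched as follows. This is only an illustration under stated assumptions: Python's built-in `sqlite3` stands in for the target database (the text mentions MySQL as one example), and the function name `reflow`, the table `user_amount`, and the consumer labels `"oltp"` and `"report"` are hypothetical.

```python
import sqlite3


def reflow(rows, consumer, conn):
    """Write result rows back to the target database only for OLTP consumers.

    Returns the number of rows written.
    """
    if consumer != "oltp":
        # Report-style consumers read the result where it was produced.
        return 0
    conn.execute(
        "CREATE TABLE IF NOT EXISTS user_amount (user_id TEXT PRIMARY KEY, amt REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO user_amount VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


# In-memory database used as a stand-in for the target database.
conn = sqlite3.connect(":memory:")
result_rows = [("u1", 10.0), ("u2", 25.5)]
skipped = reflow(result_rows, "report", conn)
written = reflow(result_rows, "oltp", conn)
stored = conn.execute("SELECT COUNT(*) FROM user_amount").fetchone()[0]
```

Keeping a separate target database per task type, as the text suggests, would amount to passing a different connection for debugging, work, and test tasks.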

In the application, at least two data processing tasks are obtained, and a scheduling system for scheduling each data processing task is determined based on that data processing task, where the at least two data processing tasks include data processing tasks of at least two task types, a data processing task of one task type corresponds to one scheduling system, one scheduling system is associated with one cluster resource in a target cluster, and the cluster resource associated with a scheduling system is used for executing data processing tasks of one task type. When a processing request of a target data processing task of the at least two data processing tasks is received, a target scheduling system for scheduling the target data processing task is determined according to the task type of the target data processing task. The target data processing task is scheduled to a target cluster resource of the target cluster based on the target scheduling system, so as to execute the target data processing task through the target cluster resource. Finally, response data obtained by executing the target data processing task is stored to a target database through the target cluster resource. With this scheme, cluster resources can be partitioned out of one big data cluster, and for different data processing tasks the corresponding target scheduling system schedules the target data processing task to the target cluster resource for execution. A single set of code can therefore execute different data processing tasks (debugging, work, test, and the like) in a single big data cluster, which reduces the development and maintenance cost of the big data cluster; the operation is simple and the applicability is high.

In some possible embodiments, please also refer to fig. 4, which is another schematic flow chart of the data processing method based on a single cluster according to the embodiment of the present application. The method may be executed by a service server (for example, the service server 1000 shown in fig. 1), or may be executed jointly by a backend server and a service server (for example, the service server 1000 and the backend server 100a in the embodiments corresponding to fig. 1 or fig. 2). For ease of understanding, this embodiment is described by taking as an example the case where the method is executed by the service server. The data processing method includes at least the following steps S201 to S205:

S201, at least two cluster resources are determined from the target cluster, and the association between each cluster resource of the at least two cluster resources and at least two scheduling systems is established.

In some possible embodiments, at least two cluster resources are determined from the target cluster, and an association between each of the at least two cluster resources and at least two scheduling systems is established, where one cluster resource is associated with one scheduling system and one cluster resource is used to execute one data processing task. For example, a first target cluster resource used for executing a work task in a data processing task, a second target cluster resource used for executing a debugging task in the data processing task, and the like may be determined from a target cluster, where the first target cluster resource may include an HDFS file used for storing data related to the work task and a queue used for executing the work task, and the second target cluster resource may include an HDFS file used for storing data related to the debugging task and a queue used for executing the debugging task. Optionally, when actually allocating cluster resources in the target cluster, more cluster resources may be allocated to work tasks in the data processing tasks, and a higher task priority may be used to ensure normal operation of the work tasks, because the number of the work tasks is greater in most cases, which has a greater impact on the accuracy of the data. In addition, when the association between each cluster resource of the at least two cluster resources and the at least two scheduling systems is established, the scheduling system identifier of each scheduling system may be associated with each cluster resource of the target cluster based on the scheduling system identifier of each scheduling system of the at least two scheduling systems, where the scheduling system identifier of one scheduling system is associated with one cluster resource of the target cluster. 
In an optional embodiment of the present application, when corresponding scheduling systems are built for different types of data processing tasks, relevant information of a scheduling system (for example, its scheduling system identifier) may be associated with a target cluster resource in the big data cluster in the form of parameters (for example, in code deployment, dev.dev_mr may represent a queue resource parameter, ADS_blind may represent an HDFS file resource parameter, and so on), so that when a target data processing task request is subsequently received, the target cluster resource in the target cluster can be determined according to the scheduling system identifier of the target scheduling system, and the target data processing task can be executed using the target cluster resource.
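One way to picture the association established in step S201 is a small registry keyed by scheduling system identifier, sketched below. The class names, the scheduler identifiers, and the parameter values `prd.prd_mr` and `ADS_GREEN` are assumptions for illustration; `dev.dev_mr` and `ADS_BLUE` are borrowed from the examples in this description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterResource:
    queue: str    # computing resource parameter (queue name)
    hdfs_db: str  # storage resource parameter (HDFS-backed database)


class ResourceRegistry:
    """Keeps a one-to-one association between scheduler IDs and cluster resources."""

    def __init__(self) -> None:
        self._by_scheduler = {}

    def associate(self, scheduler_id: str, resource: ClusterResource) -> None:
        # Enforce "one scheduling system, one cluster resource" in both directions.
        if scheduler_id in self._by_scheduler:
            raise ValueError(f"{scheduler_id} already has a cluster resource")
        if resource in self._by_scheduler.values():
            raise ValueError(f"{resource} is already associated")
        self._by_scheduler[scheduler_id] = resource

    def resolve(self, scheduler_id: str) -> ClusterResource:
        """Look up the cluster resource for a scheduling system identifier."""
        return self._by_scheduler[scheduler_id]


registry = ResourceRegistry()
registry.associate("debug_scheduler", ClusterResource("dev.dev_mr", "ADS_BLUE"))
# "prd.prd_mr" and "ADS_GREEN" are hypothetical values for the work environment.
registry.associate("work_scheduler", ClusterResource("prd.prd_mr", "ADS_GREEN"))
blue = registry.resolve("debug_scheduler")
```

Rejecting a second association for the same identifier or the same resource is a design choice of the sketch that mirrors the one-to-one wording of the text.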

S202, at least two data processing tasks are obtained, and a scheduling system for scheduling the data processing tasks is determined based on the data processing tasks.

The specific implementation of step S202 may refer to the description of step S101 in the embodiment corresponding to fig. 3, which will not be described herein again.

S203, when a processing request of a target data processing task in at least two data processing tasks is received, determining a target scheduling system for scheduling the target data processing task according to the task type of the target data processing task.

In some feasible embodiments, the processing request of the target data processing task may further carry a task identifier of the target data processing task, so that when a processing request of a target data processing task of at least two data processing tasks is received, a task type of the target data processing task may be determined based on the target task identifier carried in the processing request of the target data processing task, where the task type of the target data processing task includes one of debugging, working, or testing. And determining a target scheduling system for scheduling the target data processing task according to the task type of the target data processing task. For example, if the task type of the target data processing task is a debugging task, a debugging task scheduling system corresponding to the debugging task is determined as the scheduling system of the target data processing task.
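The text does not specify how the task type is encoded in the task identifier. The sketch below assumes, purely for illustration, an identifier of the form `<type>:<name>` carried in a request dictionary under the hypothetical key `task_id`.

```python
VALID_TYPES = {"debug", "work", "test"}


def task_type_from_request(request: dict) -> str:
    """Derive the task type from the task identifier carried in the request.

    Assumes the hypothetical identifier format "<type>:<name>".
    """
    task_id = request["task_id"]
    prefix, sep, _name = task_id.partition(":")
    if not sep or prefix not in VALID_TYPES:
        raise ValueError(f"cannot determine task type from {task_id!r}")
    return prefix


request = {"task_id": "debug:user_portrait_clean"}
task_type = task_type_from_request(request)
```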

S204, scheduling the target data processing task to the target cluster resource of the target cluster based on the target scheduling system, so as to execute the target data processing task through the target cluster resource.

In some possible embodiments, a target cluster resource in the target cluster associated with the target scheduling system may be determined based on the scheduling system identifier of the target scheduling system, and a resource parameter corresponding to the target cluster resource may be determined based on the target cluster resource. The resource parameter corresponding to the target cluster resource may then be configured into the target scheduling system, so as to schedule the target data processing task to the target cluster resource of the target cluster based on the target scheduling system. In an optional embodiment of the present application, corresponding scheduling systems may be built for different types of data processing tasks, and relevant information of each scheduling system (for example, its scheduling system identifier) may be associated with a target cluster resource in the big data cluster in the form of parameters. After a target data processing task request is received, the target scheduling system and its scheduling system identifier may be determined based on the definition information in the target task identifier carried in the request, and the resource parameter of the associated target cluster resource of the target cluster may then be determined based on that scheduling system identifier. For example, when the obtained data processing tasks are a work task and a debugging task, the target cluster resource for executing the work task may be referred to as the green environment, and the target cluster resource for executing the debugging task may be referred to as the blue environment. 
At this time, if the definition information of the target task identifier is "insert overwrite table ${ADS_DB}.TEST select user_id, sum(amount) amt from ${DWD_DB}.TEST group by user_id", where "TEST" indicates that the target data processing task is a debugging task, it may be determined that the corresponding target scheduling system is the scheduling system corresponding to the debugging task, together with its scheduling system identifier. The resource parameter of the associated target cluster resource can then be determined for "${ADS_DB}" from the scheduling system identifier of that scheduling system, where the "$" symbol marks the content in the braces as something to be configured as its associated resource parameter. If the resource parameter associated with "${ADS_DB}" is "ADS_BLUE", where "ADS_BLUE" denotes the blue environment in the target cluster, the definition information is replaced with "insert overwrite table ADS_BLUE.TEST select user_id, sum(amount) amt from ADS_BLUE.TEST group by user_id", so that the target data processing task (for example, a debugging task) is executed through the target cluster resource (for example, the blue environment).
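The parameter substitution in this example can be sketched with Python's standard `string.Template`, whose `${name}` placeholder syntax happens to match the notation used here. The statement below is a cleaned-up reading of the definition information in the example; the value `DWD_BLUE` bound to `DWD_DB` is an assumption, since the text only states the value associated with `ADS_DB`.

```python
from string import Template

# Definition information with environment placeholders.
definition = Template(
    "insert overwrite table ${ADS_DB}.TEST "
    "select user_id, sum(amount) amt from ${DWD_DB}.TEST group by user_id"
)

# Resource parameters of the blue (debugging) environment.
# "ADS_BLUE" comes from the example above; "DWD_BLUE" is a hypothetical value.
BLUE_PARAMS = {"ADS_DB": "ADS_BLUE", "DWD_DB": "DWD_BLUE"}

# substitute() raises KeyError if a placeholder has no configured parameter,
# which surfaces a missing association instead of running a broken statement.
statement = definition.substitute(BLUE_PARAMS)
```

Because the same definition can be rendered with a different parameter set for the green environment, one piece of task code can serve both environments.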

S205, response data obtained by executing the target data processing task is stored in a target database through the target cluster resources.

The specific implementation of step S205 may refer to the description of step S104 in the embodiment corresponding to fig. 3, which will not be described herein again.

In the application, at least two cluster resources are determined from the target cluster, and the association between each of the at least two cluster resources and at least two scheduling systems is established. When this association is established, the scheduling system identifier of each of the at least two scheduling systems may be associated with a cluster resource of the target cluster, where the scheduling system identifier of one scheduling system is associated with one cluster resource of the target cluster. When a processing request of a target data processing task of at least two data processing tasks is received, the task type of the target data processing task is determined based on the target task identifier carried in the processing request, and a target scheduling system for scheduling the target data processing task is determined according to that task type. A target cluster resource in the target cluster associated with the target scheduling system is then determined based on the scheduling system identifier of the target scheduling system, and a resource parameter corresponding to the target cluster resource is determined based on the target cluster resource. 
With this scheme, cluster resources can be partitioned out of one big data cluster, and for different data processing tasks the corresponding target scheduling system schedules the target data processing task to the target cluster resource for execution. A single set of code can therefore execute different data processing tasks (debugging, work, test, and the like) in a single big data cluster, which reduces the development and maintenance cost of the big data cluster; the operation is simple and the applicability is high.

Further, please refer to fig. 5, which is a schematic structural diagram of a data processing apparatus based on a single cluster according to the present application. The data processing apparatus may be a computer program (including program code) running on a computer device, for example, application software; the apparatus may be used to perform the corresponding steps in the methods provided herein. As shown in fig. 5, the data processing apparatus includes: an obtaining module 10, a first determining module 20, a first scheduling module 30, and a first storage module 40.

An obtaining module 10, configured to obtain at least two data processing tasks, and determine, based on each data processing task, a scheduling system for scheduling each data processing task, where the at least two data processing tasks include data processing tasks of at least two task types, a data processing task of one task type corresponds to one scheduling system, and one scheduling system is associated with one cluster resource in a target cluster, where the cluster resource is used to execute a data processing task;

a first determining module 20, configured to, when a processing request of a target data processing task of the at least two data processing tasks is received, determine a task type of the target data processing task based on the processing request, and determine a target scheduling system for scheduling the target data processing task according to the task type of the target data processing task;

a first scheduling module 30, configured to schedule the target data processing task to a target cluster resource of the target cluster based on the target scheduling system, so as to execute the target data processing task through the target cluster resource;

the first storage module 40 is configured to store, to the target database, response data obtained by executing the target data processing task through the target cluster resource.

In a possible implementation, referring to fig. 6, the apparatus further includes:

a second determining module 50, configured to determine at least two cluster resources from the target cluster, and establish an association between each of the at least two cluster resources and at least two scheduling systems, where one cluster resource is associated with one scheduling system.

an association module 60, configured to associate, based on a scheduling system identifier of each scheduling system of the at least two scheduling systems, the scheduling system identifier of each scheduling system with each cluster resource in the target cluster to establish an association between each cluster resource and each scheduling system;

In a possible implementation manner, the processing request carries a target task identifier of the target data processing task; the above apparatus further includes:

a third determining module 70, configured to determine a task type of the target data processing task based on a target task identifier carried in the processing request, where the task type of the target data processing task includes one of debugging, working, and testing.

In a possible embodiment, the above apparatus further comprises:

a fourth determining module 80, configured to determine, based on the scheduling system identifier of the target scheduling system, a resource parameter corresponding to a target cluster resource in the target cluster;

the first scheduling module 30 is further configured to:

In a possible implementation, the fourth determining module 80 is further configured to:

For specific implementation manners of the obtaining module 10, the first determining module 20, the first scheduling module 30, and the first storing module 40, reference may be made to the description of steps S101 to S104 in the embodiment corresponding to fig. 3, and details will not be further described here. In addition, the beneficial effects of the same method are not described in detail.

Further, please refer to fig. 7, which is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 7, the apparatus in the embodiment corresponding to fig. 5 may be applied to the computer device 2000, where the computer device 2000 may include: at least one processor 2001 (for example, a CPU), at least one network interface 2003, a memory 2004, and at least one communication bus 2002. The communication bus 2002 is used to implement connection and communication between these components. The network interface 2003 may optionally include a standard wired interface and a wireless interface (for example, a WI-FI interface). The memory 2004 may be a Random Access Memory (RAM) or a non-volatile memory, such as at least one disk memory. The memory 2004 may optionally also be at least one storage device located remotely from the aforementioned processor 2001. As shown in fig. 7, the memory 2004, which is a type of computer storage medium, may include an operating system, a network communication module, and a device control application program.

It should be understood that the computer device 2000 described in this embodiment of the present application may perform the description of the embodiment corresponding to fig. 3 and/or fig. 4, and may also perform the description of the data processing apparatus in the embodiment corresponding to fig. 5 and/or fig. 6, which is not described herein again. In addition, the beneficial effects of the same method are not described in detail.

Further, it should be noted that the present application also provides a computer-readable storage medium. The computer-readable storage medium stores the aforementioned computer program executed by the single cluster-based data processing apparatus, and the computer program includes program instructions. When a processor executes the program instructions, the single cluster-based data processing method described in the embodiment corresponding to fig. 3 and/or fig. 4 can be performed, so details are not described here again. In addition, the beneficial effects of the same method are not described again. For technical details not disclosed in the embodiments of the computer-readable storage medium of the present application, reference is made to the description of the method embodiments of the present application. By way of example, the program instructions may be deployed to be executed on one computing device, or on multiple computing devices located at one site.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium and which, when executed, can include the processes of the method embodiments described above. The computer-readable storage medium may be an internal storage unit of the single cluster-based data processing apparatus provided in any of the foregoing embodiments or of the foregoing device, such as a hard disk or a memory of an electronic device. The computer-readable storage medium may also be an external storage device of the electronic device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the electronic device. The computer-readable storage medium may further include a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the electronic device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the electronic device. The computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.

The terms "first", "second", and the like in the claims, in the description and in the drawings of the present invention are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus. Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the invention. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments. The term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.

Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both, and that, to clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functions. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The above disclosure is only intended to illustrate the preferred embodiments of the present application and is not to be construed as limiting its scope; accordingly, all equivalent variations and modifications made to the present application still fall within its scope.

Claims

1. A data processing method based on a single cluster is characterized by comprising the following steps:

obtaining at least two data processing tasks, and determining, based on each data processing task, a scheduling system for scheduling that data processing task, wherein the at least two data processing tasks comprise data processing tasks of at least two task types, the data processing tasks of one task type correspond to one scheduling system, one scheduling system is associated with one cluster resource in a target cluster, and the cluster resource is used for executing the data processing tasks;

scheduling the target data processing task to a target cluster resource of the target cluster based on the target scheduling system to execute the target data processing task through the target cluster resource;

and storing response data obtained by executing the target data processing task to a target database through the target cluster resource.

2. The method of claim 1, wherein prior to said obtaining at least two data processing tasks, the method further comprises:

3. The method of claim 2, wherein the establishing the association between each of the at least two cluster resources and at least two scheduling systems comprises:

associating, based on the scheduling system identifier of each scheduling system of the at least two scheduling systems, the scheduling system identifier of each scheduling system with each cluster resource in the target cluster, so as to establish the association between each cluster resource and each scheduling system;

wherein a scheduling system identification of a scheduling system is associated with a cluster resource in the target cluster.

4. The method according to claim 3, wherein the processing request carries a target task identifier of the target data processing task; the determining a task type of the target data processing task based on the processing request comprises:

5. The method of claim 4, wherein said scheduling the target data processing task into the target cluster resource of the target cluster based on the target scheduling system comprises:

and configuring the resource parameters corresponding to the target cluster resources into the target scheduling system so as to schedule the target data processing tasks to the target cluster resources of the target cluster based on the target scheduling system.

6. The method of claim 5, wherein the determining the resource parameter corresponding to the target cluster resource in the target cluster based on the scheduling system identifier of the target scheduling system comprises:

determining target cluster resources in a target cluster associated with the target scheduling system based on the scheduling system identifier of the target scheduling system, and determining resource parameters corresponding to the target cluster resources based on the target cluster resources in the target cluster;

wherein the target cluster resources comprise storage resources and/or computing resources.

7. A data processing apparatus, characterized in that the apparatus comprises:

the first determining module is used for determining a target scheduling system for scheduling a target data processing task according to the task type of the target data processing task when a processing request of the target data processing task in the at least two data processing tasks is received;

and the first storage module is used for storing response data obtained by executing the target data processing task to a target database through the target cluster resource.

8. The apparatus of claim 7, further comprising:

9. A computer device, comprising: a processor, a memory, and a network interface;

the memory is configured to store program code and the processor is configured to invoke the program code to perform the method of any of claims 1-6.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, perform the method of any of claims 1-6.