CN114327820A - Processing method and device for offline tasks - Google Patents

Processing method and device for offline tasks

Info

Publication number
CN114327820A
Authority
CN
China
Prior art keywords
data
historical
channel data
offline
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111609393.2A
Other languages
Chinese (zh)
Inventor
宋东瑞
白杰
白会杰
姚鑫
苏宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Zhenxuan Data Information Technology Co ltd
Original Assignee
Suzhou Zhenxuan Data Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Zhenxuan Data Information Technology Co ltd
Priority to CN202111609393.2A
Publication of CN114327820A

Landscapes

  • Stored Programmes (AREA)

Abstract

The invention provides a method and a device for processing an offline task, wherein the method comprises the following steps: acquiring historical channel data in a preset time period, dividing the historical channel data by using the computing engine Spark, and then generating and arranging offline tasks for the historical channel data; and performing task scheduling on the offline tasks by using the batch scheduler Azkaban. The method and the device solve the problem that task arrangement cannot be carried out in the prior art.

Description

Processing method and device for offline tasks
Technical Field
The invention relates to the field of data processing, in particular to a method and a device for processing an offline task.
Background
As cloud technology matures, more and more enterprise-level applications are moving to the cloud, and batch processing tasks running in the cloud increasingly need dependency-aware scheduling. A scheduling system is therefore needed to schedule cloud jobs, but existing cloud task scheduling products only provide task scheduling and lack task arrangement capability.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The invention mainly aims to provide a method and a device for processing an offline task, so as to solve the problem that task arrangement cannot be performed in the prior art.
In order to achieve the above object, according to an aspect of the present invention, there is provided an offline task processing method, including: acquiring historical channel data in a preset time period, dividing the historical channel data by adopting a calculation engine Spark, and then generating and arranging an offline task of the historical channel data; and performing task scheduling on the offline task by adopting a batch scheduler Azkaban.
Optionally, before obtaining the historical channel data within the preset time period, the method further includes: locally configuring the running environment of the computing engine Spark, and configuring the master node of the computing engine Spark into yarn mode.
Optionally, before obtaining the historical channel data within the preset time period, the method further includes: installing the JAVA language software development kit (JDK) and then configuring JAVA environment variables, wherein the JAVA environment variables include the JAVA_HOME configuration of JAVA and a CLASSPATH file for recording all information of a project compilation environment, the information including: a source file path, a compiled class file storage path, a dependent JAR package path, running container information and dependent external project information; and configuring environment variables of the distributed system infrastructure HADOOP, wherein configuring the environment variables of HADOOP comprises the HADOOP root directory configuration (HADOOP_HOME), the HADOOP_CONF_DIR configuration, the PATH configuration, updating the profile file with the source command, and an environment variable validation test, and HADOOP depends on the JDK to run.
Optionally, the obtaining of the historical channel data in the preset time period includes: reading channel data from a distributed file system hdfs of the distributed system infrastructure HADOOP by using a textFile method, or reading the channel data from a data source according to a configured data source address by using the textFile method; and acquiring the generation time of all the read data, and acquiring the historical channel data with the generation time within the last week.
Optionally, obtaining the historical channel data within the preset time period, dividing the historical channel data by using the computing engine Spark, and then generating and arranging the offline tasks of the historical channel data includes: dividing the historical channel data into a plurality of batches of data by using the computing engine Spark, and then generating and arranging the offline tasks of the historical channel data.
Optionally, dividing the historical channel data into a plurality of batches of data by using the computing engine Spark and generating the offline tasks of the historical channel data includes one of the following: dividing the historical channel data by time period using Spark to obtain multiple sets of historical data, creating multiple offline tasks for the multiple sets of historical data, and then establishing associations among the offline tasks, wherein each set of historical data obtained by the division corresponds to one of multiple time periods, the time periods corresponding to any two sets of historical data are different, each of the multiple offline tasks is used for processing one set of historical data, and the historical data processed by any two offline tasks are different; dividing the historical channel data by data type using Spark to obtain multiple sets of historical data, creating multiple offline tasks for the multiple sets of historical data, and then establishing associations among the offline tasks, wherein each set of historical data obtained by the division corresponds to one of multiple data types, and the data types corresponding to any two sets of historical data are different; dividing the historical channel data by data volume using Spark to obtain multiple sets of historical data of equal data volume, creating multiple offline tasks for the multiple sets of historical data, and then establishing associations among the offline tasks; dividing the historical channel data by data source using Spark to obtain multiple sets of historical data, creating multiple offline tasks for the multiple sets of historical data, and then establishing associations among the offline tasks, wherein each set of historical data obtained by the division corresponds to one of multiple data sources, and the data sources corresponding to any two sets of historical data are different; and randomly dividing the historical channel data using Spark to obtain multiple sets of historical data, creating multiple offline tasks for the multiple sets of historical data, and then establishing associations among the offline tasks.
In order to achieve the above object, according to one aspect of the present invention, there is provided an offline task processing apparatus including: the task dividing unit is used for acquiring historical channel data in a preset time period, dividing the historical channel data by adopting a calculation engine Spark, and then generating and arranging an offline task of the historical channel data; and the task scheduling unit is used for performing task scheduling on the offline task by adopting a batch scheduler Azkaban.
Optionally, the apparatus further comprises: the configuration unit is configured to configure the running environment of the compute engine Spark locally before acquiring the historical channel data within a preset time period, and configure the master node of the compute engine Spark into a yarn mode.
Optionally, the configuration unit is further configured to, before the historical channel data within the preset time period is obtained, install the JAVA language software development kit (JDK) and then configure JAVA environment variables, wherein the JAVA environment variables include the JAVA_HOME configuration of JAVA and a CLASSPATH file for recording all information of the project compilation environment, the information including: a source file path, a compiled class file storage path, a dependent JAR package path, running container information and dependent external project information; and to configure environment variables of the distributed system infrastructure HADOOP, wherein configuring the environment variables of HADOOP comprises the HADOOP root directory configuration (HADOOP_HOME), the HADOOP_CONF_DIR configuration, the PATH configuration, updating the profile file with the source command, and an environment variable validation test, and HADOOP depends on the JDK to run.
Optionally, the task dividing unit is further configured to: reading channel data from a distributed file system hdfs of the distributed system infrastructure HADOOP by using a textFile method, or reading the channel data from a data source according to a configured data source address by using the textFile method; and acquiring the generation time of all the read data, and acquiring the historical channel data with the generation time within the last week.
Optionally, the task dividing unit is further configured to: and dividing the historical channel data into a plurality of batches of data by adopting a computing engine Spark, and then generating and arranging off-line tasks of the historical channel data.
Optionally, the task dividing unit is further configured to perform one of the following: dividing the historical channel data by time period using the computing engine Spark to obtain multiple sets of historical data, creating multiple offline tasks for the multiple sets of historical data, and then establishing associations among the offline tasks, wherein each set of historical data obtained by the division corresponds to one of multiple time periods, the time periods corresponding to any two sets of historical data are different, each of the multiple offline tasks is used for processing one set of historical data, and the historical data processed by any two offline tasks are different; dividing the historical channel data by data type using Spark to obtain multiple sets of historical data, creating multiple offline tasks for the multiple sets of historical data, and then establishing associations among the offline tasks, wherein each set of historical data obtained by the division corresponds to one of multiple data types, and the data types corresponding to any two sets of historical data are different; dividing the historical channel data by data volume using Spark to obtain multiple sets of historical data of equal data volume, creating multiple offline tasks for the multiple sets of historical data, and then establishing associations among the offline tasks; dividing the historical channel data by data source using Spark to obtain multiple sets of historical data, creating multiple offline tasks for the multiple sets of historical data, and then establishing associations among the offline tasks, wherein each set of historical data obtained by the division corresponds to one of multiple data sources, and the data sources corresponding to any two sets of historical data are different; and randomly dividing the historical channel data using Spark to obtain multiple sets of historical data, creating multiple offline tasks for the multiple sets of historical data, and then establishing associations among the offline tasks.
According to another aspect of the embodiments of the present application, there is also provided an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the above method through the computer program.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the steps of any of the embodiments of the method described above.
By applying the technical scheme of the invention, historical channel data in a preset time period are obtained, a calculation engine Spark is adopted to divide the historical channel data, and then an off-line task of the historical channel data is generated and arranged; the batch processing scheduler Azkaban is adopted to carry out task scheduling on the offline tasks, the generation, the arrangement and the scheduling of the tasks can be automatically finished, and the problem that the tasks cannot be arranged in the related technology can be solved.
In addition to the objects, features and advantages described above, other objects, features and advantages of the present invention are also provided. The present invention will be described in further detail below with reference to the drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention and not to limit it. In the drawings:
FIG. 1 is a flow chart of a method of processing an offline task according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an alternative crontab scheme in accordance with an embodiment of the present application;
FIG. 3 is a schematic diagram of an alternative task creation according to an embodiment of the present application; and
FIG. 4 is a schematic diagram of an alternative processing apparatus for offline tasks according to an embodiment of the present application.
Detailed Description
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances for describing embodiments of the invention herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
According to an aspect of an embodiment of the present application, an embodiment of a method for processing an offline task is provided. As shown in fig. 1, the method comprises the following steps:
Step S11, obtaining historical channel data within a preset time period, dividing the historical channel data by using the computing engine Spark, and then generating and arranging offline tasks of the historical channel data.
Optionally, before obtaining the historical channel data within the preset time period, the method further includes: locally configuring the running environment of the computing engine Spark, and configuring the master node of the computing engine Spark into yarn mode.
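As a non-limiting illustration of selecting yarn as the Spark master, the following Python sketch assembles the argument list of a spark-submit call. The function name build_spark_submit, the jar path and the default deploy mode are assumptions made for this example only; --master and --deploy-mode are standard spark-submit options.

```python
import shlex


def build_spark_submit(jar_path, master="yarn", deploy_mode="client", app_args=()):
    """Return the argument list for a spark-submit call.

    The helper name and defaults are illustrative assumptions; only the
    --master and --deploy-mode options are standard spark-submit flags.
    """
    command = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode, jar_path]
    command.extend(app_args)
    return command


# Hypothetical jar path, used only to show the resulting command line.
example_command = build_spark_submit("/path/to/offline-task.jar")
print(shlex.join(example_command))
```

The list form can be handed to a process launcher directly, which avoids quoting problems when paths contain spaces.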
Optionally, before obtaining the historical channel data within the preset time period, the method further includes: installing the JAVA language software development kit (JDK) and then configuring JAVA environment variables, wherein the JAVA environment variables include the JAVA_HOME configuration of JAVA and a CLASSPATH file for recording all information of a project compilation environment, the information including: a source file path, a compiled class file storage path, a dependent JAR package path, running container information and dependent external project information; and configuring environment variables of the distributed system infrastructure HADOOP, wherein configuring the environment variables of HADOOP comprises the HADOOP root directory configuration (HADOOP_HOME), the HADOOP_CONF_DIR configuration, the PATH configuration, updating the profile file with the source command, and an environment variable validation test, and HADOOP depends on the JDK to run.
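The environment variable validation test mentioned above may, for example, be sketched as follows. The function check_environment and the rule that PATH should contain the bin directories of JAVA_HOME and HADOOP_HOME are assumptions of this sketch, not requirements stated in the embodiment.

```python
import os

# Variables named in the embodiment; the check itself is an illustrative assumption.
REQUIRED_VARS = ("JAVA_HOME", "HADOOP_HOME", "HADOOP_CONF_DIR")


def check_environment(env):
    """Return a list of problems found in env, a mapping of variable names to values."""
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    path_entries = env.get("PATH", "").split(os.pathsep)
    for home in ("JAVA_HOME", "HADOOP_HOME"):
        root = env.get(home)
        # Assumed convention: the bin directory of each home is listed in PATH.
        if root and os.path.join(root, "bin") not in path_entries:
            problems.append(f"PATH does not contain {home}/bin")
    return problems
```

Calling check_environment(os.environ) after the profile file has been reloaded returns an empty list when all of the assumed conditions hold.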
Optionally, the obtaining of the historical channel data in the preset time period includes: reading channel data from a distributed file system hdfs of the distributed system infrastructure HADOOP by using a textFile method, or reading the channel data from a data source according to a configured data source address by using the textFile method; and acquiring the generation time of all the read data, and acquiring the historical channel data with the generation time within the last week.
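The selection of data generated within the last week can be illustrated with the following plain-Python sketch. It does not use the Spark API; the record layout (a dictionary with a generated_at timestamp) and the function name within_last_week are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def within_last_week(records, now, days=7):
    """Keep the records whose generation time lies within the last `days` days.

    Each record is assumed to be a dict holding a datetime under "generated_at".
    """
    cutoff = now - timedelta(days=days)
    return [record for record in records if cutoff <= record["generated_at"] <= now]
```

Passing the reference time explicitly, rather than reading the clock inside the function, keeps the preset time period reproducible when a task is rerun.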
Optionally, obtaining the historical channel data within the preset time period, dividing the historical channel data by using the computing engine Spark, and then generating and arranging the offline tasks of the historical channel data includes: dividing the historical channel data into a plurality of batches of data by using the computing engine Spark, and then generating and arranging the offline tasks of the historical channel data.
Optionally, dividing the historical channel data into a plurality of batches of data by using the computing engine Spark and generating the offline tasks of the historical channel data includes one of the following: dividing the historical channel data by time period using Spark to obtain multiple sets of historical data, creating multiple offline tasks for the multiple sets of historical data, and then establishing associations among the offline tasks, wherein each set of historical data obtained by the division corresponds to one of multiple time periods, the time periods corresponding to any two sets of historical data are different, each of the multiple offline tasks is used for processing one set of historical data, and the historical data processed by any two offline tasks are different; dividing the historical channel data by data type using Spark to obtain multiple sets of historical data, creating multiple offline tasks for the multiple sets of historical data, and then establishing associations among the offline tasks, wherein each set of historical data obtained by the division corresponds to one of multiple data types, and the data types corresponding to any two sets of historical data are different; dividing the historical channel data by data volume using Spark to obtain multiple sets of historical data of equal data volume, creating multiple offline tasks for the multiple sets of historical data, and then establishing associations among the offline tasks; dividing the historical channel data by data source using Spark to obtain multiple sets of historical data, creating multiple offline tasks for the multiple sets of historical data, and then establishing associations among the offline tasks, wherein each set of historical data obtained by the division corresponds to one of multiple data sources, and the data sources corresponding to any two sets of historical data are different; and randomly dividing the historical channel data using Spark to obtain multiple sets of historical data, creating multiple offline tasks for the multiple sets of historical data, and then establishing associations among the offline tasks.
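The division strategies listed above, and the creation of associated tasks, can be sketched in plain Python as follows. This is an illustration only and does not use the Spark API; all function names, the record layout and the choice of a simple chain as the association between tasks are assumptions of the sketch.

```python
import random
from collections import defaultdict


def split_by_key(records, key_fn):
    """Group records by a key such as time period, data type or data source."""
    groups = defaultdict(list)
    for record in records:
        groups[key_fn(record)].append(record)
    return dict(groups)


def split_by_volume(records, size):
    """Cut records into chunks of `size` items (the last chunk may be smaller)."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def split_randomly(records, parts, seed=0):
    """Shuffle a copy of the records and deal them into `parts` groups."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::parts] for i in range(parts)]


def make_chained_tasks(groups):
    """Create one task per group; each task depends on the one created before it."""
    tasks = []
    for index, name in enumerate(groups):
        depends_on = [tasks[index - 1]["name"]] if index else []
        tasks.append({"name": f"task_{name}", "depends_on": depends_on,
                      "record_count": len(groups[name])})
    return tasks
```

For example, split_by_key(records, lambda r: r["source"]) yields one group per data source, and make_chained_tasks turns those groups into tasks whose depends_on fields record the association between them.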
Step S12, performing task scheduling on the offline tasks by using the batch scheduler Azkaban.
Through the steps, historical channel data in a preset time period are obtained, the historical channel data are divided by adopting a calculation engine Spark, and then an off-line task of the historical channel data is generated and arranged; the batch processing scheduler Azkaban is adopted to carry out task scheduling on the offline tasks, the generation, the arrangement and the scheduling of the tasks can be automatically completed, and the problem that the tasks cannot be arranged in the related technology can be solved. The technical solution of the present application is further detailed below with reference to specific embodiments:
Spark is an in-memory distributed batch processing system. It divides a job into tasks and distributes the tasks to a plurality of CPUs for processing, and the intermediate products (calculation results) generated during data processing are kept in memory, which reduces I/O operations on disk and greatly improves the data processing speed. Spark is therefore suitable for scenarios such as data processing and data mining.
There are two main ways to schedule Spark offline tasks: the first is to use Linux crontab timed tasks; the second is to use a component (e.g., Azkaban) for the arrangement.
The crontab scheme is used as follows. First, install crontab: the CentOS system does not ship with crontab, and it can be installed with yum: yum install vixie-cron crontabs. Second, put the spark-submit command to be executed into a shell script, that is, create an sh file whose content is: #!/bin/bash spark-submit /usr/sdr/wbfivecon connections.jar > /usr/sdr/log1229.out. Third, edit the crontab with crontab -e, which opens a vim interface, and enter a line whose schedule fields are 1 0 * * * (meaning that the command is executed at 00:01 every day), followed by a command that loads /etc/profile and then runs the script with sh; whether the edit succeeded can be checked with the crontab -l command. /etc/profile is loaded in this scheme because the environment variables seen by crontab are not consistent with those of the system. Fourth, restart the crond service: service crond restart. The format of the crontab expression is shown in FIG. 2.
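To make the meaning of the five schedule fields concrete, the following Python sketch checks whether a given moment matches a crontab expression. It is a deliberately simplified illustration: it supports only plain numbers and the * wildcard, and it requires all five fields to match, whereas real cron also accepts ranges, lists and steps and treats the day-of-month and day-of-week fields specially. The function names are assumptions of this sketch.

```python
from datetime import datetime


def field_matches(field, value):
    """A field matches when it is the wildcard or equals the value."""
    return field == "*" or int(field) == value


def cron_matches(expression, moment):
    """Return True if `moment` matches a five-field crontab expression.

    Field order: minute, hour, day of month, month, day of week.
    """
    minute, hour, day, month, weekday = expression.split()
    # cron counts Sunday as 0, while datetime.weekday() counts Monday as 0.
    cron_weekday = (moment.weekday() + 1) % 7
    return (field_matches(minute, moment.minute)
            and field_matches(hour, moment.hour)
            and field_matches(day, moment.day)
            and field_matches(month, moment.month)
            and field_matches(weekday, cron_weekday))
```

With the expression 1 0 * * *, the function returns True for any date at 00:01 and False at every other minute of the day.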
The Azkaban scheme is used as follows. MySQL initialization: to ensure the stability and reliability of Azkaban, the two-server mode is used, for which MySQL is the recommended database; the installation of MySQL is omitted here, but MySQL must be initialized and configured before Azkaban is installed. Downloading and extracting the installation packages: create a new directory (/usr/local/azkaban), extract all tar packages into that directory, enter the /usr/local/azkaban directory, and rename the extracted directories. Configuring the Executor Server: modify the configuration file (under the conf directory); the content to be modified is the address of the Azkaban Web Server.
When the Executor Server is started, the appearance of a port file shows that the start was successful, and a success status returned after activation means that the activation was successful. Configuring the Web Server: modify the configuration file; the content to be modified is the set of executor filters, namely the number of queued tasks (StaticRemainingFlowSize), the CPU occupancy (CpuStatus) and the memory occupancy (MinimumFreeMemory). In a test environment, MinimumFreeMemory must be deleted; otherwise Azkaban considers the cluster resources insufficient and does not execute the tasks.
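The deletion of MinimumFreeMemory for a test environment can be illustrated by the following Python sketch, which rewrites the filter list in the text of a properties file. The property key azkaban.executorselector.filters is taken from commonly published Azkaban web server configurations and should be treated as an assumption here, as should the helper name drop_filter.

```python
# Assumed property key; verify it against the Azkaban version actually deployed.
FILTER_KEY = "azkaban.executorselector.filters"


def drop_filter(properties_text, filter_name):
    """Return the properties text with `filter_name` removed from the filter list."""
    output = []
    for line in properties_text.splitlines():
        key, separator, value = line.partition("=")
        if separator and key.strip() == FILTER_KEY:
            kept = [name.strip() for name in value.split(",")
                    if name.strip() and name.strip() != filter_name]
            line = f"{key.strip()}={','.join(kept)}"
        output.append(line)
    return "\n".join(output)
```

Lines other than the filter property are passed through unchanged, so the function can be applied to a whole configuration file.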
To start the Web Server, the /usr/local/azkaban/azkaban-web directory must be entered first, and the web server is then started from there. A project is created by accessing the home page of the web interface, which has four menus: Projects is the most important part, since all flows run within a project; Scheduling displays timed tasks; Executing displays currently running tasks; and History displays tasks that have run in the past.
The Projects part is mainly introduced here. First, a project is created and its name and description are filled in, as shown in FIG. 3. After the project is created, its page shows Flows (a workflow consists of a plurality of jobs), Permissions (authority management) and Project Logs. Next, jobs are created. Creating a job is simple and only requires creating one or more text files; for example, a job that prints hello, named command.job, is a simple job. A project generally does not contain only one job, and creating a plurality of jobs with dependencies is the primary purpose of adopting Azkaban. The job resource files are then packaged, and finally the project is created and the compressed package is uploaded through the Azkaban web management platform.
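The creation and packaging of job files described above can be sketched in Python as follows. The sketch writes command-type job files using the type, command and dependencies properties of the Azkaban job file format and compresses them into a zip archive for upload. The helper names, the second job after_command and the archive name flow.zip are hypothetical and used only for illustration.

```python
import tempfile
import zipfile
from pathlib import Path


def write_job(directory, name, command, dependencies=()):
    """Write <name>.job as a command-type job, optionally depending on other jobs."""
    lines = ["type=command", f"command={command}"]
    if dependencies:
        lines.append("dependencies=" + ",".join(dependencies))
    path = Path(directory) / f"{name}.job"
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


def package_jobs(job_paths, zip_path):
    """Compress the job files into one archive with flat file names."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in job_paths:
            archive.write(path, arcname=Path(path).name)
    return zip_path


def demo():
    """Build a two-job flow in a temporary directory and report its contents."""
    with tempfile.TemporaryDirectory() as tmp:
        first = write_job(tmp, "command", "echo hello")
        second = write_job(tmp, "after_command", "echo done", dependencies=["command"])
        archive_path = package_jobs([first, second], Path(tmp) / "flow.zip")
        with zipfile.ZipFile(archive_path) as archive:
            return sorted(archive.namelist()), second.read_text(encoding="utf-8")
```

In this sketch the job after_command declares command as its dependency, so it would run only after the hello job has finished.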
Spark application scenarios: Data Processing, in which data can be processed rapidly with fault tolerance and scalability; Iterative Computation, in which iterative computation is supported and complex data processing logic is handled effectively; Data Mining, in which complex mining analysis is carried out on massive data and various data mining and machine learning algorithms are supported; Streaming Processing, in which stream processing with second-level latency is supported together with various external data sources; and Query Analysis, in which query analysis in SQL is supported and a Domain Specific Language (DSL) is provided to facilitate the manipulation of structured data, together with support for a variety of external data sources.
The crontab scheme is simple and easy to implement, but it is difficult to use when a plurality of jobs have a precedence relationship (for example, the former task creates a folder after it finishes, and the latter task starts to execute only after detecting the folder), whereas Azkaban can handle the overall arrangement of a plurality of Spark tasks. Considering the performance of the product and the maintainability of the code, the present application implements Spark offline task scheduling and other functions based on Azkaban. The invention aims to use Spark to process the channel data of the past week and to use Azkaban to schedule the Spark offline tasks.
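The precedence relationship among jobs can be illustrated with a topological ordering, which is one way to derive an execution order from declared dependencies. The following Python sketch uses the standard-library graphlib module (Python 3.9 or later); it is an illustration of the idea only and makes no claim about how Azkaban resolves dependencies internally. The job names are hypothetical.

```python
from graphlib import TopologicalSorter


def execution_order(dependencies):
    """Return job names ordered so that every job follows the jobs it depends on.

    `dependencies` maps each job name to the names of the jobs it depends on.
    A dependency cycle raises graphlib.CycleError.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical three-job chain: load, then clean, then report.
example_order = execution_order({"load": [], "clean": ["load"], "report": ["clean"]})
```

Declaring the dependencies explicitly removes the need for the folder-detection workaround described for the crontab scheme.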
Step 1, carrying out different data processing for different channel data.
Step 1.1, creating a Spark (a fast and general computing engine designed for large-scale data processing) operating environment and configuring the app.
Step 1.2, reading data from the Hadoop distributed file system hdfs by using the textFile method; data may also be read from other data sources.
The file path may be written as a local file path, provided that Hadoop is installed locally, the winutils-related DLL files are placed under the bin directory, and the HADOOP_HOME environment variable is configured; an hdfs file path may also be written.
mapPartitions can process a large batch of data at a time: the data is divided into a plurality of batches, and one batch is processed at a time.
After the calculation, the data is collected into one partition, and then the save action is executed; because there is only one partition, Spark starts only one task to execute the save action, and only one file is generated.
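The batch-wise processing and the final collection into a single partition can be mimicked in plain Python as follows. The sketch is an analogy only and does not call the Spark API; the function names are assumptions chosen to echo the Spark operations they imitate.

```python
def partition(items, count):
    """Cut items into at most `count` contiguous batches of near-equal size."""
    size = max(1, -(-len(items) // count))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]


def map_partitions(partitions, batch_fn):
    """Apply batch_fn once per batch; batch_fn receives an iterator over the batch."""
    return [list(batch_fn(iter(batch))) for batch in partitions]


def coalesce_to_one(partitions):
    """Merge all batches into a single batch, so a later save writes one output."""
    return [[row for batch in partitions for row in batch]]
```

Calling the batch function once per batch, rather than once per record, is what allows per-batch setup work such as opening a connection to be shared by all records in the batch.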
Step 2, one service comprises a plurality of offline tasks, and task arrangement is performed for the service.
In the technical scheme of the application, tasks can be executed on a schedule by combining the idea of timed tasks with Spark offline tasks. Spark is used to process the channel data of the past week, and Azkaban is used to schedule the Spark offline tasks.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present application.
According to another aspect of the embodiment of the application, an apparatus for implementing the method is also provided. Fig. 4 is a schematic diagram of an alternative offline task processing apparatus according to an embodiment of the present application, and as shown in fig. 4, the apparatus may include:
the task dividing unit 41 is configured to obtain historical channel data within a preset time period, divide the historical channel data by using a calculation engine Spark, and then generate and arrange an offline task of the historical channel data; and a task scheduling unit 43, configured to perform task scheduling on the offline task by using a batch scheduler Azkaban.
Optionally, the apparatus further comprises: a configuration unit, configured to locally configure the running environment of the compute engine Spark before the historical channel data within the preset time period is acquired, and to configure the master node of the compute engine Spark into yarn mode.
Optionally, the configuration unit is further configured to, before the historical channel data within the preset time period is acquired, install the JAVA language software development kit JDK and then configure JAVA environment variables, where the JAVA environment variables include the JAVA_HOME configuration and a CLASSPATH file for recording all information of the project compilation environment, the information including: a source file path, a compiled class file storage path, a dependent JAR package path, running container information and dependent external project information; and to configure environment variables of the distributed system infrastructure HADOOP, where the environment variables of HADOOP include the root directory configuration HADOOP_HOME of HADOOP, the HADOOP_CONF_DIR configuration, the PATH configuration, source file updating and an environment variable validation test, and HADOOP depends on the JAVA language software development kit JDK to run.
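The environment variable validation test mentioned above could, for example, be a small check that the required variables are set and that the corresponding `bin` directories appear on `PATH`. The following is a minimal sketch under that assumption; the function names are illustrative and the application does not specify how the test is carried out.

```python
import os
from typing import List, Mapping

# Variables named in the description; the check itself is an assumed procedure.
REQUIRED_VARS = ("JAVA_HOME", "HADOOP_HOME", "HADOOP_CONF_DIR")


def missing_env_vars(env: Mapping[str, str]) -> List[str]:
    """Return the required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def bin_dirs_not_on_path(env: Mapping[str, str]) -> List[str]:
    """Return the JDK/Hadoop bin directories that are absent from PATH."""
    path_entries = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    expected = [
        os.path.join(env[name], "bin")
        for name in ("JAVA_HOME", "HADOOP_HOME")
        if env.get(name)
    ]
    return [d for d in expected if d not in path_entries]


if __name__ == "__main__":
    print("missing variables:", missing_env_vars(os.environ))
    print("bin directories not on PATH:", bin_dirs_not_on_path(os.environ))
```

Both functions take the environment as a mapping, so they can be run against `os.environ` on a configured node or against a plain dictionary.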
Optionally, the task dividing unit is further configured to: reading channel data from a distributed file system hdfs of the distributed system infrastructure HADOOP by using a textFile method, or reading the channel data from a data source according to a configured data source address by using the textFile method; and acquiring the generation time of all the read data, and acquiring the historical channel data with the generation time within the last week.
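The selection of data generated within the last week can be sketched as follows. To stay self-contained, the sketch operates on a plain list of text lines rather than on a Spark RDD; in Spark, the lines would come from `SparkContext.textFile` and the same predicate could be passed to `RDD.filter`. The line layout `ISO-timestamp,payload` is an assumption made for the example and is not specified in the application.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

Record = Tuple[datetime, str]


def parse_line(line: str) -> Record:
    """Parse one 'ISO-timestamp,payload' line (assumed layout)."""
    stamp, payload = line.split(",", 1)
    return datetime.fromisoformat(stamp), payload


def last_week(lines: Iterable[str], now: datetime) -> List[Record]:
    """Keep the records whose generation time lies within the 7 days before `now`."""
    cutoff = now - timedelta(days=7)
    records = (parse_line(line) for line in lines if line.strip())
    return [record for record in records if cutoff <= record[0] <= now]


if __name__ == "__main__":
    sample = [
        "2021-12-26T09:00:00,channel=a",
        "2021-12-10T09:00:00,channel=b",
    ]
    for generated_at, payload in last_week(sample, datetime(2021, 12, 27, 12, 0, 0)):
        print(generated_at.isoformat(), payload)
```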
Optionally, the task dividing unit is further configured to: and dividing the historical channel data into a plurality of batches of data by adopting a computing engine Spark, and then generating and arranging off-line tasks of the historical channel data.
Optionally, the task dividing unit is further configured to:
dividing the historical channel data according to time periods by adopting a computing engine Spark to obtain multiple sets of historical data, creating multiple offline tasks for the multiple sets of historical data, and then establishing association among the offline tasks, wherein each set of historical data obtained through division corresponds to one of the multiple time periods, the time periods corresponding to any two sets of historical data are different, each set of offline task in the multiple set of offline tasks is used for processing one set of historical data, and the historical data processed by any two sets of offline tasks are different;
dividing the historical channel data according to data types by adopting a computing engine Spark to obtain multiple sets of historical data, creating a plurality of offline tasks for the multiple sets of historical data, and then establishing association among the offline tasks, wherein each set of historical data obtained by division corresponds to one of the multiple data types, and the data types corresponding to any two sets of historical data are different;
dividing the historical channel data according to the data volume by adopting a computing engine Spark to obtain multiple pieces of historical data with the same data volume, creating multiple offline tasks for the multiple pieces of historical data, and then establishing association among the offline tasks;
dividing the historical channel data according to data sources by adopting a computing engine Spark to obtain multiple sets of historical data, creating a plurality of offline tasks for the multiple sets of historical data, and then establishing association among the offline tasks, wherein each set of historical data obtained by division corresponds to one of the multiple data sources, and the data sources corresponding to any two sets of historical data are different;
and randomly dividing the historical channel data by adopting a computing engine Spark to obtain multiple sets of historical data, creating a plurality of offline tasks for the multiple sets of historical data, and then establishing association among the offline tasks.
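The division strategies listed above (by time period, data type, data volume, data source, or at random) can be sketched in plain Python as follows. This is an illustrative sketch rather than the Spark implementation of the application: the record fields, function names and the linear chaining of tasks are assumptions, and in Spark the corresponding steps could be expressed with RDD operations such as `filter`, `groupBy` or `randomSplit`.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Sequence, TypeVar

T = TypeVar("T")


def split_by_key(records: Sequence[T], key: Callable[[T], Hashable]) -> Dict[Hashable, List[T]]:
    """Group records by a key such as time period, data type or data source."""
    groups: Dict[Hashable, List[T]] = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    return dict(groups)


def split_by_volume(records: Sequence[T], size: int) -> List[List[T]]:
    """Cut records into batches of `size` items; the last batch may be smaller."""
    return [list(records[i:i + size]) for i in range(0, len(records), size)]


def split_randomly(records: Sequence[T], parts: int, seed: int = 0) -> List[List[T]]:
    """Shuffle with a fixed seed, then deal the records into `parts` batches."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::parts] for i in range(parts)]


def chain_tasks(task_names: Sequence[str]) -> Dict[str, List[str]]:
    """Associate tasks linearly: each task depends on the one before it."""
    return {
        name: ([task_names[i - 1]] if i else [])
        for i, name in enumerate(task_names)
    }


if __name__ == "__main__":
    # Hypothetical channel records; the field names are placeholders.
    data = [
        {"day": "2021-12-25", "type": "click", "source": "app"},
        {"day": "2021-12-25", "type": "view", "source": "web"},
        {"day": "2021-12-26", "type": "click", "source": "web"},
    ]
    by_day = split_by_key(data, lambda r: r["day"])
    print(chain_tasks([f"task_{day}" for day in sorted(by_day)]))
```

Each batch then backs one offline task, and the dependency map returned by `chain_tasks` is one possible way to express the association among the tasks before they are handed to the scheduler.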
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments, and this embodiment is not described herein again.
Optionally, in this embodiment, the storage medium may include, but is not limited to: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing one or more computer devices (which may be personal computers, servers, network devices, or the like) to execute all or part of the steps of the method described in the embodiments of the present application.
In the above embodiments of the present application, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (10)

1. A method for processing an offline task is characterized by comprising the following steps:
acquiring historical channel data in a preset time period, dividing the historical channel data by adopting a calculation engine Spark, and then generating and arranging an offline task of the historical channel data;
and performing task scheduling on the offline task by adopting a batch scheduler Azkaban.
2. The method of claim 1, wherein prior to obtaining historical channel data for a preset time period, the method further comprises:
and locally configuring the running environment of the computing engine Spark, and configuring the master node of the computing engine Spark into a yarn mode.
3. The method of claim 1, wherein prior to obtaining historical channel data for a preset time period, the method further comprises:
installing JAVA language software development kit JDK and then configuring JAVA environment variables, wherein the JAVA environment variables include the JAVA_HOME configuration and a CLASSPATH file for recording all information of a project compilation environment, the information including: a source file path, a compiled class file storage path, a dependent JAR package path, running container information and dependent external project information;
configuring environment variables of a distributed system infrastructure HADOOP, wherein the environment variables of the HADOOP comprise the root directory configuration HADOOP_HOME of the HADOOP, the HADOOP_CONF_DIR configuration, the PATH configuration, source file updating and an environment variable validation test, and the HADOOP depends on the JAVA language software development kit JDK to run.
4. The method of claim 1, wherein obtaining historical channel data over a preset time period comprises:
reading channel data from a distributed file system hdfs of a distributed system infrastructure HADOOP by using a textFile method, or reading the channel data from a data source according to a configured data source address by using the textFile method;
and acquiring the generation time of all the read data, and acquiring the historical channel data with the generation time within the last week.
5. The method of claim 1, wherein the partitioning of the historical channel data using a compute engine Spark and then generating and scheduling offline tasks of the historical channel data comprises:
and dividing the historical channel data into a plurality of batches of data by adopting a computing engine Spark, and then generating and arranging off-line tasks of the historical channel data.
6. The method of claim 5, wherein dividing the historical channel data into a plurality of batches of data using a computing engine Spark, and then generating and arranging the offline tasks of the historical channel data, comprises one of:
dividing the historical channel data according to time periods by adopting a computing engine Spark to obtain multiple sets of historical data, creating multiple offline tasks for the multiple sets of historical data, and then establishing association among the offline tasks, wherein each set of historical data obtained through division corresponds to one of the multiple time periods, the time periods corresponding to any two sets of historical data are different, each set of offline task in the multiple set of offline tasks is used for processing one set of historical data, and the historical data processed by any two sets of offline tasks are different;
dividing the historical channel data according to data types by adopting a computing engine Spark to obtain multiple sets of historical data, creating a plurality of offline tasks for the multiple sets of historical data, and then establishing association among the offline tasks, wherein each set of historical data obtained by division corresponds to one of the multiple data types, and the data types corresponding to any two sets of historical data are different;
dividing the historical channel data according to the data volume by adopting a computing engine Spark to obtain multiple pieces of historical data with the same data volume, creating multiple offline tasks for the multiple pieces of historical data, and then establishing association among the offline tasks;
dividing the historical channel data according to data sources by adopting a computing engine Spark to obtain multiple sets of historical data, creating multiple offline tasks for the multiple sets of historical data, and then establishing association among the offline tasks, wherein each set of historical data obtained by division corresponds to one of the multiple data sources, and the data sources corresponding to any two sets of historical data are different.
7. The method of claim 5, wherein the dividing the historical channel data into a plurality of batches using a computing engine Spark and generating the offline task of the historical channel data comprises:
and randomly dividing the historical channel data by adopting a computing engine Spark to obtain multiple sets of historical data, creating a plurality of offline tasks for the multiple sets of historical data, and then establishing association among the offline tasks.
8. An apparatus for processing an offline task, comprising:
the task dividing unit is used for acquiring historical channel data in a preset time period, dividing the historical channel data by adopting a calculation engine Spark, and then generating and arranging an offline task of the historical channel data;
and the task scheduling unit is used for performing task scheduling on the offline task by adopting a batch scheduler Azkaban.
9. A storage medium, characterized in that the storage medium comprises a stored program, wherein the program when executed performs the method of any of the preceding claims 1 to 7.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the method of any of the preceding claims 1 to 7 by means of the computer program.
CN202111609393.2A 2021-12-27 2021-12-27 Processing method and device for offline tasks Pending CN114327820A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111609393.2A CN114327820A (en) 2021-12-27 2021-12-27 Processing method and device for offline tasks

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111609393.2A CN114327820A (en) 2021-12-27 2021-12-27 Processing method and device for offline tasks

Publications (1)

Publication Number Publication Date
CN114327820A (en) 2022-04-12

Family

ID=81012450

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111609393.2A Pending CN114327820A (en) 2021-12-27 2021-12-27 Processing method and device for offline tasks

Country Status (1)

Country Link
CN (1) CN114327820A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115686867A (en) * 2022-11-30 2023-02-03 北京市大数据中心 Data mining method, device, system, equipment and medium based on cloud computing

Similar Documents

Publication Publication Date Title
CN107577475B (en) Software package management method and system of data center cluster system
US7823023B2 (en) Test framework for testing an application
CN107451147B (en) Method and device for dynamically switching kafka clusters
CN103795759B (en) The dispatching method and system of a kind of virtual machine image file
CN104750555B (en) Process management method and device in a kind of Android program
CN111026723B (en) Big data cluster management control method and device, computer equipment and storage medium
CN103580908A (en) Server configuration method and system
CN103067501B (en) The large data processing method of PaaS platform
CN112099800A (en) Code data processing method and device and server
CN113220431A (en) Cross-cloud distributed data task scheduling method, device and storage medium
CN103034540A (en) Distributed information system, device and coordinating method thereof
CN103034541A (en) Distributing type information system and equipment and method thereof
CN108804100B (en) Method and device for creating interface element, storage medium and mobile terminal
CN112860282A (en) Upgrading method and device of cluster plug-in and server
CN114721809A (en) Application deployment method and device of kubernets cluster
CN114064213A (en) Kubernets container environment-based rapid arranging service method and system
CN105553732B (en) A kind of distributed network analogy method and system
CN114327820A (en) Processing method and device for offline tasks
CN109597627A (en) A kind of component mounting method, device, storage medium and processor
CN112181644B (en) Method, system and device for cross-domain machine learning component Jupitter
CN107493200B (en) Optical disc image file creating method, virtual machine deploying method and device
WO2023160418A1 (en) Resource processing method and resource scheduling method
US9176974B1 (en) Low priority, multi-pass, server file discovery and management
CN108595169A (en) A kind of visual programming method, cloud server and storage medium
CN114500268A (en) Deployment method, device, server and storage medium of chart resource

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination