WO2017186082A1 - Data scheduling method and device - Google Patents

Data scheduling method and device

Info

Publication number
WO2017186082A1
WO2017186082A1 (Application PCT/CN2017/081688)
Authority
WO
WIPO (PCT)
Prior art keywords
data
scheduling
cache queue
virtual cache
processing
Prior art date
Application number
PCT/CN2017/081688
Other languages
English (en)
French (fr)
Inventor
田鹏伟
江川
梁栋
毛怿
罗洪
曲秀赟
Original Assignee
西门子公司
田鹏伟
江川
梁栋
毛怿
罗洪
曲秀赟
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 西门子公司, 田鹏伟, 江川, 梁栋, 毛怿, 罗洪, 曲秀赟 filed Critical 西门子公司
Publication of WO2017186082A1 publication Critical patent/WO2017186082A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • G06F16/254Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2471Distributed queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Definitions

  • the present invention relates to the field of big data processing, and in particular, to a method and apparatus for data scheduling.
  • big data processing methods usually focus only on data processing within a specific data processing framework, such as batch data processing for historical data analysis (e.g., MapReduce), stream computing for real-time streaming data processing (e.g., Storm, S4), or graph computing for analysis of interrelated data (e.g., HAMA).
  • one of the problems addressed by an embodiment of the present invention is how to schedule data to different data processing architectures according to the data's function and processing duration requirement, thereby improving the big data processing rate.
  • a method for data scheduling comprising:
  • the scheduling label is used to identify a function of the source data and a processing duration requirement
  • the source data is dispatched to a corresponding target data cache or a corresponding target data processing framework according to the scheduling tag.
  • an apparatus for data scheduling comprising:
  • a label adding module configured to add a scheduling label to the source data according to a predetermined requirement, where the scheduling label is used to identify a function of the source data and a processing duration requirement;
  • a dispatching module, configured to dispatch the source data to a target data cache or a corresponding target data processing framework according to the scheduling label.
  • compared with the prior art, the present invention has the following advantages: the data scheduling method and device can schedule data to different target data caches or target data processing architectures according to the function of the data and the processing duration requirement, thereby ensuring that data is correctly dispatched to the target big data processing framework and improving the big data processing rate.
  • FIG. 1 is a flow chart of a method of data scheduling in accordance with one embodiment of the present invention.
  • FIG. 2 is a flow chart of another method of data scheduling in accordance with one embodiment of the present invention.
  • FIG. 3-1 is a schematic diagram of going from a first virtual cache queue to adding a second virtual cache queue in a method of data scheduling according to an embodiment of the present invention.
  • FIG. 3-2 is a schematic diagram of going from two virtual cache queues to adding a third virtual cache queue in a method of data scheduling according to an embodiment of the present invention.
  • FIG. 3-3 is a schematic diagram of the write and read operations of the virtual cache queues in a virtual cache queue set in a method of data scheduling according to an embodiment of the present invention.
  • FIG. 3-4 is a schematic diagram of deleting a virtual cache queue whose storage space is empty from a virtual cache queue set in a method of data scheduling according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram showing the structure of a system for processing data in a complex big data processing scheme according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a system for processing data in a complex big data processing scheme according to another embodiment of the present invention.
  • FIG. 6 is a structural block diagram of an apparatus for data scheduling according to an embodiment of the present invention.
  • FIG. 7 is a structural block diagram of another apparatus for data scheduling according to an embodiment of the present invention.
  • FIG. 8 is a structural block diagram of another apparatus for data scheduling according to an embodiment of the present invention.
  • FIG. 1 is a flow chart of a method of data scheduling in accordance with one embodiment of the present invention.
  • the method includes step S110 and step S120:
  • in step S110, the server adds a scheduling label to the source data according to a predetermined requirement; the scheduling label is used to identify the function of the source data and the processing duration requirement.
  • the scheduling label may include a function requirement field and a processing duration requirement field, for example:
  • scheduling label 1 is: F:batch (function: batch), T:Millisecond (duration: milliseconds);
  • scheduling label 2 is: F:real-time (function: real-time), T:NA (duration: not applicable);
  • scheduling label 3 is: F:incremental (function: incremental), T:NA (duration: not applicable);
  • scheduling label 4 is: F:batch (function: real-time), T:NA (duration: not applicable).
  • here, "F" denotes the function requirement field, including but not limited to batch, real-time, or incremental; it should be noted that the function requirement field is mandatory and NA cannot be used.
  • here, "T" denotes the processing duration requirement field, including but not limited to milliseconds or days; it should be noted that the processing duration requirement field is optional, and NA can be used if there is no processing duration requirement.
  • the function of the source data and the processing duration requirement are determined according to the requirements of the user; the user can formulate predetermined rules according to different requirements to determine the function and processing duration of the source data, for example according to different service types/application types.
  • for instance: for offline data the function is determined as batch and the duration as NA; for interactive data the function is real-time and the duration is milliseconds; for applications with high return-time requirements the function is real-time and the duration is milliseconds; for applications with lower return-time requirements the function is batch and the duration is NA; for the remaining applications the function is incremental and the duration is days.
  • the source data in the embodiment of the present invention is data to be processed, which may be a data packet (for example, a binary number, a text file, or a .rar package, etc.).
  • optionally, before the scheduling label is added to the source data according to the predetermined requirement, the source data is pre-processed; the pre-processing generally means data extraction and transformation (e.g., adjustment through an ETL (Extract-Transform-Load) process) so that the data has a predetermined format (such as a .rar package) or structure (a data schema, which generally describes the logic and characteristics of the data in a database on the basis of a certain data model).
  • the pre-processing may include: extracting a predetermined field in the source data, and storing the predetermined field as pre-defined format data.
  • the extraction rule for the predetermined field is generally determined according to the requirements of the user, such as extracting business data or interaction data, and the predefined format is generally the same as the data format/structure used by the data loading process, so as to ensure the loading speed and smoothness of the data.
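A small illustrative sketch of this pre-processing step (field extraction plus storage in a predefined format). The field names and the JSON output format are assumptions for illustration only; the actual predetermined fields and predefined format are chosen by the user and by the downstream loading process.

```python
import json

# Hypothetical extraction rule: the predetermined fields to keep are determined by user
# requirements (e.g. business fields or interaction fields of the raw record).
PREDETERMINED_FIELDS = ("device_id", "timestamp", "value")


def preprocess(raw_record: dict) -> str:
    """Extract the predetermined fields and store them as predefined-format data.

    JSON is used here purely for illustration; the actual predefined format should match
    the data format/structure used by the later data-loading process.
    """
    extracted = {field: raw_record.get(field) for field in PREDETERMINED_FIELDS}
    return json.dumps(extracted)


# Example: a raw record with extra fields is reduced to the predefined format.
print(preprocess({"device_id": "d1", "timestamp": 1493000000, "value": 3.2, "debug": "x"}))
```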
  • in step S120, the server dispatches the source data to a corresponding target data cache or a corresponding target data processing framework according to the scheduling label.
  • it should be noted that, specifically, according to the function requirement field and the processing duration requirement field, the source data is dispatched to the batch data processing, stream data processing, or graph data processing framework for data processing, or to the data cache corresponding to the batch data processing, stream data processing, or graph data processing framework.
  • the scheduling action may be performed based on the function and processing duration requirements in the scheduling label, that is, data is dispatched according to the mandatory field "F" and the optional field "T". For example: for scheduling label 1 (F:batch, T:Millisecond), the source data is dispatched to FW(3) or the third data cache corresponding to FW(3); for scheduling label 2 (F:real-time), the source data is dispatched to FW(2) (data processing framework 2) or the second data cache corresponding to FW(2); for scheduling label 3 (F:incremental), the source data is dispatched to FW(2); for scheduling label 4 (F:batch (function: real-time), T:NA), the source data is dispatched to FW(1) (data processing framework 1) or the first data cache corresponding to FW(1). Here FW(1), FW(2), and FW(3) denote different data processing frameworks; optionally, FW(1) may denote a graph data processing framework, FW(2) a stream data processing framework, and FW(3) a batch data processing framework.
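A minimal sketch of the dispatch rule just described, assuming plain string inputs for the F and T fields; the FW(1)/FW(2)/FW(3) identifiers follow the optional assignment given in the text (graph, stream, batch), and the function itself is an illustration rather than a definitive implementation.

```python
# Hypothetical framework identifiers, following the optional assignment in the text:
# FW(1) = graph data processing, FW(2) = stream data processing, FW(3) = batch data processing.
FW1_GRAPH, FW2_STREAM, FW3_BATCH = "FW(1)", "FW(2)", "FW(3)"


def dispatch(function: str, duration: str) -> str:
    """Choose the target framework (or its data cache) from the mandatory field F and the
    optional field T of a scheduling label, mirroring the four example labels above."""
    if function == "batch" and duration == "Millisecond":
        return FW3_BATCH        # scheduling label 1
    if function == "real-time":
        return FW2_STREAM       # scheduling label 2
    if function == "incremental":
        return FW2_STREAM       # scheduling label 3
    if function == "batch" and duration == "NA":
        return FW1_GRAPH        # scheduling label 4
    raise ValueError(f"no dispatch rule for F={function!r}, T={duration!r}")


print(dispatch("batch", "Millisecond"))   # FW(3)
print(dispatch("real-time", "NA"))        # FW(2)
```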
  • FIG. 2 is a flow chart of another method of data scheduling in accordance with one embodiment of the present invention.
  • the method includes step S210, step S220, step S230, and step S240:
  • in step S210, the server adds a scheduling label to the source data according to a predetermined requirement; the scheduling label is used to identify the function of the source data and the processing duration requirement.
  • in step S220, the server dispatches the source data to a corresponding target data cache according to the scheduling label.
  • in step S230, the dispatched source data is written to a first virtual cache queue in the target data cache.
  • a virtual cache queue is a form of virtual caching: a specific portion of the storage space is set aside as the address space of the virtual cache, and data is temporarily cached in that address space through the virtual cache queue.
  • the virtual cache queue in the embodiment of the present invention temporarily caches the received source data so that the data storage layer used for persistent data storage can read and load it.
  • the storage space of each virtual cache queue is determined according to the data type, the data storage space, or the user requirement; for example, if the data type is images (the typical data size is 1 MB-2 MB), each cache queue may be set to store 5 items of data, so that the storage space of each cache queue is 10 MB; if the data type is text information (the typical data size is 10 KB-100 KB), each cache queue may be set to store 10 items of data, so that the storage space of each cache queue is 1 MB.
  • the storage space of each cache queue can also be arbitrarily set according to the requirements of the user.
  • the embodiment of the present invention does not specifically limit the method of setting the storage space of each cache queue or the method of setting the storage-space size of each cache queue; any virtual cache having the virtual cache queue functionality should fall within the scope of protection of the invention.
  • in step S240, if the data stored in the first virtual cache queue reaches a threshold, a second virtual cache queue is added, and the dispatched source data is written into the second virtual cache queue.
  • specifically, a schematic diagram of going from the first virtual cache queue to adding the second virtual cache queue is shown in FIG. 3-1.
  • the threshold of the data stored in a virtual cache queue may be determined according to the data type, the data storage space, or the user requirement; for example, if the data type is images (the typical data size is 1 MB-2 MB), each cache queue is set to store 5 items of data, and the storage space of each cache queue is 10 MB, then the threshold may be set to 4 items of data and/or a total occupied storage space of 9 MB; the threshold may also be set as a storage-space utilization of the virtual cache queue of 90% or 95% or more, and so on.
  • normally, the second virtual cache queue has the same storage space as the first virtual cache queue, but a different storage space may also be set according to the needs of the user.
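A hedged sketch of one possible threshold policy combining the item-count and storage-utilization criteria mentioned above; the default capacities and the 90% limit are example values, not requirements of the text.

```python
def reached_threshold(item_count: int,
                      used_bytes: int,
                      capacity_items: int = 5,
                      capacity_bytes: int = 10 * 1024 * 1024,
                      utilization_limit: float = 0.90) -> bool:
    """Illustrative threshold check for a virtual cache queue: trigger on an item count of
    capacity - 1 (e.g. 4 of 5 items) and/or on storage-space utilization of 90% or more."""
    return (item_count >= capacity_items - 1
            or used_bytes >= capacity_bytes * utilization_limit)


# Example: an image queue (5 items / 10 MB) already holding 4 images is considered full.
print(reached_threshold(item_count=4, used_bytes=8 * 1024 * 1024))   # True
```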
  • after data is written into the first virtual cache queue or the second virtual cache queue, a read-data operation is provided for the corresponding target data processing framework.
  • specifically, once a cache queue has been added, the received data can be written into it, and once data has been written into a cache queue, the corresponding target data processing framework can perform the read-data operation; the order of write operations is consistent with the order of read-data operations.
  • that is, the virtual cache queue into which cached data was written first is read by the data storage layer first, and a virtual cache queue into which cached data was written later is read by the data storage layer only after the cached data in the preceding virtual cache queue has been completely read; in other words, the first virtual cache queue, which cached data first, is read by the data storage layer first, and the second virtual cache queue, which cached data later, is read by the data storage layer after the cached data in the first virtual cache queue has been completely read.
  • further, when the first virtual cache queue or the second virtual cache queue is empty, the first virtual cache queue or the second virtual cache queue is deleted; specifically, after a virtual cache queue has finished being read and its storage space is empty, that virtual cache queue is deleted.
  • the embodiment of the present invention is not limited to the case of two virtual cache queues.
  • the number of virtual cache queues can be increased arbitrarily within the range allowed by the specific portion of the storage space that has been set aside; that is, if the data stored in the virtual cache queue currently being written reaches the threshold, another virtual cache queue is added and the received data is written to that other virtual cache queue.
  • in addition, when a virtual cache queue is empty, that virtual cache queue is deleted so as to free storage space for newly added virtual cache queues.
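The following is a minimal, self-contained sketch of this expand-on-threshold behavior: writes go to the newest virtual cache queue, a new queue is added when the newest one is full, reads are served in write order, and a fully read queue is deleted. The class name and the simple item-count threshold are illustrative assumptions, not the patent's definitive implementation.

```python
from collections import deque


class VirtualCacheQueueSet:
    """Writes always go to the newest virtual cache queue; a new queue is added when the
    newest one is full; reads are served from the oldest queue, so read order equals write
    order; a queue whose storage space becomes empty is deleted."""

    def __init__(self, items_per_queue: int = 5):
        self.items_per_queue = items_per_queue
        self.queues = deque([deque()])    # the virtual cache queues, oldest first

    def write(self, item) -> None:
        if len(self.queues[-1]) >= self.items_per_queue:   # threshold reached ...
            self.queues.append(deque())                    # ... add another virtual cache queue
        self.queues[-1].append(item)

    def read(self):
        """Read one item for the data storage layer, oldest queue first; return None if empty."""
        while len(self.queues) > 1 and not self.queues[0]:
            self.queues.popleft()                          # delete a queue that was fully read
        if not self.queues[0]:
            return None
        item = self.queues[0].popleft()
        if not self.queues[0] and len(self.queues) > 1:
            self.queues.popleft()                          # storage space empty: delete the queue
        return item


# Usage: with a per-queue capacity of 2, writing 1..5 spreads the data over three queues,
# yet reads still come back in the original write order.
cache = VirtualCacheQueueSet(items_per_queue=2)
for value in range(1, 6):
    cache.write(value)
print([cache.read() for _ in range(5)])   # [1, 2, 3, 4, 5]
```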
  • optionally, each target data cache corresponds to one target data processing framework, and each data cache may include a virtual cache queue set.
  • a virtual cache queue set is a set composed of at least one virtual cache queue; a specific portion of the storage space is set aside as the address space of the virtual cache, and data is temporarily cached in that address space through the virtual cache queue set.
  • the virtual cache queue set in the embodiment of the present invention temporarily caches the received source data so that the data storage layer used for persistent data storage can read and load it.
  • the virtual cache queue set can exist alone or multiple virtual cache queue sets can exist at the same time.
  • the storage space of each virtual cache queue in the virtual cache queue set is determined according to the data type, the data storage space, or the user requirement; for example, if the data type is images (the typical data size is 1 MB-2 MB), each cache queue may be set to store 5 items of data, so that the storage space of each cache queue is 10 MB; if the data type is text information (the typical data size is 10 KB-100 KB), each cache queue may be set to store 10 items of data, so that the storage space of each cache queue is 1 MB.
  • the storage space of different virtual cache queues in the same virtual cache queue set may be the same or different, and may be set according to user requirements; the embodiment of the present invention does not specifically limit the method of setting the storage space of each cache queue or the method of setting the storage-space size of each cache queue.
  • specifically, a schematic diagram of going from two virtual cache queues to adding a third virtual cache queue is shown in FIG. 3-2.
  • it should be noted that the write-data operation is performed on the virtual cache queue most recently added to the virtual cache queue set, while the virtual cache queue that was added to the set earliest performs the read-data operation; that is, when data is read, it is read in the order in which the cached data was written.
  • in other words, the virtual cache queue into which cached data was written first is read by the data storage layer first, and a virtual cache queue into which cached data was written later is read by the data storage layer only after the cached data in the preceding virtual cache queue has been completely read.
  • for example, suppose the virtual cache queues added to the virtual cache queue set in chronological order are A, B, and C.
  • when the data stored in virtual cache queue C reaches the threshold, virtual cache queue D is added to the virtual cache queue set.
  • at this point, virtual cache queue D performs the write-cache-data operation and virtual cache queue A performs the read-data operation.
  • when virtual cache queue A has been completely read, virtual cache queue B performs the read-data operation, and so on; virtual cache queues C and D perform read-data operations in turn.
  • specifically, the write and read operations of the virtual cache queues in the virtual cache queue set are shown in FIG. 3-3, where the direction of an arrow indicates the direction of data flow: an arrow pointing into a virtual cache queue denotes a write operation, and an arrow pointing out of a virtual cache queue denotes a read operation.
  • further, when a virtual cache queue in the virtual cache queue set is empty, that virtual cache queue is deleted; specifically, after a virtual cache queue has finished being read and its storage space is empty, the virtual cache queue is deleted. For example, if the virtual cache queues added to the virtual cache queue set in chronological order are A, B, and C, then virtual cache queues A, B, and C perform read-data operations in turn; that is, when virtual cache queue A has been completely read, virtual cache queue B performs the read-data operation and virtual cache queue A is deleted, and so on: when virtual cache queue B has been completely read, virtual cache queue C performs the read-data operation and virtual cache queue B is deleted.
  • specifically, a schematic diagram of deleting a virtual cache queue whose storage space is empty from the virtual cache queue set is shown in FIG. 3-4.
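A compact, standalone simulation of the A/B/C/D example above, showing that reads come out strictly in write order and that each emptied queue is deleted; the queue names and contents are of course hypothetical.

```python
from collections import deque

# Queues added to the set in chronological order; C has just reached its threshold, so D was
# added and is the current write target, while A is the current read target.
queue_set = deque([("A", deque([1, 2])), ("B", deque([3, 4])),
                   ("C", deque([5, 6])), ("D", deque())])
queue_set[-1][1].append(7)                 # the write operation goes to the newest queue, D

while queue_set:
    name, queue = queue_set[0]
    if not queue:                          # storage space empty: delete the queue
        print(f"queue {name} is empty and is deleted")
        queue_set.popleft()
        continue
    print(f"read {queue.popleft()} from queue {name}")
# Reads come out strictly in write order: A, then B, then C, then D.
```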
  • FIG. 4 is a schematic diagram showing the structure of a system for processing data in a complex big data processing scheme according to an embodiment of the present invention.
  • the system structure is mainly divided into five parts: the source data 410, the data scheduling 420 according to the embodiment of the present invention, the data storage layer 430, the data processing layer 440, and the user application 450.
  • the source data 410 may be transmitted directly to the data scheduling 420, or may be pre-processed and then transmitted to the data scheduling 420.
  • the data scheduling 420 adds a scheduling label to the source data or the pre-processed source data according to a predetermined requirement, and dispatches the source data or the pre-processed source data to the corresponding target data processing framework according to the scheduling label.
  • the specific dispatching process includes: after the data storage layer 430 receives the source data plus the scheduling label, or the pre-processed source data plus the scheduling label, it loads 4301 the data, in the data format/structure defined/required by the data processing framework itself, into storage for the specific data processing framework (this can be implemented, for example, through the loading step of an ETL process, a DBMS 4302 (Database Management System), and a DFS 4303 (Depth-First-Search, a depth-first search algorithm)).
  • the predefined format data in the data loading 4301 process should be consistent with the data record in the source data packet, and the data format should be defined by the user according to the business requirements and the stored data, such as a relational database or a distributed file system.
  • the DBMS can meet the relational data storage requirements of the database of the big data processing framework, and can choose MySQL (a relational database management system) or MS SQL (a database platform).
  • DFS can also meet the file data storage requirements of the database of the big data processing framework, and HDFS (a distributed file system) can be selected.
  • the data storage layer 430 sends the corresponding data packets or files to the data processing layer 440, and the data processing layer 440 combines various data processing frameworks or technologies according to user requirements.
  • optionally, it may include a batch data processing framework 4401, a stream data processing framework 4402, and/or a graph data processing framework 4403.
  • User application 450 is a business logic/algorithm-oriented big data processing application built on the data processing layer.
  • optionally, the predefined format data may be defined separately for different big data processing frameworks and other data processing procedures.
  • it should be noted that after the source data is dispatched to the corresponding target data processing framework, that is, before it enters the data storage layer 430, the scheduling label is deleted so as not to affect the loading and processing of subsequent data.
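A small illustrative sketch of this step: the scheduling label is stripped before the record enters the storage layer, and the predefined-format data is then placed in a DBMS-like or DFS-like store. The in-memory lists merely stand in for real backends such as MySQL/MS SQL or HDFS, and the field names are assumptions.

```python
# In-memory stand-ins for the storage backends behind the data loading step 4301; in the text
# these would be a DBMS (e.g. MySQL or MS SQL) or a distributed file system (e.g. HDFS).
relational_store: list = []
file_store: list = []


def load_into_storage_layer(record_with_label: dict) -> None:
    """Strip the scheduling label before the record enters the data storage layer, then load
    the predefined-format data into the backend appropriate for the stored data."""
    record = dict(record_with_label)
    label = record.pop("scheduling_label", None)    # the label is only used while dispatching
    if label is None:
        raise ValueError("record carries no scheduling label (or it was already removed)")
    if record.get("storage") == "relational":
        relational_store.append(record)             # DBMS-style storage
    else:
        file_store.append(str(record))              # DFS-style storage


load_into_storage_layer({"scheduling_label": {"F": "batch", "T": "NA"},
                         "storage": "relational", "device_id": "d1", "value": 3.2})
print(relational_store)    # the stored record no longer contains the scheduling label
```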
  • the embodiment of the invention proposes data scheduling based on function and processing time requirements, which properly solves the data distribution problem in the complex big data processing scheme.
  • in the data scheduling, a scheduling label is attached to the source data packet according to a predetermined requirement, so the data can be correctly dispatched to the target big data processing framework; this method can meet most data-processing-related requirements.
  • in addition, to ensure that the data is loaded successfully, the predefined-format data stored after pre-processing the source data must be consistent with the data records defined by the data loading process in the data storage layer.
  • the data scheduling method described in the embodiments of the present invention can be widely applied to industrial data analysis/processing, especially for complex big data processing processes, such as anomaly detection.
  • FIG. 5 is a schematic structural diagram of a system for processing data in a complex big data processing scheme according to an embodiment of the present invention.
  • the system structure is mainly divided into six parts: the source data 510, the data scheduling 520 responsible for data dispatching, the data cache 530 according to the embodiment of the present invention, the data storage layer 540, the data processing layer 550, and the user application 560.
  • the source data 510 may be transmitted directly to the data scheduling 520, or may be pre-processed and then transmitted to the data scheduling 520.
  • the data scheduling 520 adds a scheduling label to the source data 510 or the pre-processed source data 510 according to a predetermined requirement, dispatches the source data 510 or the pre-processed source data 510 to the virtual cache queue set of the corresponding target data processing framework according to the scheduling label, and then deletes the scheduling label; the format of the pre-processed source data should be the same as the format predefined in the data recording process performed by the data storage layer 540.
  • the specific dispatching process includes: based on a distributed message queue mechanism (DMQ) 5301, the data cache 530 dispatches the received data to the virtual cache queue set of the corresponding target processing framework according to the scheduling label, and the virtual cache queue set of each target processing framework writes the dispatched data to the virtual cache queues in that set based on a horizontal expansion mechanism (that is, if the data stored in a virtual cache queue of a target processing framework's virtual cache queue set reaches the threshold, another virtual cache queue is added to the set and the received data is written to that other virtual cache queue), for example: the virtual cache queue set VQS1 of the batch data processing framework, the virtual cache queue set VQS2 of the stream data processing framework, and/or the virtual cache queue set VQS3 of the graph data processing framework, where VQS1, VQS2, and VQS3 each consist of at least one virtual cache queue (VQ).
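A minimal sketch of the DMQ-based routing just described, with the distributed message queue and the virtual cache queue sets reduced to in-process structures purely for illustration; the framework keys and the per-queue threshold are assumptions, and a real deployment would use an actual distributed message queue rather than this stand-in.

```python
from collections import deque

# One virtual cache queue set per target processing framework, keyed by an illustrative
# framework identifier (VQS1: batch, VQS2: stream, VQS3: graph).
QUEUE_SETS = {
    "batch":  deque([deque()]),   # VQS1
    "stream": deque([deque()]),   # VQS2
    "graph":  deque([deque()]),   # VQS3
}
ITEMS_PER_QUEUE = 5               # hypothetical per-queue threshold


def dmq_dispatch(data, scheduling_label: dict) -> None:
    """Stand-in for the DMQ 5301: route data to the queue set of the framework chosen by the
    scheduling label, expanding that set horizontally when its newest queue is full."""
    queue_set = QUEUE_SETS[scheduling_label["target_framework"]]
    if len(queue_set[-1]) >= ITEMS_PER_QUEUE:
        queue_set.append(deque())         # horizontal expansion: add another VQ to this set
    queue_set[-1].append(data)


dmq_dispatch({"value": 1}, {"target_framework": "stream"})
print(len(QUEUE_SETS["stream"][0]))       # 1
```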
  • the data storage layer 540 loads 5401, through an interface, the virtual cache queues in the virtual cache queue set of each target processing framework and reads the virtual cache queues in each virtual cache queue set in sequence (this can be implemented, for example, through the loading step of an ETL process, a DBMS 5402 (Database Management System), and a DFS 5403 (Depth-First-Search, a depth-first search algorithm)).
  • the format data pre-defined in the process of data loading 5401 should be consistent with the data record in the source data packet, and the data format should be defined by the user according to business requirements and stored data, such as a relational database or a distributed file system.
  • the DBMS can meet the relational data storage requirements of the database of the big data processing framework, and can choose MySQL (a relational database management system) or MS SQL (a database platform).
  • DFS can also meet the file data storage requirements of the database of the big data processing framework, and HDFS (a distributed file system) can be selected.
  • the data storage layer 540 sends the corresponding data packets or files to the data processing layer 550.
  • the data processing layer 550 combines various data processing frameworks or technologies according to the requirements of the user.
  • optionally, the data processing layer 550 may include a batch data processing framework 5501, a stream data processing framework 5502, and/or a graph data processing framework 5503.
  • User application 560 is a business logic/algorithm-oriented big data processing application built on the data processing layer.
  • the block structure of VQS1, VQS2, and VQS3 in FIG. 5 is merely an example of a data structure, and the scope of protection of the present invention is not limited.
  • it should be noted that the loading component of the data storage layer 540 only needs to be provided with the identity information of the target processing framework in order to sequentially read the virtual cache queues of the virtual cache queue set of the corresponding target processing framework in the data cache 530. After the source data is dispatched to the virtual cache queue set corresponding to the target processing framework, that is, before it enters the data cache 530, the scheduling label is deleted to avoid affecting the caching, storage, and processing of subsequent data.
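A hedged sketch of the loading component's interface as described above: given only the identity of the target processing framework, it reads that framework's virtual cache queue set sequentially and deletes queues that have been fully read. The data structures and names are illustrative.

```python
from collections import deque


def load_for_framework(queue_sets: dict, framework_id: str):
    """Loading component of the data storage layer: given only the identity of the target
    processing framework, read its virtual cache queue set sequentially (oldest queue first)
    and yield records for persistent storage; fully read queues are deleted."""
    queue_set = queue_sets[framework_id]
    while queue_set:
        if not queue_set[0]:              # storage space empty: delete the queue
            queue_set.popleft()
            continue
        yield queue_set[0].popleft()


# Example with a tiny illustrative cache for the stream framework (VQS2).
caches = {"stream": deque([deque(["r1", "r2"]), deque(["r3"])])}
print(list(load_for_framework(caches, "stream")))   # ['r1', 'r2', 'r3']
```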
  • the embodiment of the invention proposes data scheduling based on function and processing time requirements, which properly solves the data distribution problem in the complex big data processing scheme.
  • in the data scheduling, a scheduling label is attached to the source data packet according to a predetermined requirement, so the data can be correctly dispatched to the target big data processing framework; this method can meet most data-processing-related requirements.
  • in addition, to ensure that the data is loaded successfully, the predefined-format data stored after pre-processing the source data must be consistent with the data records defined by the data loading process in the data storage layer.
  • the data scheduling method described in the embodiments of the present invention can be widely applied to industrial data analysis/processing, especially to complex big data processing processes such as anomaly detection.
  • at the same time, the data cache based on the horizontal expansion mechanism implements horizontally expanding caching of data through the virtual cache queues, without needing to transfer cached data, thereby avoiding data loss.
  • in the case of multiple target processing architectures, each target processing architecture has its own corresponding, independent virtual cache queue set, which is maintained uniformly as a system through the DMQ; at the same time, the scheduling label allows data to be correctly dispatched to the virtual cache queue set of the target big data processing framework.
  • multiple virtual cache queue sets can simultaneously cache data. This method can meet most data processing related requirements, which greatly improves the efficiency of data caching, loading and processing.
  • the data scheduling method described in the embodiments of the present invention can be widely applied to industrial data analysis/processing, especially for complex big data processing processes, such as anomaly detection.
  • FIG. 6 is a structural block diagram of an apparatus for data scheduling according to an embodiment of the present invention.
  • the device may be disposed in the server or may be used separately from the server.
  • the device includes a tag adding module 610 and a dispatching module 620:
  • the tag adding module 610 is configured to add the scheduling tag to the source data according to a predetermined requirement, where the scheduling tag is used to identify the function of the source data and the processing duration requirement.
  • the scheduling label may include a function requirement field and a processing duration requirement field, for example:
  • scheduling label 1 is: F:batch (function: batch), T:Millisecond (duration: milliseconds);
  • scheduling label 2 is: F:real-time (function: real-time), T:NA (duration: not applicable);
  • scheduling label 3 is: F:incremental (function: incremental), T:NA (duration: not applicable);
  • scheduling label 4 is: F:batch (function: real-time), T:NA (duration: not applicable).
  • here, "F" denotes the function requirement field, including but not limited to batch, real-time, or incremental; it should be noted that the function requirement field is mandatory and NA cannot be used.
  • here, "T" denotes the processing duration requirement field, including but not limited to milliseconds or days; it should be noted that the processing duration requirement field is optional, and NA can be used if there is no processing duration requirement.
  • the function of the source data and the processing duration requirement are determined according to the requirements of the user; the user can formulate predetermined rules according to different requirements to determine the function and processing duration of the source data, for example according to different service types/application types.
  • for instance: for offline data the function is determined as batch and the duration as NA; for interactive data the function is real-time and the duration is milliseconds; for applications with high return-time requirements the function is real-time and the duration is milliseconds; for applications with lower return-time requirements the function is batch and the duration is NA; for the remaining applications the function is incremental and the duration is days.
  • the dispatching module 620 is configured to dispatch the source data to a target data cache or a corresponding target data processing framework according to the scheduling label.
  • it should be noted that, specifically, according to the function requirement field and the processing duration requirement field, the source data is dispatched to the batch data processing, stream data processing, or graph data processing framework for data processing, or to the data cache corresponding to the batch data processing, stream data processing, or graph data processing framework.
  • the scheduling action may be performed based on the function and processing duration requirements in the scheduling label, that is, data is dispatched according to the mandatory field "F" and the optional field "T". For example: for scheduling label 1 (F:batch, T:Millisecond), the source data is dispatched to FW(3) or the third data cache corresponding to FW(3); for scheduling label 2 (F:real-time), the source data is dispatched to FW(2) (data processing framework 2) or the second data cache corresponding to FW(2); for scheduling label 3 (F:incremental), the source data is dispatched to FW(2); for scheduling label 4 (F:batch (function: real-time), T:NA), the source data is dispatched to FW(1) (data processing framework 1) or the first data cache corresponding to FW(1). Here FW(1), FW(2), and FW(3) denote different data processing frameworks; optionally, FW(1) may denote a graph data processing framework, FW(2) a stream data processing framework, and FW(3) a batch data processing framework.
  • FIG. 7 is a structural block diagram of an apparatus for data scheduling according to another embodiment of the present invention.
  • the device may be disposed in the server or may be used separately from the server.
  • the device includes a tag adding module 710, a dispatching module 720, a pre-processing module 730, and a tag deleting module 740.
  • the tag adding module 710 is configured to add the scheduling tag to the source data according to a predetermined requirement, where the scheduling tag is used to identify the function of the source data and the processing duration requirement.
  • the dispatching module 720 is configured to allocate the source data to the target data cache or the corresponding target data processing framework according to the scheduling tag.
  • the pre-processing module 730 is configured to pre-process the source data; specifically, it is configured to extract a predetermined field from the source data and store the predetermined field as predefined-format data.
  • optionally, before the scheduling label is added to the source data according to the predetermined requirement, the source data is pre-processed; the pre-processing generally means data extraction and transformation (e.g., adjustment through an ETL (Extract-Transform-Load) process) so that the data has a predetermined format (such as a .rar package) or structure (a data schema, which generally describes the logic and characteristics of the data in a database on the basis of a certain data model).
  • the pre-processing may include: extracting a predetermined field in the source data, and storing the predetermined field as pre-defined format data.
  • the extraction rule for the predetermined field is generally determined according to the requirements of the user, such as extracting business data or interaction data, and the predefined format is generally the same as the data format/structure used by the data loading process, so as to ensure the loading speed and smoothness of the data.
  • the label deletion module 740 is configured to delete the scheduling label after the source data is allocated to the corresponding target data processing framework. That is, the scheduling tag is only used in the data scheduling process, and after the data scheduling is completed, the scheduling tag is deleted. Take the system structure diagram of data processing in the complex big data processing scheme shown in FIG. 4 as an example. After the source data is dispatched to the corresponding target data processing framework, that is, before entering the data storage layer 430 or the data cache 530, the scheduling tag is deleted, so as not to affect the loading and processing of subsequent data.
  • FIG. 8 is a structural block diagram of an apparatus for data scheduling according to still another embodiment of the present invention.
  • the device may be disposed in the server or may be used separately from the server.
  • the device includes a tag adding module 810, a dispatching module 820, a writing module 830, a queue adding module 840, a reading module 850, and a queue deleting module 860.
  • the tag adding module 810 is configured to add the scheduling tag to the source data according to a predetermined requirement, where the scheduling tag is used to identify the function of the source data and the processing duration requirement.
  • the dispatching module 820 is configured to allocate the source data to the target data cache according to the scheduling tag.
  • the write module 830 is configured to write the dispatched source data to the first virtual cache queue in the target data cache.
  • the queue adding module 840 is configured to add a second virtual cache queue and write the allocated source data to the second virtual cache queue if the data stored in the first virtual cache queue reaches a threshold.
  • the reading module 850 is configured to provide a read data operation for the corresponding target data processing framework after the data is written in the first virtual cache queue or the second virtual cache queue.
  • specifically, once a cache queue has been added, the received data can be written into it, and once data has been written into a cache queue, the corresponding target data processing framework can perform the read-data operation; the order of write operations is consistent with the order of read-data operations.
  • that is, the virtual cache queue into which cached data was written first is read by the data storage layer first, and a virtual cache queue into which cached data was written later is read by the data storage layer only after the cached data in the preceding virtual cache queue has been completely read.
  • in other words, the first virtual cache queue, which cached data first, is read by the data storage layer first, and the second virtual cache queue, which cached data later, is read by the data storage layer after the cached data in the first virtual cache queue has been completely read.
  • the queue deletion module 860 is configured to delete the first virtual cache queue or the second virtual cache queue when the first virtual cache queue or the second virtual cache queue is empty.
  • the embodiment of the present invention is not limited to the case of two virtual cache queues.
  • the number of virtual cache queues can be increased arbitrarily within the range allowed by the specific portion of the storage space that has been set aside; that is, if the data stored in the virtual cache queue currently being written reaches the threshold, another virtual cache queue is added and the received data is written to that other virtual cache queue.
  • in addition, when a virtual cache queue is empty, that virtual cache queue is deleted so as to free storage space for newly added virtual cache queues.
  • the data scheduling method and apparatus according to the embodiments of the present invention can schedule data to different data processing architectures according to the data's function and processing duration requirements, thereby ensuring that data can be correctly dispatched to the target big data processing framework and improving the big data processing rate.
  • the data is stored in a predefined format before the scheduling tag is added, which ensures the successful loading of the data, thereby facilitating subsequent data recording and data processing.
  • the added scheduling tag is deleted after the data scheduling is completed, which avoids the impact on subsequent data loading and data processing.
  • at the same time, the data cache based on the horizontal expansion mechanism implements horizontally expanding caching of data through the virtual cache queues, without needing to transfer cached data, thereby avoiding data loss.
  • in the case of multiple target processing architectures, each target processing architecture has its own corresponding, independent virtual cache queue set, which is maintained uniformly as a system through the DMQ; at the same time, the scheduling label allows data to be correctly dispatched to the virtual cache queue set of the target big data processing framework.
  • multiple virtual cache queue sets can simultaneously cache data. This method can meet most data processing related requirements, which greatly improves the efficiency of data caching, loading and processing.
  • the data scheduling method described in the embodiments of the present invention can be widely applied to industrial data analysis/processing, especially for complex big data processing processes, such as anomaly detection, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data scheduling method and device. The method comprises: adding a scheduling label to source data according to a predetermined requirement, the scheduling label being used to identify the function of the source data and a processing duration requirement (S110); and dispatching the source data to a corresponding target data processing framework according to the scheduling label (S120). The method can schedule data to different data processing architectures for data processing according to the function of the data and the processing duration requirement, ensuring that the data can be correctly dispatched to the target big data processing framework, thereby improving the big data processing rate.

Description

Data scheduling method and device

Technical Field

The present invention relates to the field of big data processing, and in particular to a data scheduling method and device.

Background

In recent years, with the rapid development and widespread application of computer and information technology, the scale of industry application systems has expanded rapidly, and the data generated by industry applications has grown explosively; at the present stage, big data has far exceeded the processing capability of existing traditional computing technologies and information systems.

At present, big data processing methods usually focus only on data processing within a specific data processing framework, for example batch data processing for historical data analysis (e.g., MapReduce), stream computing for real-time streaming data processing (e.g., Storm, S4), or graph computing for analysis of interrelated data (e.g., HAMA). The prior art contains no record of how data is scheduled among multiple data processing frameworks, so comprehensive analysis and targeted integration of the data cannot be performed.
Summary of the Invention

In view of this, one of the problems addressed by an embodiment of the present invention is how to schedule data to different data processing architectures for data processing according to the function of the data and the processing duration requirement, thereby improving the big data processing rate.

According to an embodiment of the present invention, a data scheduling method is provided, the method comprising:

adding a scheduling label to source data according to a predetermined requirement, the scheduling label being used to identify the function of the source data and a processing duration requirement;

dispatching the source data to a corresponding target data cache or a corresponding target data processing framework according to the scheduling label.

According to an embodiment of the present invention, a data scheduling device is provided, the device comprising:

a label adding module, configured to add a scheduling label to source data according to a predetermined requirement, the scheduling label being used to identify the function of the source data and a processing duration requirement;

a dispatching module, configured to dispatch the source data to a target data cache or a corresponding target data processing framework according to the scheduling label.

Compared with the prior art, the present invention has the following advantages: the data scheduling method and device can schedule data to different target data caches or target data processing architectures for data processing according to the function of the data and the processing duration requirement, ensuring that the data can be correctly dispatched to the target big data processing framework, thereby improving the big data processing rate.
Brief Description of the Drawings

Other features, characteristics, advantages and benefits of the present invention will become more apparent from the following detailed description taken in conjunction with the accompanying drawings.

FIG. 1 is a flow chart of a data scheduling method according to an embodiment of the present invention.

FIG. 2 is a flow chart of another data scheduling method according to an embodiment of the present invention.

FIG. 3-1 is a schematic diagram of going from a first virtual cache queue to adding a second virtual cache queue in a data scheduling method according to an embodiment of the present invention.

FIG. 3-2 is a schematic diagram of going from two virtual cache queues to adding a third virtual cache queue in a data scheduling method according to an embodiment of the present invention.

FIG. 3-3 is a schematic diagram of the write and read operations of the virtual cache queues in a virtual cache queue set in a data scheduling method according to an embodiment of the present invention.

FIG. 3-4 is a schematic diagram of deleting a virtual cache queue whose storage space is empty from a virtual cache queue set in a data scheduling method according to an embodiment of the present invention.

FIG. 4 is a schematic structural diagram of a system for data processing in a complex big data processing scheme according to an embodiment of the present invention.

FIG. 5 is a schematic structural diagram of a system for data processing in a complex big data processing scheme according to another embodiment of the present invention.

FIG. 6 is a structural block diagram of a data scheduling device according to an embodiment of the present invention.

FIG. 7 is a structural block diagram of another data scheduling device according to an embodiment of the present invention.

FIG. 8 is a structural block diagram of yet another data scheduling device according to an embodiment of the present invention.
Detailed Description

Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although the drawings show preferred embodiments of the present disclosure, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth here; rather, these embodiments are provided so that the present disclosure will be more thorough and complete and so that the scope of the present disclosure can be fully conveyed to those skilled in the art.
FIG. 1 is a flow chart of a data scheduling method according to an embodiment of the present invention. The method includes step S110 and step S120:

In step S110, the server adds a scheduling label to the source data according to a predetermined requirement, the scheduling label being used to identify the function of the source data and the processing duration requirement.

The scheduling label may include a function requirement field and a processing duration requirement field, for example: scheduling label 1 is F:batch (function: batch), T:Millisecond (duration: milliseconds); scheduling label 2 is F:real-time (function: real-time), T:NA (duration: not applicable); scheduling label 3 is F:incremental (function: incremental), T:NA (duration: not applicable); scheduling label 4 is F:batch (function: real-time), T:NA (duration: not applicable). Here, "F" denotes the function requirement field, including but not limited to batch, real-time or incremental; it should be noted that the function requirement field is mandatory and NA cannot be used. "T" denotes the processing duration requirement field, including but not limited to milliseconds or days; it should be noted that the processing duration requirement field is optional, and NA can be used if there is no specific processing duration requirement.

The function of the source data and the processing duration requirement are determined according to the requirements of the user. The user can formulate predetermined rules according to different requirements to determine the function and processing duration of the source data; for example, the user can determine different functions and processing durations of the source data according to different service types/application types: for offline data the function is determined as batch and the duration as NA; for interactive data the function is real-time and the duration is milliseconds; for applications with high return-time requirements the function is real-time and the duration is milliseconds; for applications with lower return-time requirements the function is batch and the duration is NA; and for the remaining applications the function is incremental and the duration is days.

The source data in the embodiment of the present invention is data to be processed, which may be a data packet (for example, a binary number, a text file, or a .rar package). Optionally, before the scheduling label is added to the source data according to the predetermined requirement, the source data is pre-processed. Pre-processing generally means data extraction and transformation (e.g., adjustment through an ETL (Extract-Transform-Load) process) so that the data has a predetermined format (such as a .rar data packet) or structure (a data schema, which generally describes the logic and characteristics of the data in a database on the basis of a certain data model). The pre-processing in the embodiment of the present invention may include: extracting a predetermined field from the source data and storing the predetermined field as predefined-format data. The extraction rule for the predetermined field is generally determined according to the requirements of the user, such as extracting business data or interaction data; the predefined format is generally the same as the data format/structure used by the data loading process, so as to ensure the loading speed and smoothness of the data.

In step S120, the server dispatches the source data to a corresponding target data cache or a corresponding target data processing framework according to the scheduling label.

It should be noted that, specifically, according to the function requirement field and the processing duration requirement field, the source data is dispatched to the batch data processing, stream data processing or graph data processing framework for data processing, or to the data cache corresponding to the batch data processing, stream data processing or graph data processing framework. The scheduling action may be performed based on the function and processing duration requirements in the scheduling label, that is, data is dispatched according to the mandatory field "F" and the optional field "T". For example: for scheduling label 1 (F:batch, T:Millisecond), the source data is dispatched to FW(3) or the third data cache corresponding to FW(3); for scheduling label 2 (F:real-time), the source data is dispatched to FW(2) (data processing framework 2) or the second data cache corresponding to FW(2); for scheduling label 3 (F:incremental), the source data is dispatched to FW(2); for scheduling label 4 (F:batch (function: real-time), T:NA), the source data is dispatched to FW(1) (data processing framework 1) or the first data cache corresponding to FW(1). Here FW(1), FW(2) and FW(3) denote different data processing frameworks; optionally, FW(1) may denote a graph data processing framework, FW(2) a stream data processing framework and FW(3) a batch data processing framework.
FIG. 2 is a flow chart of another data scheduling method according to an embodiment of the present invention. The method includes step S210, step S220, step S230 and step S240:

In step S210, the server adds a scheduling label to the source data according to a predetermined requirement, the scheduling label being used to identify the function of the source data and the processing duration requirement.

In step S220, the server dispatches the source data to a corresponding target data cache according to the scheduling label.

In step S230, the dispatched source data is written to a first virtual cache queue in the target data cache.

A virtual cache queue is a form of virtual caching: a specific portion of the storage space is set aside as the address space of the virtual cache, and data is temporarily cached in that address space through the virtual cache queue. The virtual cache queue in the embodiment of the present invention temporarily caches the received source data so that the data storage layer used for persistent data storage can read and load it.

The storage space of each virtual cache queue is determined according to the data type, the data storage space or the user requirement. For example, if the data type is images (the typical data size is 1 MB-2 MB), each cache queue may be set to store 5 items of data, so that the storage space of each cache queue is 10 MB; if the data type is text information (the typical data size is 10 KB-100 KB), each cache queue may be set to store 10 items of data, so that the storage space of each cache queue is 1 MB. The storage space of each cache queue can also be set arbitrarily according to the requirements of the user. The embodiment of the present invention does not specifically limit the method of setting the storage space of each cache queue or the method of setting the storage-space size of each cache queue; any virtual cache having the virtual cache queue functionality should fall within the scope of protection of the invention.

In step S240, if the data stored in the first virtual cache queue reaches a threshold, a second virtual cache queue is added, and the dispatched source data is written into the second virtual cache queue.

Specifically, a schematic diagram of going from the first virtual cache queue to adding the second virtual cache queue is shown in FIG. 3-1.

The threshold of the data stored in a virtual cache queue may be determined according to the data type, the data storage space or the user requirement. For example, if the data type is images (the typical data size is 1 MB-2 MB), each cache queue is set to store 5 items of data and the storage space of each cache queue is 10 MB, then the threshold may be set to 4 items of data and/or a total occupied storage space of 9 MB; the threshold may also be set as a storage-space utilization of the virtual cache queue of 90% or 95% or more, and so on.

Normally, the second virtual cache queue has the same storage space as the first virtual cache queue, but a different storage space may also be set according to the needs of the user.

It should be noted that after data is written into the first virtual cache queue or the second virtual cache queue, a read-data operation is provided for the corresponding target data processing framework. Specifically, once a cache queue has been added, the received data can be written into it, and once data has been written into a cache queue, the corresponding target data processing framework can perform the read-data operation. The order of write operations is consistent with the order of read-data operations, that is, the virtual cache queue into which cached data was written first is read by the data storage layer first, and a virtual cache queue into which cached data was written later is read by the data storage layer only after the cached data in the preceding virtual cache queue has been completely read; in other words, the first virtual cache queue, which cached data first, is read by the data storage layer first, and the second virtual cache queue, which cached data later, is read by the data storage layer after the cached data in the first virtual cache queue has been completely read.

Further, when the first virtual cache queue or the second virtual cache queue is empty, the first virtual cache queue or the second virtual cache queue is deleted. Specifically, after a virtual cache queue has finished being read and its storage space is empty, that virtual cache queue is deleted.

The embodiment of the present invention is not limited to the case of two virtual cache queues; the number of virtual cache queues can be increased arbitrarily within the range allowed by the specific portion of the storage space that has been set aside, that is, if the data stored in the virtual cache queue currently being written reaches the threshold, another virtual cache queue is added and the received data is written to that other virtual cache queue. In addition, when a virtual cache queue is empty, that virtual cache queue is deleted so as to free storage space for newly added virtual cache queues.
Optionally, each target data cache corresponds to one target data processing framework, and each data cache may include a virtual cache queue set. A virtual cache queue set is a set composed of at least one virtual cache queue; a specific portion of the storage space is set aside as the address space of the virtual cache, and data is temporarily cached in that address space through the virtual cache queue set. The virtual cache queue set in the embodiment of the present invention temporarily caches the received source data so that the data storage layer used for persistent data storage can read and load it. A virtual cache queue set can exist alone, or multiple virtual cache queue sets can exist at the same time.

The storage space of each virtual cache queue in the virtual cache queue set is determined according to the data type, the data storage space or the user requirement. For example, if the data type is images (the typical data size is 1 MB-2 MB), each cache queue may be set to store 5 items of data, so that the storage space of each cache queue is 10 MB; if the data type is text information (the typical data size is 10 KB-100 KB), each cache queue may be set to store 10 items of data, so that the storage space of each cache queue is 1 MB. The storage space of different virtual cache queues in the same virtual cache queue set may be the same or different, and may be set according to user requirements. The embodiment of the present invention does not specifically limit the method of setting the storage space of each cache queue or the method of setting the storage-space size of each cache queue.

Specifically, a schematic diagram of going from two virtual cache queues to adding a third virtual cache queue is shown in FIG. 3-2.

It should be noted that the write-data operation is performed on the virtual cache queue most recently added to the virtual cache queue set, while the virtual cache queue that was added to the set earliest performs the read-data operation; that is, when data is read, it is read in the order in which the cached data was written: the virtual cache queue into which cached data was written first is read by the data storage layer first, and a virtual cache queue into which cached data was written later is read by the data storage layer only after the cached data in the preceding virtual cache queue has been completely read. For example, suppose the virtual cache queues added to the virtual cache queue set in chronological order are A, B and C. When the data stored in virtual cache queue C reaches the threshold, virtual cache queue D is added to the virtual cache queue set; at this point virtual cache queue D performs the write-cache-data operation and virtual cache queue A performs the read-data operation. When virtual cache queue A has been completely read, virtual cache queue B performs the read-data operation, and so on; virtual cache queues C and D perform read-data operations in turn.

Specifically, the write and read operations of the virtual cache queues in the virtual cache queue set are shown in FIG. 3-3, where the direction of an arrow indicates the direction of data flow: an arrow pointing into a virtual cache queue denotes a write operation, and an arrow pointing out of a virtual cache queue denotes a read operation.

Further, when a virtual cache queue in the virtual cache queue set is empty, that virtual cache queue is deleted. Specifically, after a virtual cache queue has finished being read and its storage space is empty, the virtual cache queue is deleted. For example, if the virtual cache queues added to the virtual cache queue set in chronological order are A, B and C, then virtual cache queues A, B and C perform read-data operations in turn; that is, when virtual cache queue A has been completely read, virtual cache queue B performs the read-data operation and virtual cache queue A is deleted, and so on: when virtual cache queue B has been completely read, virtual cache queue C performs the read-data operation and virtual cache queue B is deleted.

Specifically, a schematic diagram of deleting a virtual cache queue whose storage space is empty from the virtual cache queue set is shown in FIG. 3-4.
FIG. 4 is a schematic structural diagram of a system for data processing in a complex big data processing scheme according to an embodiment of the present invention. The system structure is mainly divided into five parts: the source data 410, the data scheduling 420 according to the embodiment of the present invention, the data storage layer 430, the data processing layer 440 and the user application 450.

It should be noted that the source data 410 may be transmitted directly to the data scheduling 420, or may be pre-processed and then transmitted to the data scheduling 420. The data scheduling 420 adds a scheduling label to the source data or the pre-processed source data according to a predetermined requirement, and dispatches the source data or the pre-processed source data to the corresponding target data processing framework according to the scheduling label. The specific dispatching process includes: after the data storage layer 430 receives the source data plus the scheduling label, or the pre-processed source data plus the scheduling label, it loads 4301 the data, in the data format/structure defined/required by the data processing framework itself, into storage for the specific data processing framework (this can be implemented, for example, through the loading step of an ETL process, a DBMS 4302 (Database Management System) and a DFS 4303 (Depth-First-Search, a depth-first search algorithm)). The predefined format data in the data loading 4301 process should be consistent with the data records in the source data packets, and the data format should be defined by the user according to business requirements and the stored data, for example a relational database or a distributed file system. The DBMS can meet the relational data storage requirements of the database of the big data processing framework; MySQL (a relational database management system) or MS SQL (a database platform) may be chosen. The DFS can also meet the file data storage requirements of the database of the big data processing framework; HDFS (a distributed file system) may be chosen. The data storage layer 430 sends the corresponding data packets or files to the data processing layer 440, and the data processing layer 440 combines various data processing frameworks or technologies according to user requirements; optionally, it may include a batch data processing framework 4401, a stream data processing framework 4402 and/or a graph data processing framework 4403. The user application 450 is a business-logic/algorithm-oriented big data processing application built on the data processing layer. Optionally, the predefined format data may be defined separately for different big data processing frameworks and other data processing procedures.

It should be noted that after the source data is dispatched to the corresponding target data processing framework, that is, before it enters the data storage layer 430, the scheduling label is deleted so as not to affect the loading and processing of subsequent data.

The embodiment of the present invention proposes data scheduling based on function and processing-time requirements, which properly solves the data dispatching problem in complex big data processing schemes. In the data scheduling, a scheduling label is attached to the source data packet according to a predetermined requirement, so the data can be correctly dispatched to the target big data processing framework; this method can meet most data-processing-related requirements. In addition, to ensure that the data is loaded successfully, the predefined-format data stored after pre-processing the source data must be consistent with the data records defined by the data loading process in the data storage layer. The data scheduling method described in the embodiments of the present invention can be widely applied to industrial data analysis/processing, especially to complex big data processing processes such as anomaly detection.
FIG. 5 is a schematic structural diagram of a system for data processing in a complex big data processing scheme according to an embodiment of the present invention. The system structure is mainly divided into six parts: the source data 510, the data scheduling 520 responsible for data dispatching, the data cache 530 according to the embodiment of the present invention, the data storage layer 540, the data processing layer 550 and the user application 560.

It should be noted that the source data 510 may be transmitted directly to the data scheduling 520, or may be pre-processed and then transmitted to the data scheduling 520. The data scheduling 520 adds a scheduling label to the source data 510 or the pre-processed source data 510 according to a predetermined requirement, dispatches the source data 510 or the pre-processed source data 510 to the virtual cache queue set of the corresponding target data processing framework according to the scheduling label, and deletes the scheduling label; the format of the pre-processed source data should be the same as the format predefined in the data recording process performed by the data storage layer 540. The specific dispatching process includes: based on a distributed message queue mechanism (DMQ) 5301, the data cache 530 dispatches the received data to the virtual cache queue set of the corresponding target processing framework according to the scheduling label, and the virtual cache queue set of each target processing framework writes the dispatched data to the virtual cache queues in that set based on a horizontal expansion mechanism (that is, if the data stored in a virtual cache queue of a target processing framework's virtual cache queue set reaches the threshold, another virtual cache queue is added to the set and the received data is written to that other virtual cache queue), for example: dispatching to the virtual cache queue set VQS1 of the batch data processing framework, the virtual cache queue set VQS2 of the stream data processing framework and/or the virtual cache queue set VQS3 of the graph data processing framework, where VQS1, VQS2 and VQS3 are each composed of at least one virtual cache queue (VQ). The data storage layer 540 loads 5401, through an interface, the virtual cache queues in the virtual cache queue set of each target processing framework and reads the virtual cache queues in each virtual cache queue set in sequence (this can be implemented, for example, through the loading step of an ETL process, a DBMS 5402 (Database Management System) and a DFS 5403 (Depth-First-Search, a depth-first search algorithm)). The predefined format data in the data loading 5401 process should be consistent with the data records in the source data packets, and the data format should be defined by the user according to business requirements and the stored data, for example a relational database or a distributed file system. The DBMS can meet the relational data storage requirements of the database of the big data processing framework; MySQL (a relational database management system) or MS SQL (a database platform) may be chosen. The DFS can also meet the file data storage requirements of the database of the big data processing framework; HDFS (a distributed file system) may be chosen. The data storage layer 540 sends the corresponding data packets or files to the data processing layer 550, and the data processing layer 550 combines various data processing frameworks or technologies according to the requirements of the user; optionally, it may include a batch data processing framework 5501, a stream data processing framework 5502 and/or a graph data processing framework 5503. The user application 560 is a business-logic/algorithm-oriented big data processing application built on the data processing layer. The block structure of VQS1, VQS2 and VQS3 in FIG. 5 is merely an example of a data structure and does not limit the scope of protection of the present invention.

It should be noted that the loading component of the data storage layer 540 only needs to be provided with the identity information of the target processing framework in order to sequentially read the virtual cache queues of the virtual cache queue set of the corresponding target processing framework in the data cache 530. After the source data is dispatched to the virtual cache queue set corresponding to the target processing framework, that is, before it enters the data cache 530, the scheduling label is deleted to avoid affecting the caching, storage and processing of subsequent data.

The embodiment of the present invention proposes data scheduling based on function and processing-time requirements, which properly solves the data dispatching problem in complex big data processing schemes. In the data scheduling, a scheduling label is attached to the source data packet according to a predetermined requirement, so the data can be correctly dispatched to the target big data processing framework; this method can meet most data-processing-related requirements. In addition, to ensure that the data is loaded successfully, the predefined-format data stored after pre-processing the source data must be consistent with the data records defined by the data loading process in the data storage layer. The data scheduling method described in the embodiments of the present invention can be widely applied to industrial data analysis/processing, especially to complex big data processing processes such as anomaly detection. At the same time, the data cache based on the horizontal expansion mechanism implements horizontally expanding caching of data through the virtual cache queues, without needing to transfer cached data, thereby avoiding data loss. In the case of multiple target processing architectures, each target processing architecture has its own corresponding, independent virtual cache queue set, which is maintained uniformly as a system through the DMQ; at the same time, the scheduling label allows data to be correctly dispatched to the virtual cache queue set of the target big data processing framework, and multiple virtual cache queue sets can cache data simultaneously. This method can meet most data-processing-related requirements, thereby greatly improving the efficiency of data caching, loading and processing. The data scheduling method described in the embodiments of the present invention can be widely applied to industrial data analysis/processing, especially to complex big data processing processes such as anomaly detection.
FIG. 6 is a structural block diagram of a data scheduling device according to an embodiment of the present invention. The device may be arranged in the server or may be used separately, independently of the server. The device includes a label adding module 610 and a dispatching module 620:

The label adding module 610 is configured to add a scheduling label to source data according to a predetermined requirement, the scheduling label being used to identify the function of the source data and the processing duration requirement.

The scheduling label may include a function requirement field and a processing duration requirement field, for example: scheduling label 1 is F:batch (function: batch), T:Millisecond (duration: milliseconds); scheduling label 2 is F:real-time (function: real-time), T:NA (duration: not applicable); scheduling label 3 is F:incremental (function: incremental), T:NA (duration: not applicable); scheduling label 4 is F:batch (function: real-time), T:NA (duration: not applicable). Here, "F" denotes the function requirement field, including but not limited to batch, real-time or incremental; it should be noted that the function requirement field is mandatory and NA cannot be used. "T" denotes the processing duration requirement field, including but not limited to milliseconds or days; it should be noted that the processing duration requirement field is optional, and NA can be used if there is no specific processing duration requirement.

The function of the source data and the processing duration requirement are determined according to the requirements of the user. The user can formulate predetermined rules according to different requirements to determine the function and processing duration of the source data; for example, the user can determine different functions and processing durations of the source data according to different service types/application types: for offline data the function is determined as batch and the duration as NA; for interactive data the function is real-time and the duration is milliseconds; for applications with high return-time requirements the function is real-time and the duration is milliseconds; for applications with lower return-time requirements the function is batch and the duration is NA; and for the remaining applications the function is incremental and the duration is days.

The dispatching module 620 is configured to dispatch the source data to a target data cache or a corresponding target data processing framework according to the scheduling label.

It should be noted that, specifically, according to the function requirement field and the processing duration requirement field, the source data is dispatched to the batch data processing, stream data processing or graph data processing framework for data processing, or to the data cache corresponding to the batch data processing, stream data processing or graph data processing framework. The scheduling action may be performed based on the function and processing duration requirements in the scheduling label, that is, data is dispatched according to the mandatory field "F" and the optional field "T". For example: for scheduling label 1 (F:batch, T:Millisecond), the source data is dispatched to FW(3) or the third data cache corresponding to FW(3); for scheduling label 2 (F:real-time), the source data is dispatched to FW(2) (data processing framework 2) or the second data cache corresponding to FW(2); for scheduling label 3 (F:incremental), the source data is dispatched to FW(2); for scheduling label 4 (F:batch (function: real-time), T:NA), the source data is dispatched to FW(1) (data processing framework 1) or the first data cache corresponding to FW(1). Here FW(1), FW(2) and FW(3) denote different data processing frameworks; optionally, FW(1) may denote a graph data processing framework, FW(2) a stream data processing framework and FW(3) a batch data processing framework.
FIG. 7 is a structural block diagram of a data scheduling device according to another embodiment of the present invention. The device may be arranged in the server or may be used separately, independently of the server. The device includes a label adding module 710, a dispatching module 720, a pre-processing module 730 and a label deletion module 740.

The label adding module 710 is configured to add a scheduling label to source data according to a predetermined requirement, the scheduling label being used to identify the function of the source data and the processing duration requirement.

The dispatching module 720 is configured to dispatch the source data to a target data cache or a corresponding target data processing framework according to the scheduling label.

The pre-processing module 730 is configured to pre-process the source data; specifically, it is configured to extract a predetermined field from the source data and store the predetermined field as predefined-format data. Optionally, before the scheduling label is added to the source data according to the predetermined requirement, the source data is pre-processed. Pre-processing generally means data extraction and transformation (e.g., adjustment through an ETL (Extract-Transform-Load) process) so that the data has a predetermined format (such as a .rar data packet) or structure (a data schema, which generally describes the logic and characteristics of the data in a database on the basis of a certain data model). The pre-processing in the embodiment of the present invention may include: extracting a predetermined field from the source data and storing the predetermined field as predefined-format data. The extraction rule for the predetermined field is generally determined according to the requirements of the user, such as extracting business data or interaction data; the predefined format is generally the same as the data format/structure used by the data loading process, so as to ensure the loading speed and smoothness of the data.

The label deletion module 740 is configured to delete the scheduling label after the source data has been dispatched to the corresponding target data processing framework. That is, the scheduling label is only used in the data scheduling process, and after data scheduling is completed, the scheduling label is deleted. Taking the schematic structural diagram of data processing in a complex big data processing scheme shown in FIG. 4 as an example: after the source data is dispatched to the corresponding target data processing framework, that is, before it enters the data storage layer 430 or the data cache 530, the scheduling label is deleted so as not to affect the loading and processing of subsequent data.
FIG. 8 is a structural block diagram of a data scheduling device according to yet another embodiment of the present invention. The device may be arranged in the server or may be used separately, independently of the server. The device includes a label adding module 810, a dispatching module 820, a write module 830, a queue adding module 840, a read module 850 and a queue deletion module 860.

The label adding module 810 is configured to add a scheduling label to source data according to a predetermined requirement, the scheduling label being used to identify the function of the source data and the processing duration requirement.

The dispatching module 820 is configured to dispatch the source data to a target data cache according to the scheduling label.

The write module 830 is configured to write the dispatched source data to a first virtual cache queue in the target data cache.

The queue adding module 840 is configured to, if the data stored in the first virtual cache queue reaches a threshold, add a second virtual cache queue and write the dispatched source data into the second virtual cache queue.

The read module 850 is configured to provide a read-data operation for the corresponding target data processing framework after data has been written into the first virtual cache queue or the second virtual cache queue.

Specifically, once a cache queue has been added, the received data can be written into it, and once data has been written into a cache queue, the corresponding target data processing framework can perform the read-data operation. The order of write operations is consistent with the order of read-data operations, that is, the virtual cache queue into which cached data was written first is read by the data storage layer first, and a virtual cache queue into which cached data was written later is read by the data storage layer only after the cached data in the preceding virtual cache queue has been completely read; in other words, the first virtual cache queue, which cached data first, is read by the data storage layer first, and the second virtual cache queue, which cached data later, is read by the data storage layer after the cached data in the first virtual cache queue has been completely read.

The queue deletion module 860 is configured to delete the first virtual cache queue or the second virtual cache queue when the first virtual cache queue or the second virtual cache queue is empty.

The embodiment of the present invention is not limited to the case of two virtual cache queues; the number of virtual cache queues can be increased arbitrarily within the range allowed by the specific portion of the storage space that has been set aside, that is, if the data stored in the virtual cache queue currently being written reaches the threshold, another virtual cache queue is added and the received data is written to that other virtual cache queue. In addition, when a virtual cache queue is empty, that virtual cache queue is deleted so as to free storage space for newly added virtual cache queues.

The data scheduling method and device described in the embodiments of the present invention can schedule data to different data processing architectures for data processing according to the function of the data and the processing duration requirement, ensuring that the data can be correctly dispatched to the target big data processing framework, thereby improving the big data processing rate. At the same time, storing the data in a predefined format before the scheduling label is added ensures that the data is loaded successfully, thereby facilitating subsequent data recording and data processing. In addition, deleting the added scheduling label after data scheduling is completed avoids any impact on subsequent data loading and data processing. At the same time, the data cache based on the horizontal expansion mechanism implements horizontally expanding caching of data through the virtual cache queues, without needing to transfer cached data, thereby avoiding data loss. In the case of multiple target processing architectures, each target processing architecture has its own corresponding, independent virtual cache queue set, which is maintained uniformly as a system through the DMQ; at the same time, the scheduling label allows data to be correctly dispatched to the virtual cache queue set of the target big data processing framework, and multiple virtual cache queue sets can cache data simultaneously. This method can meet most data-processing-related requirements, thereby greatly improving the efficiency of data caching, loading and processing. The data scheduling method described in the embodiments of the present invention can be widely applied to industrial data analysis/processing, especially to complex big data processing processes such as anomaly detection.

Those skilled in the art should understand that various modifications and changes may be made to the embodiments disclosed above without departing from the essence of the invention. Therefore, the scope of protection of the present invention should be defined by the appended claims.

Claims (17)

  1. A data scheduling method, the method comprising:
    adding a scheduling label to source data according to a predetermined requirement, the scheduling label being used to identify a function of the source data and a processing duration requirement;
    dispatching the source data to a corresponding target data cache or a corresponding target data processing framework according to the scheduling label.
  2. The data scheduling method according to claim 1, further comprising, before the adding of the scheduling label to the source data according to the predetermined requirement, pre-processing the source data, the pre-processing comprising:
    extracting a predetermined field from the source data and storing the predetermined field as predefined-format data.
  3. The data scheduling method according to claim 1, wherein the scheduling label comprises:
    a function requirement field and a processing duration requirement field.
  4. The data scheduling method according to claim 3, wherein the step of dispatching the source data to the corresponding target data cache according to the scheduling label comprises:
    dispatching the source data, according to the function requirement field and the processing duration requirement field, to the target data cache corresponding to a batch data processing, stream data processing or graph data processing framework, or to the corresponding target data processing framework.
  5. The data scheduling method according to any one of claims 1-4, further comprising:
    deleting the scheduling label after the source data has been dispatched to the corresponding target data cache or the corresponding target data processing framework.
  6. The data scheduling method according to claim 1, further comprising:
    writing the dispatched source data to a first virtual cache queue in the target data cache;
    if the data stored in the first virtual cache queue reaches a threshold, adding a second virtual cache queue and writing the dispatched source data into the second virtual cache queue.
  7. The data scheduling method according to claim 6, wherein the storage space of the first virtual cache queue and of the second cache queue is determined according to the data type, the data storage space or the user requirement.
  8. The data scheduling method according to claim 6, further comprising:
    providing a read-data operation for the corresponding target data processing framework after data has been written into the first virtual cache queue or the second virtual cache queue.
  9. The data scheduling method according to claim 6, further comprising:
    deleting the first virtual cache queue or the second virtual cache queue when the first virtual cache queue or the second virtual cache queue is empty.
  10. A data scheduling device, the device comprising:
    a label adding module, configured to add a scheduling label to source data according to a predetermined requirement, the scheduling label being used to identify a function of the source data and a processing duration requirement;
    a dispatching module, configured to dispatch the source data to a target data cache or a corresponding target data processing framework according to the scheduling label.
  11. The data scheduling device according to claim 10, further comprising a pre-processing module for pre-processing the source data;
    the pre-processing module being specifically configured to extract a predetermined field from the source data and store the predetermined field as predefined-format data.
  12. The data scheduling device according to claim 10, wherein, in the label adding module, the scheduling label comprises: a function requirement field and a processing duration requirement field.
  13. The data scheduling device according to claim 10, wherein the dispatching module is specifically configured to dispatch the source data, according to the function requirement field and the processing duration requirement field, to the target data cache corresponding to a batch data processing, stream data processing or graph data processing framework, or to the corresponding target data processing framework.
  14. The data scheduling device according to any one of claims 10-13, further comprising:
    a label deletion module, configured to delete the scheduling label after the source data has been dispatched to the corresponding target data cache or the corresponding target data processing framework.
  15. The data scheduling device according to claim 10, further comprising:
    a write module, configured to write the dispatched source data to a first virtual cache queue in the target data cache;
    a queue adding module, configured to, if the data stored in the first virtual cache queue reaches a threshold, add a second virtual cache queue and write the dispatched source data into the second virtual cache queue.
  16. The data scheduling device according to claim 15, further comprising:
    a read module, configured to provide a read-data operation for the corresponding target data processing framework after data has been written into the first virtual cache queue or the second virtual cache queue.
  17. The data scheduling device according to claim 15, further comprising:
    a queue deletion module, configured to delete the first virtual cache queue or the second virtual cache queue when the first virtual cache queue or the second virtual cache queue is empty.
PCT/CN2017/081688 2016-04-25 2017-04-24 Data scheduling method and device WO2017186082A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610262798.6A CN107305580A (zh) 2016-04-25 2016-04-25 Data scheduling method and device
CN201610262798.6 2016-04-25

Publications (1)

Publication Number Publication Date
WO2017186082A1 true WO2017186082A1 (zh) 2017-11-02

Family

ID=60150577

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/081688 WO2017186082A1 (zh) 2016-04-25 2017-04-24 Data scheduling method and device

Country Status (2)

Country Link
CN (1) CN107305580A (zh)
WO (1) WO2017186082A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101452519A (zh) * 2008-12-10 2009-06-10 华中科技大学 一种用于射频识别中间件的数据调度方法
CN101465792A (zh) * 2007-12-18 2009-06-24 北京北方微电子基地设备工艺研究中心有限责任公司 一种数据调度方法及装置
EP2113848A1 (en) * 2008-04-30 2009-11-04 Siemens Energy & Automation, Inc. Adaptive caching for high volume extract transform load process
CN103838682A (zh) * 2014-03-10 2014-06-04 华为技术有限公司 一种文件目录的读取方法和设备
CN104994171A (zh) * 2015-07-15 2015-10-21 上海斐讯数据通信技术有限公司 一种分布式存储方法与系统
CN105306345A (zh) * 2015-10-08 2016-02-03 南京南瑞继保电气有限公司 基于jms消息的电力调度实时数据发布系统及方法

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102577237B (zh) * 2010-12-20 2014-04-02 华为技术有限公司 网站托管服务调度方法、应用访问处理方法、装置及系统
US8855436B2 (en) * 2011-10-20 2014-10-07 Xerox Corporation System for and method of selective video frame compression and decompression for efficient event-driven searching in large databases
CN102799555B (zh) * 2012-07-24 2014-03-12 中国电力科学研究院 电力信息系统中可配置数据交互工具的设计方法及其系统
CN104331327B (zh) * 2014-12-02 2017-07-11 山东乾云启创信息科技股份有限公司 大规模虚拟化环境中任务调度的优化方法及优化系统

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101465792A (zh) * 2007-12-18 2009-06-24 北京北方微电子基地设备工艺研究中心有限责任公司 一种数据调度方法及装置
EP2113848A1 (en) * 2008-04-30 2009-11-04 Siemens Energy & Automation, Inc. Adaptive caching for high volume extract transform load process
CN101452519A (zh) * 2008-12-10 2009-06-10 华中科技大学 一种用于射频识别中间件的数据调度方法
CN103838682A (zh) * 2014-03-10 2014-06-04 华为技术有限公司 一种文件目录的读取方法和设备
CN104994171A (zh) * 2015-07-15 2015-10-21 上海斐讯数据通信技术有限公司 一种分布式存储方法与系统
CN105306345A (zh) * 2015-10-08 2016-02-03 南京南瑞继保电气有限公司 基于jms消息的电力调度实时数据发布系统及方法

Also Published As

Publication number Publication date
CN107305580A (zh) 2017-10-31

Similar Documents

Publication Publication Date Title
CN104881466A (zh) 数据分片的处理以及垃圾文件的删除方法和装置
CN105247478B (zh) 用于存储命令的方法及相关装置
US9817754B2 (en) Flash memory management
CN109983459B (zh) 用于标识语料库中出现的n-gram的计数的方法和设备
CN102402401A (zh) 一种磁盘io请求队列调度的方法
WO2023040399A1 (zh) 一种业务持久化方法及装置
US10795606B2 (en) Buffer-based update of state data
CN116841623A (zh) 访存指令的调度方法、装置、电子设备和存储介质
CN106201918B (zh) 一种基于大数据量和大规模缓存快速释放的方法和系统
CN111625507A (zh) 一种文件处理方法及装置
CN104115127B (zh) 存储系统和数据管理方法
CN107391672A (zh) 数据的读写方法及消息化的分布式文件系统
CN113407343A (zh) 一种基于资源分配的业务处理方法、装置及设备
US12093732B2 (en) Fast shutdown of large scale-up processes
WO2017186082A1 (zh) 一种数据调度的方法及装置
CN103607451A (zh) 支持并发的客户端与服务器端的文档操作同步方法
CN112035428A (zh) 分布式存储系统、方法、装置、电子设备和存储介质
CN116841624A (zh) 访存指令的调度方法、装置、电子设备和存储介质
CN114385891B (zh) 数据搜索方法、装置、电子设备及存储介质
CN108334457B (zh) 一种io处理方法及装置
CN115525226A (zh) 硬件批量指纹计算方法、装置及设备
CN108132970A (zh) 基于云计算的大数据分布式处理方法及系统
CN104182490B (zh) 一种管理数据访问的方法及装置
CN103891272A (zh) 用于视频分析和编码的多个流处理
US20150100607A1 (en) Apparatus and method for data management

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17788734

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 17788734

Country of ref document: EP

Kind code of ref document: A1