WO2023155176A1 - Etl system construction method and apparatus, data processing method and apparatus, and etl system - Google Patents

ETL system construction method and apparatus, data processing method and apparatus, and ETL system

Info

Publication number
WO2023155176A1
WO2023155176A1 PCT/CN2022/076973 CN2022076973W WO2023155176A1 WO 2023155176 A1 WO2023155176 A1 WO 2023155176A1 CN 2022076973 W CN2022076973 W CN 2022076973W WO 2023155176 A1 WO2023155176 A1 WO 2023155176A1
Authority
WO
WIPO (PCT)
Prior art keywords
server
data
servers
database
scheduling
Prior art date
Application number
PCT/CN2022/076973
Other languages
French (fr)
Chinese (zh)
Inventor
王建宙
段季芳
王瑜
袁菲
汤玥
沈国梁
王萍
李园园
吴建波
何德材
吴建民
王洪
Original Assignee
京东方科技集团股份有限公司
北京中祥英科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 京东方科技集团股份有限公司, 北京中祥英科技有限公司 filed Critical 京东方科技集团股份有限公司
Priority to CN202280000222.6A priority Critical patent/CN116917884A/en
Priority to PCT/CN2022/076973 priority patent/WO2023155176A1/en
Publication of WO2023155176A1 publication Critical patent/WO2023155176A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/445 Program loading or initiating

Definitions

  • the present disclosure relates to the technical field of data processing, and in particular to a construction method and device of an ETL system, a data processing method and device, and an ETL system.
  • the information system constructed by the production line can integrate the product data uploaded by the equipment of the production line and the product data uploaded manually for analysis, so that developers can judge the cause of product failure based on the analysis results.
  • ETL extract-transform-load
  • a method for constructing an ETL system is provided, where the ETL system includes multiple servers and multiple databases.
  • the method includes: configuring the multiple servers so that the configured servers have ETL functions and each of the multiple servers has a corresponding identification; configuring a scan IP address for the multiple servers so that the multiple servers can access any database among the multiple databases through the scan IP address, the multiple databases having high availability; and setting a virtual IP address for each server and setting the working mode of the multiple servers.
  • the virtual IP address is used to establish a communication connection with multiple servers.
  • the working mode of multiple servers includes active-active mode and active-standby mode.
  • the multiple servers can have the ETL function.
  • because the multiple servers can access any database among the multiple databases through the scan IP address, when one database fails, the other databases can continue to work normally, ensuring high availability among the databases.
  • the multiple databases having high availability includes: when the first database fails to work, the second database continues to work, and the first database and the second database are different databases among the multiple databases.
  • in the active-active mode, the multiple servers are all in the working state; when a first server cannot process the data, a second server continues to process the data, where the first server and the second server are different servers among the multiple servers.
  • in the active-standby mode, the multiple servers include a primary server and a standby server; the primary server is used to process data, and the standby server is used to continue processing the data when the primary server cannot process data.
  • a device for constructing an ETL system including: a configuration unit and a processing unit.
  • the configuration unit is configured to: configure multiple servers so that the configured servers have ETL functions and each of the multiple servers has a corresponding identifier; and configure a scan IP address for the multiple servers so that the multiple servers can access any one of multiple databases through the scan IP address, the multiple databases being highly available.
  • the processing unit is configured to: set, through a setting tool, a virtual IP address for the multiple servers and the working mode of the multiple servers, where the virtual IP address is used to establish a communication connection with the multiple servers, and the working mode of the multiple servers includes an active-active mode and an active-standby mode.
  • the multiple databases having high availability includes: when the first database fails to work, the second database continues to work, and the first database and the second database are different databases among the multiple databases.
  • in the active-active mode, the multiple servers are all in the working state; when a first server cannot process the data, a second server continues to process the data, where the first server and the second server are different servers among the multiple servers.
  • in the active-standby mode, the multiple servers include a primary server and a standby server; the primary server is used to process data, and the standby server is used to continue processing the data when the primary server cannot process data.
  • a data processing method is provided, which is applied to an ETL system.
  • the ETL system includes a plurality of servers, and the plurality of servers are equipped with ETL tools, and the ETL tools are used to process data.
  • the method includes: obtaining first data to be processed in a first scheduling period; and, in response to a first setting operation, determining the working modes of the multiple servers in the first scheduling period, where the working modes include an active-active mode and an active-standby mode.
  • in the active-active mode, the multiple servers are all in the working state; in the active-standby mode, the multiple servers include a primary server and a standby server, the primary server is in the working state, and the standby server is in a dormant state; when the primary server cannot process data, the standby server changes from the dormant state to the working state. The method further includes: sending first indication information to a first server, where the first indication information is used to instruct the first server to process the first data to be processed, and the first server is a server in the working state within the first scheduling period.
  • the ETL system also includes a plurality of databases; the plurality of servers are each communicatively connected to the plurality of databases, the plurality of databases are communicatively connected to one another, and the plurality of databases are used to store one or more of configuration information, data processing processes, and scheduling tasks of the servers.
  • the method also includes: detecting the working status of the multiple databases; and, when it is detected that a first database cannot work, storing the data stored in the first database to a second database, where the first database and the second database are different databases among the multiple databases.
  • the method further includes: receiving source data from multiple systems in the first scheduling period; and combining the multiple source data to obtain the first data to be processed, and storing the first data to be processed in a preset storage area of the first database.
  • the first database includes multiple storage areas, and the multiple storage areas are used to store data to be processed in different scheduling periods.
  • the above "obtaining the first data to be processed in the first scheduling period" may specifically include: extracting the first data to be processed from a preset storage area of the Hive database.
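The per-period storage areas described above can be pictured with a small in-memory stand-in. This is only a sketch under stated assumptions: the class name `StagingStore`, its methods, and the period labels are hypothetical, and the source does not say how the storage areas are actually laid out in the Hive database.

```python
from collections import defaultdict


class StagingStore:
    """Hypothetical stand-in for a database with one storage area per scheduling period."""

    def __init__(self):
        # Maps a scheduling-period label to the records staged for that period.
        self._areas = defaultdict(list)

    def store(self, period, *source_batches):
        """Combine the batches received from several source systems and stage them."""
        merged = [record for batch in source_batches for record in batch]
        self._areas[period].extend(merged)
        return len(merged)

    def extract(self, period):
        """Return a copy of the data to be processed for one scheduling period."""
        return list(self._areas.get(period, []))
```

Keeping one area per period means that extracting the data for one scheduling period never touches data staged for another.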
  • the method further includes: detecting the processing result of the first server processing the first data to be processed; when the processing result is failure and the first server is in the working state, sending second indication information to the first server, where the second indication information is used to instruct the first server to continue processing the first data to be processed; and, when the processing result is failure and the first server cannot process the data, sending third indication information to a second server, where the third indication information is used to instruct the second server to continue processing the first data to be processed, and the second server is a server in the working state among the plurality of servers.
  • when the processing result is successful, the method further includes: acquiring second data to be processed in a second scheduling period; and sending, to the first server, fourth indication information used to instruct the first server to process the second data to be processed.
  • before sending the fourth indication information to the first server, the method further includes: in response to a second setting operation, determining the working modes of the plurality of servers in the second scheduling period.
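The choice between retrying on the first server and handing over to a second server can be sketched as a small decision function. This is a minimal illustration, not the patented implementation: the `Server` class, the function name `next_instruction`, and the returned labels are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    working: bool = True  # False means the server cannot process data


def next_instruction(result_ok, first, others):
    """Decide which indication to send after the first server reports a result.

    Returns a (kind, target_name) tuple; the kind labels are illustrative only.
    """
    if result_ok:
        return ("done", first.name)
    if first.working:
        # Failure, but the first server is still working: ask it to continue.
        return ("second_indication", first.name)
    # Failure and the first server cannot process data: pick a working server.
    for candidate in others:
        if candidate.working:
            return ("third_indication", candidate.name)
    return ("no_server_available", None)
```

The final branch, where no server is working, is not covered by the text above and is included only so that the function always returns a value.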
  • the method further includes: in response to the third setting operation, determining the duration and scheduling frequency of the scheduling periods of the plurality of servers.
  • a data processing device which is applied to an ETL system.
  • the ETL system includes a plurality of servers, and the plurality of servers are equipped with ETL tools, and the ETL tools are used to process data.
  • the device includes: an acquisition unit configured to obtain first data to be processed in a first scheduling period; and a determining unit configured to determine, in response to a first setting operation, the working modes of the multiple servers in the first scheduling period, where the working modes include an active-active mode and an active-standby mode.
  • in the active-active mode, the multiple servers are all in the working state; in the active-standby mode, the multiple servers include a primary server and a standby server, the primary server is in the working state, and the standby server is in a dormant state; when the primary server cannot process data, the standby server changes from the dormant state to the working state. The device further includes a sending unit configured to send first indication information to a first server, where the first indication information is used to instruct the first server to process the first data to be processed, and the first server is a server in the working state within the first scheduling period.
  • the ETL system also includes a plurality of databases; the plurality of servers are each communicatively connected to the plurality of databases, the plurality of databases are communicatively connected to one another, and the plurality of databases are used to store one or more of configuration information, data processing processes, and scheduling tasks of the servers.
  • the device also includes a detection unit configured to: detect the working status of multiple databases; when it is detected that the first database cannot work, store the data stored in the first database to the second database, and the first database and the second database are different databases among the plurality of databases.
  • the device further includes a receiving unit and a processing unit; the receiving unit is configured to receive source data from multiple systems in the first scheduling period; the processing unit is configured to combine multiple source data, The first data to be processed is obtained, and the first data to be processed is stored in a preset storage area of the first database.
  • the first database includes multiple storage areas, and the multiple storage areas are used to store the data to be processed in different scheduling periods.
  • the obtaining unit is specifically configured to: extract the first data to be processed from the preset storage area of the Hive database.
  • the detection unit is further configured to detect the processing result of the first server processing the first data to be processed; when the processing result is failure and the first server is in the working state, the detection unit controls the sending unit to send second indication information to the first server, where the second indication information is used to instruct the first server to continue processing the first data to be processed; when the processing result is failure and the first server cannot process the data, the detection unit controls the sending unit to send third indication information to a second server, where the third indication information is used to instruct the second server to continue processing the first data to be processed, and the second server is a server in the working state among the plurality of servers.
  • the obtaining unit is further configured to obtain second data to be processed in a second scheduling period; the sending unit is also configured to send, to the first server, fourth indication information used to instruct the first server to process the second data to be processed.
  • before the fourth indication information is sent to the first server, the determining unit is further configured to determine the working modes of the plurality of servers in the second scheduling period in response to a second setting operation.
  • the determining unit is further configured to determine the duration and scheduling frequency of the scheduling periods of the multiple servers in response to the third setting operation.
  • in another aspect, an ETL system is provided. The ETL system includes a setting tool and multiple servers, and the setting tool communicates with the multiple servers through a virtual IP address; the setting tool is used to determine the scheduling tasks and scheduling periods of the multiple servers in response to a first setting operation; the setting tool is also used to determine the working modes of the multiple servers in each scheduling period in response to a second setting operation.
  • the working modes include an active-active mode and an active-standby mode. In the active-active mode, the multiple servers are all in the working state; in the active-standby mode, the multiple servers include a primary server and a standby server, the primary server is in the working state, and the standby server is in a dormant state; when the primary server cannot process data, the standby server changes from the dormant state to the working state; the multiple servers are used to receive scheduling tasks from the scheduler, process the data according to the scheduling tasks, and store the processed data in the HBase database.
  • the ETL system also includes multiple databases; the multiple servers access the multiple databases through a scan IP address, the multiple databases are used to store one or more of configuration information, data processing processes, and scheduling tasks of the multiple servers, and the multiple databases have high availability.
  • the multiple databases having high availability includes: when the first database fails to work, the second database continues to work, and the first database and the second database are different databases among the multiple databases.
  • the setting tool is further configured to adjust the scheduling tasks and/or scheduling periods of the multiple servers in response to the third setting operation.
  • a computer readable storage medium stores computer program instructions.
  • when the computer program instructions run on a computer, the computer executes the ETL system construction method and data processing method as described in any of the above embodiments.
  • a computer readable storage medium stores computer program instructions, and when the computer program instructions run on a computer, the computer executes the data processing method as described in any one of the above embodiments.
  • a computer program product includes computer program instructions, and when the computer program instructions are executed on the computer, the computer program instructions cause the computer to execute the ETL system construction method as described in any of the above embodiments.
  • a computer program product includes computer program instructions, and when the computer program instructions are executed on the computer, the computer program instructions cause the computer to execute the data processing method as described in any of the above embodiments.
  • a computer program is provided.
  • when the computer program is executed on the computer, the computer program causes the computer to execute the ETL system construction method described in any of the above embodiments.
  • a computer program is provided. When the computer program is executed on a computer, the computer program causes the computer to execute the data processing method described in any of the above embodiments.
  • in yet another aspect, a chip is provided. The chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is used to run computer programs or instructions to implement the construction method of the ETL system as described in any of the above embodiments.
  • in yet another aspect, a chip is provided. The chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is used to run computer programs or instructions to implement the data processing method as described in any of the above embodiments.
  • the chip provided in this application further includes a memory for storing computer programs or instructions.
  • all or part of the above computer instructions may be stored on a computer-readable storage medium.
  • the computer-readable storage medium may be packaged together with the processor of the device, or may be packaged separately from the processor of the device, which is not limited in the present application.
  • FIG. 1 is a structural diagram of an ETL system provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of an interface for setting a scheduling cycle provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of another interface for setting the scheduling cycle provided by an embodiment of the present application.
  • FIG. 4 is a schematic interface diagram of an F5 tool provided by an embodiment of the present application.
  • FIG. 5 is a structural diagram of another ETL system provided by an embodiment of the present application.
  • FIG. 6 is a structural diagram of yet another ETL system provided by an embodiment of the present application.
  • FIG. 7 is a schematic flow diagram of a method for constructing an ETL system provided in an embodiment of the present application.
  • FIG. 8 is a schematic flowchart of a data processing method provided in an embodiment of the present application.
  • FIG. 9 is a schematic flowchart of another data processing method provided by the embodiment of the present application.
  • FIG. 10 is a schematic flowchart of another data processing method provided in the embodiment of the present application.
  • FIG. 11 is a schematic flowchart of another data processing method provided in the embodiment of the present application.
  • FIG. 12 is a schematic flowchart of another data processing method provided in the embodiment of the present application.
  • FIG. 13 is a schematic flow diagram of a data processing method provided in an embodiment of the present application.
  • FIG. 14 is a schematic flowchart of a data processing method provided in an embodiment of the present application.
  • FIG. 15 is a schematic flowchart of a data processing method provided in an embodiment of the present application.
  • FIG. 16 is a schematic diagram of a login interface provided by an embodiment of the present application.
  • FIG. 17 is a schematic diagram of a data processing result of a server provided by an embodiment of the present application.
  • FIG. 18 is a schematic diagram of data processing results of another server provided in the embodiment of the present application.
  • FIG. 19 is a schematic diagram of data processing results of another server provided in the embodiment of the present application.
  • FIG. 20 is a schematic diagram of a construction device of an ETL system provided by an embodiment of the present application.
  • FIG. 21 is a schematic diagram of a data processing device provided by an embodiment of the present application.
  • FIG. 22 is a schematic diagram of a communication device provided by an embodiment of the present application.
  • first and second are used for descriptive purposes only, and cannot be understood as indicating or implying relative importance or implicitly specifying the quantity of indicated technical features. Thus, a feature defined as “first” and “second” may explicitly or implicitly include one or more of these features. In the description of the embodiments of the present disclosure, unless otherwise specified, "plurality” means two or more.
  • the expressions “coupled” and “connected” and their derivatives may be used.
  • the term “connected” may be used in describing some embodiments to indicate that two or more elements are in direct physical or electrical contact with each other.
  • the term “coupled” may be used when describing some embodiments to indicate that two or more elements are in direct physical or electrical contact.
  • the terms “coupled” or “communicatively coupled” may also mean that two or more elements are not in direct contact with each other, but yet still co-operate or interact with each other.
  • the embodiments disclosed herein are not necessarily limited by the context herein.
  • “At least one of A, B and C” has the same meaning as “at least one of A, B or C”, and both include the following combinations of A, B and C: A only, B only, C only, a combination of A and B, a combination of A and C, a combination of B and C, and a combination of A, B and C.
  • “A and/or B” includes the following three combinations: A only, B only, and a combination of A and B.
  • the term “if” is optionally interpreted to mean “when” or “at” or “in response to determining” or “in response to detecting,” depending on the context.
  • the phrases “if it is determined that ...” or “if [the stated condition or event] is detected” are optionally construed to mean “when determining ...” or “in response to determining ...” or “upon detection of [stated condition or event]” or “in response to detection of [stated condition or event]”, depending on the context.
  • ETL: can be used to describe the process of extracting, transforming, and loading data from the source to the destination. ETL can refer to: extracting the required data from different data sources, performing data cleaning and conversion on the extracted data, and then loading the cleaned and converted data into the database.
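The extract, clean and convert, then load sequence can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the record layout (a dictionary with an `id` key), the cleaning rule, and the use of a plain list as the destination database.

```python
def extract(sources):
    """Pull the raw records from every data source into one list."""
    return [record for source in sources for record in source]


def transform(records):
    """Clean and convert: drop records with an empty id, normalize the id."""
    return [
        {**record, "id": record["id"].strip().upper()}
        for record in records
        if record.get("id")
    ]


def load(records, database):
    """Append the cleaned records to the destination and report how many were loaded."""
    database.extend(records)
    return len(records)


def run_etl(sources, database):
    """Run the three stages in order."""
    return load(transform(extract(sources)), database)
```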
  • a node or server can be configured with an ETL tool that can perform ETL functions.
  • the ETL tool may be an application program or a device capable of performing the function, without limitation.
  • a node or a server configured with an ETL tool may be referred to as an ETL platform/system, or may be a part of the ETL platform/system.
  • Database: can be used to store the data required for the operation of the ETL platform, for example, data extracted from different data sources, cleaned data, and converted data. It can also be used to store configuration files related to ETL.
  • the database can be used to store job content submitted by ETL to the server, ETL configuration information, ETL scheduling information, and the like. Specifically, reference may be made to the description of the following embodiments.
  • the database may be a relational database management system (RDBMS).
  • it may be a MySQL database, an Oracle database, a PostgreSQL database, and the like.
  • High availability: means that a system is specially designed so that when one device fails, other devices can continue to operate, thereby reducing downtime and maintaining the availability of the system.
  • a healthy database among the multiple databases can continue to provide storage services.
  • the system can continue to provide storage services.
  • an embodiment of the present application provides an ETL system, which may include multiple servers configured with ETL tools, and the multiple servers may share one virtual IP address to achieve high availability of the multiple servers.
  • Each of the multiple servers can be connected to multiple databases, and the multiple databases also have high availability. In this way, through the high availability among multiple servers and multiple databases, the problem that the entire ETL system cannot operate normally when a server or database fails is avoided.
  • FIG. 1 is a schematic structural diagram of an ETL system provided by an embodiment of the present application.
  • the ETL system may include: a plurality of servers (only two servers are shown in the figure, server 01 and server 02 respectively) and a setting tool.
  • the setting tool communicates with a plurality of servers respectively.
  • the setting tool can determine the scheduling tasks, scheduling periods and working modes of multiple servers in response to the setting operation.
  • the setting tool can be a setter, a controller, a control tool, etc.
  • the setting tool can be one device or multiple devices.
  • the setting tool may include a scheduler 10 and a preset tool 20.
  • the scheduler 10 can communicate with multiple servers through virtual IP addresses.
  • the preset tool 20 may be communicatively coupled to a plurality of servers via communication links.
  • the communication link may be a wired communication link or a wireless communication link.
  • the scheduler 10 may be configured to determine scheduling tasks and scheduling periods of multiple servers in response to a setting operation.
  • Scheduling tasks can refer to the data that the server needs to extract, transform and integrate.
  • the scheduling period may refer to the time interval for the server to execute the scheduling task.
  • the scheduler 10 may assign scheduling tasks to multiple servers at preset time intervals.
  • the scheduler 10 may have a corresponding setting interface.
  • the setting interface can determine the scheduling tasks and scheduling periods of multiple servers in response to user operations.
  • FIG. 2 and FIG. 3 show the setting interfaces of the scheduling period provided by the embodiment of the present application.
  • the setting interface shown in Figure 2 can be used to set the duration of the scheduling cycle.
  • the setting interface shown in Figure 3 can be used to set the running frequency in the scheduling period.
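How a period duration and a running frequency translate into concrete run times can be sketched as follows. The function name `run_times` and the assumption that runs are evenly spaced from the start of the period are illustrative only; the source does not specify how the scheduler 10 spaces the runs.

```python
from datetime import datetime, timedelta


def run_times(period_start, period_length, runs_per_period):
    """Return evenly spaced start times for the runs inside one scheduling period."""
    if runs_per_period < 1:
        raise ValueError("runs_per_period must be at least 1")
    step = period_length / runs_per_period  # timedelta divided by int
    return [period_start + i * step for i in range(runs_per_period)]
```

For example, a 24-hour period with a frequency of four runs yields one run every six hours, starting at the beginning of the period.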
  • the preset tool 20 can be used to determine the working modes of multiple servers in each scheduling period in response to the setting operation.
  • the preset tool 20 may be a hardware device with a settings page.
  • the setting page can determine the working modes of multiple servers in each scheduling period in response to the user's setting operation. For example, you can set the working mode of multiple servers in the area corresponding to the resource.
  • the node name, address, and server port can respond to the setting operation to determine the server in the working state and the port to receive data.
  • multiple servers may be configured and the statuses of the multiple servers may be set as working status.
  • information of a new server can be added.
  • the information of the server may be deleted.
  • the F5 tool is used as an example for description, and of course, other tools may also be used to replace the functions of the F5 tool, without limitation.
  • the working modes of the multiple servers may include active-active mode (Active-Active) and active-standby mode (Active-Standby).
  • multiple servers can execute the scheduling tasks of the ETL system at the same time.
  • when one of the servers cannot run normally, the setting tool can assign the task run by that server to another server that is running normally, so that the server that is running normally can continue to process the task. Therefore, the normal operation of the ETL system is guaranteed while the computing resources of the multiple servers are fully used.
  • for example, when the server 01 and the server 02 are in the active-active mode, both servers are in the working state: the server 01 executes the scheduling task 1, and the server 02 executes the scheduling task 2.
  • when the server 01 cannot execute the scheduling task 1, the setting tool can assign the scheduling task 1 to the server 02, so that the server 02 can continue to execute the scheduling task 1.
  • in the active-standby mode, the multiple servers include a primary server and a standby server.
  • the master server is used to execute the scheduling tasks of the ETL system.
  • the standby server is used to continue to execute the scheduled task when the primary server cannot execute the scheduled task.
  • for example, the server 01 is used as the primary server and the server 02 is used as the standby server.
  • instruction information may be sent to the server 02, and the instruction information may be used to instruct the server 02 to continue executing the scheduled task 1.
  • the indication information may include identification information of the scheduled task 1 or data corresponding to the scheduled task 1.
  • the server 02 may acquire the data corresponding to the scheduling task 1 according to the identification information.
  • the server 02 may directly process the data.
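The two working modes and the hand-over of tasks on failure can be sketched together. This is a simplified model under stated assumptions: the `Node` class, the state labels, and the rule that the first listed server is the primary are hypothetical and are not taken from the F5 tool or from the source.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    state: str = "working"  # "working", "dormant", or "failed"
    tasks: list = field(default_factory=list)


def make_cluster(mode, names):
    """Create nodes; in active-standby mode only the first node starts out working."""
    if mode not in ("active-active", "active-standby"):
        raise ValueError(f"unknown working mode: {mode}")
    nodes = [Node(name) for name in names]
    if mode == "active-standby":
        for node in nodes[1:]:
            node.state = "dormant"
    return nodes


def fail_over(nodes, failed_name):
    """Mark a node as failed and move its tasks to a surviving node."""
    failed = next(node for node in nodes if node.name == failed_name)
    failed.state = "failed"
    # Prefer a node that is already working; otherwise wake a dormant standby.
    target = next((n for n in nodes if n.state == "working"), None)
    if target is None:
        target = next((n for n in nodes if n.state == "dormant"), None)
    if target is None:
        raise RuntimeError("no server is available to take over")
    target.state = "working"
    target.tasks.extend(failed.tasks)
    failed.tasks = []
    return target
```

In active-active mode the surviving server picks up the failed server's task in addition to its own; in active-standby mode the standby is first switched from dormant to working.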
  • the plurality of servers are all configured with ETL tools. All the multiple servers can execute the scheduling task.
  • the setting page of the preset tool may be as shown in FIG. 4 .
  • the setting page can respond to the developer's setting operations to determine the working modes of multiple servers and the scheduled tasks to be executed.
  • the ETL system may further include multiple databases (only database 21 and database 22 are shown in the figure).
  • the multiple databases are communicatively connected to multiple servers respectively.
  • the plurality of databases can be used to store information such as job content, configuration information, and scheduling plan of the server.
  • the stored information can be synchronized between the multiple databases.
  • the job content of the server may include the above-mentioned scheduling task and the data processed by the scheduling task.
  • the scheduling plan may include the duration of the scheduling period, the scheduling frequency of the scheduling period, and the like.
  • the multiple databases may have the same scan IP address, and the multiple servers may access any one of the multiple databases through the scan IP address.
  • both the virtual IP address and the scan IP address are used to access a device.
  • the setting tool can access any one of the multiple servers through the virtual IP address.
  • the server can access any database among the multiple databases through the scan IP address.
  • the plurality of servers may also be configured with respective corresponding IP addresses.
  • the multiple databases may also be configured with respective corresponding IP addresses.
  • taking the database 21 and the database 22 in the figure as an example, when a hardware or software failure occurs in the database 21, the database 22 can continue to provide storage services; or, when a hardware or software failure occurs in the database 22, the database 21 can continue to provide storage services.
  • the ETL system provided in the embodiment of the present application may also communicate with a distributed computing (Hadoop) system.
  • the Hadoop system may include multiple databases, such as a Hive database and a distributed (Hbase) database.
  • the Hive database communicates with multiple source systems and ETL systems respectively.
  • the source system can be used to provide raw data, for example, it can provide product history information and bad CODE information.
  • the multiple source systems may include a yield management system (YMS) and a management data warehouse (MDW) system.
  • the YMS system can be used to provide product history information.
  • the MDW system can be used to provide code of error (CODE) information for detected products.
  • the history information of the product may refer to the basic information of the product directly uploaded by the factory equipment to the YMS system.
  • the history information of a product can include one or more of: factory (FACTORY), product lot number (LOT_ID), product (such as glass) identification (GLS_ID), event time key (EVENT_TIMEKEY), product type (PRODUCT_TYPE), former process site (OLD_OPER_CODE), product model (PRODUCT_ID), equipment identification (EQP_ID), unit identification (UNIT_ID), subunit identification (SUB_UNIT_ID), and previous process investment time (LAST_PROCESS_IN_TIME).
  • the history information of a product can be stored in the YMS system in the form of an Oracle table, or in the form of an array, without limitation.
  • the history information of the product may be shown in Table 1.
  • Table 1
      Table field             Description
      FACTORY                 Factory
      LOT_ID                  Lot ID
      GLS_ID                  Glass ID
      EVENT_TIMEKEY           Event time
      PRODUCT_TYPE            Product type
      OLD_OPER_CODE           Former process site
      PRODUCT_ID              Product number
      EQP_ID                  Equipment name
      UNIT_ID                 Unit name
      SUB_UNIT_ID             Subunit name
      LAST_PROCESS_IN_TIME    Previous process investment time
  • table fields in Table 1 are only exemplary, and may also include other fields, such as product size, thickness, and other fields, which are not limited.
  • bad CODE information of the product may describe a problem that occurred during the production process of the product.
  • bad CODE information of a product can include one or more of: factory (FACTORY), site (STEP), product lot number (LOT_ID), product identification (GLS_ID), former process site (PRODUCT_ID), product type (PRODUCT_TYPE), product size (PRODUCT_SIZE), product model (NEW_MODEL), CODE grade (DEFECT_GRADE), bad CODE (DEFECT_CODE), and CODE detection time (TXN_TIME).
  • the bad CODE information of the product can be stored in the MDW system in the form of a table, or can be stored in the MDW system in the form of an array, without limitation.
  • the historical defect CODE information of the product can be shown in Table 2.
  • Table 2
      Table field      Description
      FACTORY          Factory
      STEP             Site
      LOT_ID           Lot ID
      GLS_ID           Glass ID
      PRODUCT_ID       Former process site
      PRODUCT_TYPE     Product type
      PRODUCT_SIZE     Product size
      NEW_MODEL        Product model 2
      DEFECT_GRADE     CODE level
  • table fields in Table 2 are only exemplary, and may also include other table fields, for example, may also include identification information of the detection device, etc., without limitation.
  • the Hive database can store the synchronized Hive history information.
  • the Hive history information may include partition fields and fields extracted from Table 1 and Table 2. This partition field can identify the Hive history information after synchronization.
  • the Hive database may include multiple storage areas, and the synchronized Hive history information may be stored in the corresponding storage areas in Parquet format.
  • Hive history information in different storage areas can have different partition fields.
  • the partition field may be the time field (TIMEDAY) corresponding to the Hive history information.
  • the fields of the synchronized Hive history information can also include one or more of: factory (FACTORY), product lot number (LOT_ID), product identification (GLS_ID), event time key (EVENT_TIMEKEY), product type (PRODUCT_TYPE), former process site (OLD_OPER_CODE), product model (PRODUCT_ID), equipment identification (EQP_ID), unit identification (UNIT_ID), subunit identification (SUB_UNIT_ID), and previous process investment time (LAST_PROCESS_IN_TIME).
  • the synchronized Hive history information may also include other fields, which are not limited.
  • the synchronized Hive history information can be shown in Table 3.
  • the Hbase database can be used to store the data processed by the ETL system.
  • the ETL system can obtain the synchronized Hive history information from the Hive database, process it, and store the processed data in the Hbase database for use by the background system.
  • the process for the ETL system to process the synchronized Hive history information may include:
  • filtering the synchronized Hive history information may refer to filtering according to the site information and time information of the product, and determining information such as the residence time and device type of each product at the same site.
  • the equipment type of the product (represented by the field STEP_ID) is part of the equipment name (EQP_ID). For example, it can be the first five characters of EQP_ID. For example, if the EQP_ID of the product is AAEWS07, the corresponding STEP_ID can be AAEWS.
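  • The prefix rule above can be sketched as follows. This is a minimal illustration only: the function name `step_id_from_eqp_id` and the length check are assumptions made for the example, not part of the described system.

```python
def step_id_from_eqp_id(eqp_id: str, prefix_len: int = 5) -> str:
    """Derive the equipment type (STEP_ID) as the leading characters of EQP_ID.

    The default of five characters follows the example in the text;
    the explicit length check is an assumption added for safety.
    """
    if len(eqp_id) < prefix_len:
        raise ValueError(f"EQP_ID {eqp_id!r} is shorter than {prefix_len} characters")
    return eqp_id[:prefix_len]


if __name__ == "__main__":
    print(step_id_from_eqp_id("AAEWS07"))  # AAEWS
```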
  • the preset form can be set as required.
  • the preset format may be a format that is convenient for the background system to correspond to.
  • the processed data may be stored in the Hbase database in the form of a table, or in other forms, without limitation.
  • the processed data can be as shown in Table 4.
  • the MD5 field is obtained by applying the md5 function to the LOT_ID field and taking the first three characters of the result.
  • SEQ_ID is the spliced value of multiple fields.
  • the plurality of fields may include TRKG, OLD_OPER_CODE, STEP_ID. Among them, TRKG represents history information.
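  • As a hedged sketch of how the two derived fields could be computed: the MD5 prefix uses Python's standard `hashlib`; the underscore separator used to splice SEQ_ID, the UTF-8 encoding of LOT_ID, and both function names are assumptions, since the text does not specify them.

```python
import hashlib


def md5_prefix(lot_id: str, length: int = 3) -> str:
    """Return the first `length` hex characters of the MD5 digest of LOT_ID."""
    digest = hashlib.md5(lot_id.encode("utf-8")).hexdigest()
    return digest[:length]


def build_seq_id(trkg: str, old_oper_code: str, step_id: str, sep: str = "_") -> str:
    """Splice TRKG, OLD_OPER_CODE and STEP_ID into one value.

    The separator is hypothetical; the source only says the fields are spliced.
    """
    return sep.join([trkg, old_oper_code, step_id])


if __name__ == "__main__":
    print(md5_prefix("LOT0001"))
    print(build_seq_id("TRKG", "1234", "AAEWS"))
```

  • A short hash prefix of this kind is commonly used to spread sequential keys across storage regions; whether that is the intent here is not stated in the text.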
  • the embodiment of the present application provides a method for constructing an ETL system and a data processing method applied to the ETL system (hereinafter referred to as the data processing method).
  • a method for constructing an ETL system provided by the embodiment of the present application may include:
  • the multiple servers may be server 01 and server 02 in FIG. 1 or FIG. 2 .
  • Configuring multiple servers may refer to installing a preset configuration file for multiple servers, and the server after installing the preset configuration file may have an ETL function, for example, the configuration file may be Pentaho Server.
  • the setting tool can also respond to the setting operation and set a cluster configuration for multiple servers, so that the multiple servers have the same cluster ID. In this way, multiple servers can be managed through the cluster ID.
  • each server may also be set with a corresponding ID.
  • the ID of server 01 may be node1, and the ID of server 02 may be node2.
  • scan IP can be used to access multiple databases.
  • each server can store the scan IP in response to the developer's setting operation.
  • the virtual IP address can be used to establish a communication connection with multiple servers, and other devices, such as a scheduler, can access any server in the multiple servers through the virtual IP address.
  • the interface can determine working modes of multiple servers in response to a setting operation.
  • setting is made for multiple servers so that multiple servers can have the ETL function.
  • by configuring, for the multiple servers, a scan IP address through which any database in the multiple databases can be accessed, when one database fails, the other databases can continue to work normally, ensuring high availability of the databases.
  • the data processing method provided by the embodiment of the present application is applied to the setting tool, or to a component of the setting tool such as a scheduler, in the ETL system of FIG. 1 or FIG. 5 above. The method includes:
  • the first data to be processed may be Hive history information after synchronization.
  • the setting tool may acquire the first data to be processed in the first scheduling period from the Hive database.
  • the server configured with the ETL tool can send, through the setting tool, indication information that indicates the information associated with multiple source systems to the Hive database. After receiving the indication information, the Hive database may obtain the history information and bad CODE information of the products in the first scheduling period from the multiple source systems and integrate them to obtain the first data to be processed.
  • the first setting operation may refer to an operation on a setting page corresponding to the preset tool.
  • for the working modes of multiple servers, reference may be made to the above descriptions of the active-active mode and the active-standby mode, and details are not repeated here.
  • S803. Send the first indication information to the first server.
  • the first server receives the first indication information.
  • the first indication information may be used to instruct the first server to process the first data to be processed.
  • the first server is a server in a working state within the first scheduling period among the plurality of servers.
  • the setting tool can obtain the data to be processed in the first scheduling period, determine the working mode of multiple servers, and send indication information for processing the data to a server in the working state among the multiple servers, enabling that server to perform ETL processing on the data to be processed.
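  • The dispatch step summarized above can be sketched as follows. This is an illustrative model only: the `Server` and `WorkingMode` types, the `can_process` flag, and the dictionary shape of the indication information are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class WorkingMode(Enum):
    ACTIVE_ACTIVE = "active-active"
    ACTIVE_STANDBY = "active-standby"


@dataclass
class Server:
    server_id: str            # e.g. "node1"
    is_primary: bool = False  # only meaningful in active-standby mode
    can_process: bool = True  # False when the server cannot process data


def working_servers(servers: List[Server], mode: WorkingMode) -> List[Server]:
    """Return the servers that are in the working state for this period."""
    available = [s for s in servers if s.can_process]
    if mode is WorkingMode.ACTIVE_ACTIVE:
        return available
    # Active-standby: the primary works; a standby wakes only if no primary can.
    primaries = [s for s in available if s.is_primary]
    return primaries[:1] if primaries else available[:1]


def first_indication(data_id: str, servers: List[Server],
                     mode: WorkingMode) -> Optional[Dict[str, str]]:
    """Build the first indication information for one working server."""
    targets = working_servers(servers, mode)
    if not targets:
        return None
    return {"target": targets[0].server_id, "data_id": data_id}
```

  • Here the indication carries an identifier of the data rather than the data itself, which corresponds to one of the two options described for the first indication information.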
  • the method may further include S901.
  • the first server acquires first scheduling data according to the first indication information.
  • the first indication information may include first data to be processed.
  • the first server can directly acquire the first data to be processed from the first indication information.
  • the first indication information may include an identifier or a storage address of the first data to be processed.
  • the first server can acquire the first data to be processed according to the identifier or storage address of the first data to be processed.
  • the first server uses an ETL tool to process the first scheduling data to obtain processed first scheduling data.
  • the processed first scheduling data may refer to data after processing the synchronized Hive history information. Specifically, reference may be made to the above description, and details are not repeated here.
  • the method may further include:
  • detecting the working status of multiple databases may refer to detecting whether the multiple databases can work normally.
  • the working status among multiple databases may be detected through information interaction with multiple databases.
  • information can be sent to multiple databases periodically or randomly, and when the response information of the database is received, it is determined that the database is in a normal working state; when the response information of a certain database is not received, it is determined that the database is abnormal.
  • the second database may be a database without failure among the multiple databases.
  • the first database and the second database are different databases among the plurality of databases.
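  • The status detection described above can be sketched as a probe-and-select routine. The probe is passed in as a callable so the example stays self-contained; the function names and the idea of a preferred database are assumptions, and a real probe would send a request to the database and wait for its response.

```python
from typing import Callable, Dict, Iterable, Optional


def detect_status(databases: Iterable[str],
                  probe: Callable[[str], bool]) -> Dict[str, bool]:
    """Probe each database; True means a response was received."""
    return {name: probe(name) for name in databases}


def pick_available(status: Dict[str, bool], preferred: str) -> Optional[str]:
    """Return the preferred database if it responds, else any other that does."""
    if status.get(preferred):
        return preferred
    for name, ok in status.items():
        if ok:
            return name
    return None


if __name__ == "__main__":
    # Simulated probe: database 21 does not respond, database 22 does.
    responding = {"database21": False, "database22": True}
    status = detect_status(responding, lambda name: responding[name])
    print(pick_available(status, preferred="database21"))  # database22
```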
  • the method may further include:
  • S1102. Merge multiple source data to obtain first data to be processed, and store the first data to be processed in a preset storage area of a preset database.
  • the preset database can be used to store the data to be processed.
  • the preset database may be the Hive database in FIG. 6 .
  • a preset database can include multiple storage areas. The multiple storage areas can be used to store data to be processed in different scheduling periods. Data to be processed in different scheduling cycles can have unique identifiers.
  • the identifier can be a partition field.
  • the method for obtaining the first data to be processed in the first scheduling period may specifically include: extracting the first data to be processed from a preset storage area of the first database.
  • the data of multiple source databases can be integrated to obtain data to be processed in a unified format, so that the server can process the data to be processed.
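  • A minimal sketch of merging source data and storing it per scheduling period is given below. It models the storage areas as an in-memory dictionary keyed by the partition field; the use of TIMEDAY as the key follows the text, while the record shape, the function names, and the simple concatenation used as the "merge" are assumptions.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def merge_sources(sources: List[List[Record]], timeday: str) -> List[Record]:
    """Combine records from several source systems and tag each with the partition field."""
    merged: List[Record] = []
    for records in sources:
        for record in records:
            merged.append({**record, "TIMEDAY": timeday})
    return merged


def store(storage: Dict[str, List[Record]], merged: List[Record]) -> None:
    """Append each record to the storage area identified by its partition field."""
    for record in merged:
        storage.setdefault(record["TIMEDAY"], []).append(record)


def extract(storage: Dict[str, List[Record]], timeday: str) -> List[Record]:
    """Read back the data to be processed for one scheduling period."""
    return storage.get(timeday, [])
```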
  • the method may further include:
  • the second instruction information is used to instruct the first server to continue processing the first data to be processed.
  • the third instruction information is used to instruct the second server to continue processing the first data to be processed.
  • the data to be processed can be processed in a timely manner to avoid data omission.
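  • The choice between the second and third instruction information can be sketched as a small decision function. The dictionary shape of the returned instruction and the parameter names are assumptions; only the branching follows the text.

```python
from typing import Dict, Optional


def next_instruction(processing_ok: bool, first_server_working: bool,
                     first_server: str, second_server: str,
                     data_id: str) -> Optional[Dict[str, str]]:
    """Decide which follow-up instruction to send after a processing attempt."""
    if processing_ok:
        return None  # nothing to resend
    if first_server_working:
        # Failure, but the first server is still working: ask it to continue.
        return {"kind": "second", "target": first_server, "data_id": data_id}
    # Failure and the first server cannot process data: hand over to the second server.
    return {"kind": "third", "target": second_server, "data_id": data_id}
```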
  • the method may further include:
  • the fourth instruction information is used to instruct the second server to process the second data to be processed.
  • the method may also include:
  • the working modes of the plurality of servers in the second scheduling period are determined.
  • the working modes of the multiple servers in the second scheduling period may be the same as or different from the working modes of the multiple servers in the first scheduling period.
  • for example, if the working mode of multiple servers in the first scheduling period is the active-active mode, the working mode of multiple servers in the second scheduling period can be the active-standby mode or the active-active mode, without limitation.
  • the ETL system can continue to process data, which ensures the periodic work of the ETL system.
  • the method may also include:
  • the duration and scheduling frequency of the scheduling periods of the plurality of servers are determined.
  • the duration and scheduling frequency of the scheduling period may refer to the duration and scheduling frequency of the first scheduling period, or may refer to the duration and scheduling frequency of the second scheduling period.
  • the duration of the scheduling cycle and the scheduling frequency can be set as required.
  • the duration and scheduling frequency of multiple scheduling cycles can be the same or different, and are not limited.
  • the working mode of multiple servers, the duration of the scheduling cycle, and the scheduling frequency can be adjusted manually, which is flexible and convenient.
  • scheduling task is related to user requirements.
  • User needs can refer to determining why a product has a CODE.
  • Scheduling tasks may refer to performing ETL processing on product history information and CODE information.
  • scheduling a task may refer to obtaining the synchronized Hive history table from the Hive database according to a scheduling cycle, and processing the data in the acquired table.
  • the synchronized Hive history table may refer to the history table after synchronization of glass history information and bad CODE information.
  • bad CODE may be generated during the production process of each glass. For example, scratches in the box, foreign body Gap, abnormal lighting, black spot CODE, etc. User needs can refer to determining the cause of the bad CODE of the glass. Scheduling tasks may refer to performing ETL processing on the history information and bad CODE information of each glass.
  • the process of establishing data synchronization and data processing jobs through the ETL system may include: start → obtain scheduling time → data synchronization (data sync) → data processing → update scheduling time → processing succeeded.
  • obtaining the scheduling time may refer to determining the duration of the first scheduling period.
  • the duration of the first scheduling cycle can be set as required. For example, with days as the granularity, the first scheduling cycle can run from 0:00 to 24:00, or from 6:00 of the current day to 6:00 of the next day; other time periods are also possible, without limitation.
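  • With days as the granularity, the scheduling window can be computed as sketched below. The function name `scheduling_window` and the half-open interval convention are assumptions for illustration.

```python
from datetime import date, datetime, time, timedelta
from typing import Tuple


def scheduling_window(day: date, start: time = time(0, 0)) -> Tuple[datetime, datetime]:
    """Return the [begin, end) window of a one-day scheduling cycle.

    start=time(0, 0) gives 0:00 to 24:00 of the given day;
    start=time(6, 0) gives 6:00 of the given day to 6:00 of the next day.
    """
    begin = datetime.combine(day, start)
    return begin, begin + timedelta(days=1)


if __name__ == "__main__":
    print(scheduling_window(date(2022, 2, 18)))
    print(scheduling_window(date(2022, 2, 18), time(6, 0)))
```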
  • the data synchronization process may be as shown in FIG. 14 .
  • table input refers to obtaining source data from the Oracle database
  • table output refers to the Parquet output component writing data to the storage area of the Hive database.
  • the data processing may include: filtering the obtained GLASS history information table according to the site information and time information, calculating information such as the stay time and equipment of each GLASS at the same site to obtain the processed data, and writing the processed data into the Hbase database.
  • the processed data can refer to the table 4 above.
  • the synchronized Hive history information may be converted in a multi-process manner.
  • the branch corresponding to table input 1 in FIG. 15 may refer to the information related to the device in the converted and synchronized Hive history information.
  • the branch corresponding to table input 2 may refer to the information related to the subunit of the device in the converted and synchronized Hive history information.
  • table input 1 can refer to the information related to the device (such as EQP_ID) in the synchronized Hive history information. Row to column 1 can refer to converting the device-related history information of each row into columns; the purpose of converting rows to columns is to convert the data into the input format required by the Hbase database.
  • Filtering records may refer to filtering null values (NULL) in data.
  • Table input 2 may refer to the information (such as UNIT_ID) related to the subunit of the device in the synchronized Hive history information.
  • Table output may refer to storing/writing Table 4 into the Hbase database.
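  • The record filtering and row-to-column steps can be approximated in plain Python as below. This is only one possible reading of those steps: the cell layout (row key, column, value) is an assumption about the input format a column-oriented store expects, and the function names are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


def filter_records(rows: Iterable[Row], required: List[str]) -> List[Row]:
    """Drop rows in which any required field is missing or NULL (None)."""
    return [r for r in rows if all(r.get(f) is not None for f in required)]


def row_to_cells(row: Row, key_field: str) -> List[Tuple[str, str, Any]]:
    """Turn one row into (row key, column name, value) cells."""
    row_key = str(row[key_field])
    return [(row_key, col, val) for col, val in row.items() if col != key_field]


if __name__ == "__main__":
    rows = [
        {"GLS_ID": "G1", "EQP_ID": "AAEWS07"},
        {"GLS_ID": "G2", "EQP_ID": None},  # filtered out: NULL EQP_ID
    ]
    for row in filter_records(rows, ["GLS_ID", "EQP_ID"]):
        print(row_to_cells(row, "GLS_ID"))
```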
  • the data processing efficiency of the ETL platform can be improved.
  • updating the scheduling time may refer to determining the scheduling period of the next scheduled task.
  • the user console can be used to set the scheduling period of multiple servers.
  • the developer may log in to the login interface shown in FIG. 16 and enter a user name and password. If the entered user name and corresponding password are correct, the pages shown in Figure 2 and Figure 3 can be displayed.
  • correct user name and password may mean that the input user name and corresponding password are the same as the stored user name and corresponding password.
  • communication connections can be established with the multiple servers through a client (such as a computer).
  • the client may be provided with a user console, and the user console may be an application program or a page for controlling multiple servers.
  • the client may display a login interface as shown in FIG. 16 .
  • the page corresponding to the F5 tool can also be displayed through the client. That is, the client communicates with the F5 tool.
  • the client may be configured with an application or a webpage for controlling the F5 tool, through which the F5 tool may be controlled.
  • this embodiment of the present application also shows the data processing behavior after a simulated server outage.
  • the specific simulation process can include:
  • FIG. 17 shows the operation status of server 01 (IP address XX.XX.XX.28), and FIG. 18 shows the operation status of server 02 (IP address XX.XX.XX.27).
  • both servers can operate normally.
  • the F5 tool controls the server 02 to stop running, and controls the server 01 to continue running.
  • the F5 tool can adjust the server load. Both the server 01 and the server 02 can perform data processing operations.
  • the ETL job will be executed normally on another server to meet the high availability requirements of ETL in the actual production environment.
  • the embodiment of the present application can divide the construction device of the ETL system into functional modules or functional units according to the above method example. For example, each functional module or functional unit can be divided corresponding to each function, or two or more functions can be integrated in one processing module.
  • the above-mentioned integrated modules can be implemented not only in the form of hardware, but also in the form of software function modules or functional units.
  • the division of modules or units in the embodiment of the present application is schematic, and is only a logical function division, and there may be another division manner in actual implementation.
  • FIG. 20 is a schematic structural diagram of an ETL system construction device provided by the embodiment of the present application. The device includes: a configuration unit 201 and a processing unit 202.
  • the configuration unit 201 is configured to: configure multiple servers so that the configured servers have ETL functions, each server having a corresponding identification; and configure a scan IP address for the multiple servers, so that the multiple servers can access any database in multiple databases through the scan IP address, the multiple databases having high availability.
  • the processing unit 202 is configured to: set, through a preset tool, a virtual IP address for the multiple servers and a working mode of the multiple servers. The virtual IP address is used to establish a communication connection with the multiple servers, and the working mode of the multiple servers includes an active-active mode and an active-standby mode.
  • the multiple databases having high availability includes: when the first database fails to work, the second database continues to work, and the first database and the second database are different databases among the multiple databases.
  • in the active-active mode, the multiple servers are all in the working state; when the first server cannot process the data, the second server continues to process the data, and the first server and the second server are different servers among the multiple servers.
  • the multiple servers include a primary server and a standby server, the primary server is used to process data, and the standby server is used to continue processing data when the primary server cannot process data.
  • FIG. 21 is a schematic structural diagram of a data processing device provided in the embodiment of the present application, which is applied to an ETL system.
  • the ETL system includes multiple servers, and the multiple servers are all equipped with ETL tools.
  • the ETL tools are used to process data.
  • the apparatus includes: an acquiring unit 211, a determining unit 212, and a sending unit 213.
  • the acquiring unit 211 is configured to obtain the first data to be processed in the first scheduling period.
  • the determining unit 212 is configured to determine the working modes of the multiple servers in the first scheduling period in response to the first setting operation, wherein the working modes include an active-active mode and an active-standby mode. In the active-active mode, multiple servers are in the working state; in the active-standby mode, multiple servers include the primary server and the standby server, the primary server is in the working state, and the standby server is in the dormant state; when the primary server cannot process data, the standby server changes from the dormant state to the working state.
  • the sending unit 213 is configured to send first indication information to the first server, where the first indication information is used to instruct the first server to process the first data to be processed, and the first server is a server in a working state within the first scheduling period.
  • the ETL system also includes a plurality of databases. The plurality of servers are respectively communicatively connected to the plurality of databases, the plurality of databases are communicatively connected to one another, and the plurality of databases are used to store one or more of configuration information, data processing procedures, and scheduling tasks of the servers.
  • the device also includes a detection unit 214 configured to: detect the working status of multiple databases; and, when it is detected that the first database cannot work, store the data stored in the first database to the second database, where the first database and the second database are different databases among the plurality of databases.
  • the device further includes a receiving unit 215 and a processing unit 216.
  • the receiving unit 215 is configured to receive source data from multiple systems in the first scheduling period.
  • the processing unit 216 is configured to combine multiple source data to obtain first data to be processed, and store the first data to be processed in a preset storage area of the first database. The first database includes multiple storage areas, and each storage area is used to store data to be processed in a different scheduling period.
  • the obtaining unit 211 is specifically configured to extract the first data to be processed from a preset storage area of the Hive database.
  • the detection unit 214 is further configured to detect the processing result of the first server processing the first data to be processed; when the processing result is failure and the first server is in the working state, control the sending unit 213 to send second instruction information to the first server, where the second instruction information is used to instruct the first server to continue processing the first data to be processed; when the processing result is failure and the first server cannot process the data, control the sending unit 213 to send third instruction information to the second server, where the third instruction information is used to instruct the second server to continue processing the first data to be processed, and the second server is a server in a working state among the plurality of servers.
  • the acquiring unit 211 is further configured to acquire the second data to be processed in the second scheduling period; the sending unit 213 is further configured to send, to the first server, fourth indication information used to instruct the first server to process the second data to be processed.
  • the determining unit 212, before the fourth indication information is sent to the first server, is further configured to determine the working modes of the multiple servers in the second scheduling period in response to the second setting operation.
  • the processing unit 216 is further configured to adjust the duration and scheduling frequency of the scheduling periods of the multiple servers in response to the third setting operation.
  • the acquisition unit 211 in the embodiment of the present application may be integrated on a communication interface, and the configuration unit 201 and the processing unit 202 may be integrated on a processor.
  • the specific implementation is shown in Figure 22.
  • Fig. 22 shows a schematic structural diagram of another possible communication device of the construction device of the ETL system and the data processing device involved in the above embodiment.
  • the communication device includes: a processor 2202 and a communication interface 2203.
  • the processor 2202 is used to control and manage the actions of the device, for example, to execute the steps executed by the determining unit 212 and the processing unit 216 above, and/or to execute other processes of the technologies described herein.
  • the communication interface 2203 is used to support communication between the device and other network entities, for example, to perform the steps performed by the above-mentioned obtaining unit 211 .
  • the device may also include a memory 2201 and a bus 2204, and the memory 2201 is used to store program codes and data of the device.
  • the memory 2201 may be a memory in the device, etc. The memory may include a volatile memory, such as a random access memory; a non-volatile memory, such as a read-only memory, flash memory, hard disk, or solid-state drive; or a combination of the above-mentioned types of memory.
  • the above-mentioned processor 2202 may realize or execute various exemplary logic blocks, modules and circuits described in conjunction with the disclosure of this application.
  • the processor may be a central processing unit, a general purpose processor, a digital signal processor, an application specific integrated circuit, a field programmable gate array or other programmable logic devices, transistor logic devices, hardware components or any combination thereof. It can implement or execute the various illustrative logical blocks, modules and circuits described in connection with the present disclosure.
  • the processor may also be a combination of computing functions, for example, a combination of one or more microprocessors, a combination of DSP and a microprocessor, and the like.
  • the bus 2204 may be an Extended Industry Standard Architecture (EISA) bus or the like.
  • the bus 2204 can be divided into address bus, data bus, control bus and so on. For ease of representation, only one thick line is used in FIG. 22, but it does not mean that there is only one bus or one type of bus.
  • the device in Fig. 22 may also be a chip.
  • the chip includes one or more processors 2202 and a communication interface 2203.
  • the chip further includes a memory 2201.
  • the memory 2201 may include a read-only memory and a random access memory, and provides operation instructions and data to the processor 2202.
  • a part of the memory 2201 may also include a non-volatile random access memory (NVRAM).
  • the memory 2201 stores the following elements, execution modules or data structures, or their subsets, or their extended sets.
  • the corresponding operation is executed by calling the operation instruction stored in the memory 2201 (the operation instruction may be stored in the operating system).
  • Some embodiments of the present disclosure provide a computer-readable storage medium (for example, a non-transitory computer-readable storage medium), where computer program instructions are stored in the computer-readable storage medium.
  • the computer is made to execute the construction method of the ETL system as described in any one of the above embodiments.
  • the above-mentioned computer-readable storage medium may include, but is not limited to: a magnetic storage device (for example, a hard disk, a floppy disk, or a magnetic tape), an optical disk (for example, a CD (Compact Disk) or a DVD (Digital Versatile Disk)), smart cards and flash memory devices (for example, an EPROM (Erasable Programmable Read-Only Memory), a card, a stick or a key drive).
  • Various computer-readable storage media described in this disclosure can represent one or more devices and/or other machine-readable storage media for storing information.
  • the term "machine-readable storage medium" may include, but is not limited to, wireless channels and various other media capable of storing, containing and/or carrying instructions and/or data.
  • Some embodiments of the present disclosure also provide a computer program product, for example, the computer program product is stored on a non-transitory computer-readable storage medium.
  • the computer program product includes computer program instructions, and when the computer program instructions are executed on the computer, the computer program instructions cause the computer to execute the methods described in the above-mentioned embodiments.
  • Some embodiments of the present disclosure also provide a computer program.
  • when the computer program is executed on the computer, the computer program causes the computer to execute the methods described in the above-mentioned embodiments.
  • the disclosed systems, devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components can be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.

Abstract

An ETL system construction method and apparatus, a data processing method and apparatus, and an ETL system, which relate to the technical field of data processing and allow rational use of an ETL platform. The ETL system comprises a setting tool and a plurality of servers, wherein the setting tool is in communication connection with the plurality of servers by means of virtual IPs, the setting tool is used for determining scheduling tasks and scheduling cycles of the plurality of servers in response to a first setting operation, and the setting tool is used for determining working modes of the plurality of servers in each scheduling cycle in response to a second setting operation, the working modes comprising an active-active mode and an active-standby mode; and the plurality of servers are used for receiving the scheduling tasks from the setting tool, processing data according to the scheduling tasks, and storing the processed data in an HBase database. The ETL system provided in the embodiments of the present application ensures the normal operation of an ETL platform by means of the high availability of a plurality of servers and a plurality of databases.

Description

Construction method and apparatus for an ETL system, data processing method and apparatus, and ETL system
Technical Field
The present disclosure relates to the technical field of data processing, and in particular to a construction method and apparatus for an ETL system, a data processing method and apparatus, and an ETL system.
Background
As an enterprise grows, its departments (such as business lines, production lines, and product lines) build various information systems to handle their own business. For example, an information system built for a production line can consolidate the product data uploaded by production-line equipment with the product data uploaded manually and then analyze it, so that developers can determine the cause of product defects from the analysis results.
To integrate the data of the various departments and avoid data silos, an extract-transform-load (ETL) platform that meets the data integration requirements is needed. An ETL platform extracts data from the source, transforms it, and loads it into the destination, so that scattered, disorganized data with inconsistent standards can be brought together for developers to analyze. How to use an ETL platform rationally has therefore become a technical problem that urgently needs to be solved.
Summary
In one aspect, a method for constructing an ETL system is provided. The ETL system includes a plurality of servers and a plurality of databases, and the method includes: configuring the plurality of servers so that the configured servers have an ETL function and each server has a corresponding identifier; configuring a scan IP address for the plurality of servers so that the servers can access any one of the plurality of databases through the scan IP address, the plurality of databases having high availability; and setting, by means of a setting tool, a virtual IP address for the plurality of servers and a working mode of the plurality of servers, where the virtual IP address is used to establish a communication connection with the plurality of servers, and the working modes of the plurality of servers include an active-active mode and an active-standby mode.
Based on the above technical solution, the plurality of servers are configured so that they have the ETL function. In addition, because a scan IP address is configured through which the servers can access any one of the plurality of databases, when one database fails the other databases can continue to work normally, which ensures high availability among the databases. The setting tool sets the virtual IP address and the working mode for the plurality of servers, so that any server can be reached through the virtual IP address; in other words, the high availability among the servers ensures that the ETL platform can run normally.
In some embodiments, the plurality of databases having high availability means that, when a first database cannot work, a second database continues to work, the first database and the second database being different databases among the plurality of databases.
In some embodiments, in the active-active mode, the plurality of servers are all in a working state; when a first server cannot process data, a second server continues to process the data, the first server and the second server being different servers among the plurality of servers.
In some embodiments, in the active-standby mode, the plurality of servers include a primary server and a standby server; the primary server is used to process data, and the standby server is used to continue processing the data when the primary server cannot process it.
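The difference between the two working modes can be illustrated with a minimal Python sketch. The names `Mode`, `Server`, `healthy`, `is_primary` and `working_servers` are hypothetical and are introduced only for illustration; the disclosure does not prescribe any particular data structure or selection rule beyond the behavior described above.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    ACTIVE_ACTIVE = "active-active"
    ACTIVE_STANDBY = "active-standby"


@dataclass
class Server:
    server_id: str            # the identifier assigned to each server
    healthy: bool = True      # False when the server cannot process data
    is_primary: bool = False  # only meaningful in the active-standby mode


def working_servers(servers, mode):
    """Return the servers that should process data under the given mode."""
    healthy = [s for s in servers if s.healthy]
    if mode is Mode.ACTIVE_ACTIVE:
        # Every server that can still process data is in the working state.
        return healthy
    # Active-standby: the primary works alone while it is able to.
    primaries = [s for s in healthy if s.is_primary]
    if primaries:
        return primaries[:1]
    # The primary cannot process data, so one standby takes over.
    return healthy[:1]
```

With two servers, of which the first is the primary, the active-standby mode selects only the first server until it is marked unhealthy, after which the second server is selected. In the active-active mode both servers are selected for as long as both are healthy.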
In another aspect, an apparatus for constructing an ETL system is provided, including a configuration unit and a processing unit. The configuration unit is configured to: configure a plurality of servers so that the configured servers have an ETL function and each server has a corresponding identifier; and configure a scan IP address for the plurality of servers so that the servers can access any one of a plurality of databases through the scan IP address, the plurality of databases having high availability. The processing unit is configured to: set, by means of a setting tool, a virtual IP address for the plurality of servers and a working mode of the plurality of servers, where the virtual IP address is used to establish a communication connection with the plurality of servers, and the working modes of the plurality of servers include an active-active mode and an active-standby mode.
In some embodiments, the plurality of databases having high availability means that, when a first database cannot work, a second database continues to work, the first database and the second database being different databases among the plurality of databases.
In some embodiments, in the active-active mode, the plurality of servers are all in a working state; when a first server cannot process data, a second server continues to process the data, the first server and the second server being different servers among the plurality of servers.
In some embodiments, in the active-standby mode, the plurality of servers include a primary server and a standby server; the primary server is used to process data, and the standby server is used to continue processing the data when the primary server cannot process it.
In yet another aspect, a data processing method is provided, applied to an ETL system. The ETL system includes a plurality of servers, each configured with an ETL tool used to process data. The method includes: acquiring first data to be processed of a first scheduling period; in response to a first setting operation, determining the working mode of the plurality of servers in the first scheduling period, where the working modes include an active-active mode and an active-standby mode; in the active-active mode, the plurality of servers are all in a working state; in the active-standby mode, the plurality of servers include a primary server and a standby server, the primary server is in the working state, and the standby server is in a dormant state; when the primary server cannot process data, the standby server switches from the dormant state to the working state; and sending first indication information to a first server, where the first indication information instructs the first server to process the first data to be processed, and the first server is a server in the working state in the first scheduling period.
Based on the above technical solution, after the data to be processed of the first scheduling period is acquired, the working mode of the plurality of servers is determined, and indication information for processing the data is sent to a server in the working state among the plurality of servers, so that the server can perform ETL processing on the data to be processed.
In some embodiments, the ETL system further includes a plurality of databases. The plurality of servers are each communicatively connected to the plurality of databases, the databases are communicatively connected to one another, and the databases are used to store one or more of the servers' configuration information, data processing processes, and scheduling tasks. The method further includes: detecting the working status of the plurality of databases; and, when it is detected that a first database cannot work, storing the data stored in the first database into a second database, the first database and the second database being different databases among the plurality of databases.
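The detection-and-transfer step can be sketched in Python as follows. This is only an illustration under stated assumptions: in-memory dictionaries stand in for the databases, the `status` mapping stands in for whatever health check is actually used, and the function name `fail_over` is hypothetical. How the contents of a database that cannot work are actually read (for example from a replica or shared storage) is not specified in the text and is not modeled here.

```python
def fail_over(databases, status):
    """Store the data of every database that cannot work into one that can.

    databases: dict mapping a database name to a dict of stored records
               (an in-memory stand-in for the real databases)
    status:    dict mapping a database name to True (working) or False
    Returns a dict mapping each failed database to the database that
    received its data.
    """
    working = [name for name in databases if status.get(name, False)]
    if not working:
        raise RuntimeError("no working database available")
    target = working[0]
    moved = {}
    for name, records in databases.items():
        if status.get(name, False):
            continue  # this database is working, nothing to do
        for key, value in records.items():
            # Keep the target's own value if the key already exists there.
            databases[target].setdefault(key, value)
        moved[name] = target
    return moved
```

For example, with `databases = {"db1": {"task": "daily"}, "db2": {}}` and a status of `{"db1": False, "db2": True}`, the call reports that the data of `db1` was stored into `db2`.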
In some embodiments, the method further includes: receiving source data of the first scheduling period from a plurality of systems; and merging the plurality of pieces of source data to obtain the first data to be processed, and storing the first data to be processed in a preset storage area of a first database, where the first database includes a plurality of storage areas used to store the data to be processed of different scheduling periods. The above "acquiring first data to be processed of a first scheduling period" may specifically include: extracting the first data to be processed from the preset storage area of the Hive database.
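The merge-then-extract flow can be illustrated with a short Python sketch. The class `StagingStore` and its methods are hypothetical stand-ins: a dictionary keyed by scheduling period plays the role of the storage areas, and no real Hive API is used or implied.

```python
from collections import defaultdict


class StagingStore:
    """In-memory stand-in for a database with one storage area per scheduling period."""

    def __init__(self):
        self._areas = defaultdict(list)

    def store(self, period, sources):
        """Merge the records of several source systems into the area for `period`.

        Returns the number of records that were merged.
        """
        merged = [record for records in sources for record in records]
        self._areas[period].extend(merged)
        return len(merged)

    def extract(self, period):
        """Return the data to be processed for `period` (empty if nothing was stored)."""
        return list(self._areas.get(period, []))
```

A caller would first call `store("2022-02-18", [records_from_system_a, records_from_system_b])` and later call `extract("2022-02-18")` to obtain the merged data for that period; the period label used here is an arbitrary example.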
In some embodiments, the method further includes: detecting the result of the first server's processing of the first data to be processed; when the processing result is a failure and the first server is in the working state, sending second indication information to the first server, where the second indication information instructs the first server to continue processing the first data to be processed; and when the processing result is a failure and the first server cannot process data, sending third indication information to a second server, where the third indication information instructs the second server to continue processing the first data to be processed, and the second server is a server in the working state among the plurality of servers.
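The decision described above reduces to a small branch, sketched here in Python. The function name `handle_result`, its parameters, and the string labels for the indication information are hypothetical; raising an error when no server is available is also an assumption, since the text does not describe that case.

```python
def handle_result(result_ok, first_server, first_server_working, other_working_servers):
    """Decide which indication information to send after a processing attempt.

    Returns None when processing succeeded, otherwise a tuple
    (target server, indication label).
    """
    if result_ok:
        return None
    if first_server_working:
        # Failure, but the first server can still work: ask it to continue.
        return (first_server, "second")
    if other_working_servers:
        # Failure and the first server cannot process data: hand over.
        return (other_working_servers[0], "third")
    # Not covered by the text; treated as an error in this sketch.
    raise RuntimeError("no server in the working state")
```

For instance, `handle_result(False, "etl-1", False, ["etl-2"])` selects `etl-2` as the recipient of the third indication information.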
In some embodiments, when the processing result is a success, the method further includes: acquiring second data to be processed of a second scheduling period; and sending, to the first server, fourth indication information that instructs the first server to process the second data to be processed.
In some embodiments, before the fourth indication information is sent to the first server, the method further includes: in response to a second setting operation, determining the working mode of the plurality of servers in the second scheduling period.
In some embodiments, the method further includes: in response to a third setting operation, determining the duration of the scheduling period and the scheduling frequency of the plurality of servers.
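One possible reading of "duration of the scheduling period" and "scheduling frequency" is sketched below in Python. It assumes, purely for illustration, that the frequency is a number of evenly spaced runs per period; the function `run_times` and its parameters are hypothetical, and the text does not define the exact semantics.

```python
from datetime import datetime, timedelta


def run_times(start, period, runs_per_period, periods=1):
    """Return evenly spaced trigger times.

    start:           start of the first scheduling period
    period:          duration of one scheduling period
    runs_per_period: assumed meaning of "scheduling frequency"
    periods:         how many consecutive periods to cover
    """
    if runs_per_period < 1 or periods < 1:
        raise ValueError("runs_per_period and periods must be at least 1")
    step = period / runs_per_period
    return [start + i * step for i in range(runs_per_period * periods)]
```

With a 24-hour period starting at midnight and four runs per period, the triggers fall at 00:00, 06:00, 12:00 and 18:00.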
In still another aspect, a data processing apparatus is provided, applied to an ETL system. The ETL system includes a plurality of servers, each configured with an ETL tool used to process data. The apparatus includes: an acquiring unit configured to acquire first data to be processed of a first scheduling period; a determining unit configured to determine, in response to a first setting operation, the working mode of the plurality of servers in the first scheduling period, where the working modes include an active-active mode and an active-standby mode; in the active-active mode, the plurality of servers are all in a working state; in the active-standby mode, the plurality of servers include a primary server and a standby server, the primary server is in the working state, and the standby server is in a dormant state; when the primary server cannot process data, the standby server switches from the dormant state to the working state; and a sending unit configured to send first indication information to a first server, where the first indication information instructs the first server to process the first data to be processed, and the first server is a server in the working state in the first scheduling period.
In some embodiments, the ETL system further includes a plurality of databases. The plurality of servers are each communicatively connected to the plurality of databases, the databases are communicatively connected to one another, and the databases are used to store one or more of the servers' configuration information, data processing processes, and scheduling tasks. The apparatus further includes a detection unit configured to: detect the working status of the plurality of databases; and, when it is detected that a first database cannot work, store the data stored in the first database into a second database, the first database and the second database being different databases among the plurality of databases.
In some embodiments, the apparatus further includes a receiving unit and a processing unit. The receiving unit is configured to receive source data of the first scheduling period from a plurality of systems. The processing unit is configured to merge the plurality of pieces of source data to obtain the first data to be processed, and to store the first data to be processed in a preset storage area of a first database, where the first database includes a plurality of storage areas used to store the data to be processed of different scheduling periods. The acquiring unit is specifically configured to extract the first data to be processed from the preset storage area of the Hive database.
In some embodiments, the detection unit is further configured to: detect the result of the first server's processing of the first data to be processed; when the processing result is a failure and the first server is in the working state, control the sending unit to send second indication information to the first server, where the second indication information instructs the first server to continue processing the first data to be processed; and when the processing result is a failure and the first server cannot process data, control the sending unit to send third indication information to a second server, where the third indication information instructs the second server to continue processing the first data to be processed, and the second server is a server in the working state among the plurality of servers.
In some embodiments, when the processing result is a success, the acquiring unit is further configured to acquire second data to be processed of a second scheduling period, and the sending unit is further configured to send, to the first server, fourth indication information that instructs the first server to process the second data to be processed.
In some embodiments, before the fourth indication information is sent to the first server, the determining unit is further configured to determine, in response to a second setting operation, the working mode of the plurality of servers in the second scheduling period.
In some embodiments, the determining unit is further configured to determine, in response to a third setting operation, the duration of the scheduling period and the scheduling frequency of the plurality of servers.
In yet another aspect, an ETL system is provided. The ETL system includes a setting tool and a plurality of servers, and the setting tool is communicatively connected to the plurality of servers through a virtual IP. The setting tool is used to determine, in response to a first setting operation, the scheduling tasks and scheduling periods of the plurality of servers. The setting tool is further used to determine, in response to a second setting operation, the working mode of the plurality of servers in each scheduling period, where the working modes include an active-active mode and an active-standby mode; in the active-active mode, the plurality of servers are all in a working state; in the active-standby mode, the plurality of servers include a primary server and a standby server, the primary server is in the working state, and the standby server is in a dormant state; when the primary server cannot process data, the standby server switches from the dormant state to the working state. The plurality of servers are used to receive the scheduling tasks from the scheduler, process data according to the scheduling tasks, and store the processed data in an HBase database.
In some embodiments, the ETL system further includes a plurality of databases. The plurality of servers access the plurality of databases through a scan IP address, the databases are used to store one or more of the servers' configuration information, data processing processes, and scheduling tasks, and the plurality of databases have high availability.
In some embodiments, the plurality of databases having high availability means that, when a first database cannot work, a second database continues to work, the first database and the second database being different databases among the plurality of databases.
In some embodiments, the setting tool is further used to adjust, in response to a third setting operation, the scheduling tasks and/or scheduling periods of the plurality of servers.
In still another aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores computer program instructions which, when run on a computer, cause the computer to execute the ETL system construction method and the data processing method described in any of the above embodiments.
In still another aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores computer program instructions which, when run on a computer, cause the computer to execute the data processing method described in any of the above embodiments.
In still another aspect, a computer program product is provided. The computer program product includes computer program instructions which, when executed on a computer, cause the computer to execute the ETL system construction method described in any of the above embodiments.
In still another aspect, a computer program product is provided. The computer program product includes computer program instructions which, when executed on a computer, cause the computer to execute the data processing method described in any of the above embodiments.
In still another aspect, a computer program is provided. When the computer program is executed on a computer, it causes the computer to execute the ETL system construction method described in any of the above embodiments.
In still another aspect, a computer program is provided. When the computer program is executed on a computer, it causes the computer to execute the data processing method described in any of the above embodiments.
In still another aspect, a chip is provided. The chip includes a processor and a communication interface coupled to the processor, and the processor is used to run computer programs or instructions to implement the ETL system construction method described in any of the above embodiments.
In still another aspect, a chip is provided. The chip includes a processor and a communication interface coupled to the processor, and the processor is used to run computer programs or instructions to implement the data processing method described in any of the above embodiments.
Specifically, the chip provided in the present application further includes a memory for storing the computer programs or instructions.
It should be noted that the above computer instructions may be stored in whole or in part on a computer-readable storage medium. The computer-readable storage medium may be packaged together with the processor of the apparatus or packaged separately from it, which is not limited in the present application.
In the present application, the names of the above apparatuses do not limit the devices or functional modules themselves; in an actual implementation, these devices or functional modules may appear under other names. As long as the functions of the devices or functional modules are similar to those in the present application, they fall within the scope of the claims of the present application and their equivalents.
Brief Description of the Drawings
To describe the technical solutions of the present disclosure more clearly, the drawings used in some embodiments of the present disclosure are briefly introduced below. The drawings described below are merely drawings of some embodiments of the present disclosure, and those of ordinary skill in the art can derive other drawings from them. In addition, the drawings described below may be regarded as schematic diagrams and do not limit the actual size of the products, the actual flow of the methods, or the actual timing of the signals involved in the embodiments of the present disclosure.
FIG. 1 is a structural diagram of an ETL system provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of an interface for setting a scheduling period provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of another interface for setting a scheduling period provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of an interface of an F5 tool provided by an embodiment of the present application;
FIG. 5 is a structural diagram of another ETL system provided by an embodiment of the present application;
FIG. 6 is a structural diagram of yet another ETL system provided by an embodiment of the present application;
FIG. 7 is a schematic flowchart of a method for constructing an ETL system provided by an embodiment of the present application;
FIG. 8 is a schematic flowchart of a data processing method provided by an embodiment of the present application;
FIG. 9 is a schematic flowchart of another data processing method provided by an embodiment of the present application;
FIG. 10 is a schematic flowchart of yet another data processing method provided by an embodiment of the present application;
FIG. 11 is a schematic flowchart of yet another data processing method provided by an embodiment of the present application;
FIG. 12 is a schematic flowchart of yet another data processing method provided by an embodiment of the present application;
FIG. 13 is a schematic flowchart of a data processing method provided by an embodiment of the present application;
FIG. 14 is a schematic flowchart of a data processing method provided by an embodiment of the present application;
FIG. 15 is a schematic flowchart of a data processing method provided by an embodiment of the present application;
FIG. 16 is a schematic diagram of a login interface provided by an embodiment of the present application;
FIG. 17 is a schematic diagram of a data processing result of a server provided by an embodiment of the present application;
FIG. 18 is a schematic diagram of another data processing result of a server provided by an embodiment of the present application;
FIG. 19 is a schematic diagram of another data processing result of a server provided by an embodiment of the present application;
FIG. 20 is a schematic diagram of an apparatus for constructing an ETL system provided by an embodiment of the present application;
FIG. 21 is a schematic diagram of a data processing apparatus provided by an embodiment of the present application;
FIG. 22 is a schematic diagram of a communication apparatus provided by an embodiment of the present application.
Detailed Description
The technical solutions in some embodiments of the present disclosure are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments provided in the present disclosure fall within the protection scope of the present disclosure.
Unless the context requires otherwise, throughout the specification and claims, the term "comprise" and its other forms, such as the third-person singular "comprises" and the present participle "comprising", are to be interpreted in an open, inclusive sense, that is, "including, but not limited to". In the description, the terms "one embodiment", "some embodiments", "exemplary embodiments", "example", "specific example", and "some examples" are intended to indicate that a particular feature, structure, material, or characteristic related to the embodiment or example is included in at least one embodiment or example of the present disclosure. These terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics may be included in any one or more embodiments or examples in any suitable manner.
Hereinafter, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or as implicitly specifying the number of the indicated technical features. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include one or more such features. In the description of the embodiments of the present disclosure, unless otherwise specified, "multiple" means two or more.
In describing some embodiments, the expressions "coupled" and "connected" and their derivatives may be used. For example, the term "connected" may be used to indicate that two or more components are in direct physical or electrical contact with each other. As another example, the term "coupled" may be used to indicate that two or more components are in direct physical or electrical contact. However, the term "coupled" or "communicatively coupled" may also mean that two or more components are not in direct contact with each other but still cooperate or interact with each other. The embodiments disclosed herein are not necessarily limited to the content herein.
"At least one of A, B, and C" has the same meaning as "at least one of A, B, or C"; both include the following combinations of A, B, and C: A only, B only, C only, a combination of A and B, a combination of A and C, a combination of B and C, and a combination of A, B, and C.
"A and/or B" includes the following three combinations: A only, B only, and a combination of A and B.
As used herein, depending on the context, the term "if" is optionally interpreted to mean "when", "upon", "in response to determining", or "in response to detecting". Similarly, depending on the context, the phrase "if it is determined that ..." or "if [the stated condition or event] is detected" is optionally interpreted to mean "upon determining ...", "in response to determining ...", "upon detecting [the stated condition or event]", or "in response to detecting [the stated condition or event]".
The use of "suitable for" or "configured to" herein is meant as open and inclusive language that does not exclude devices suitable for or configured to perform additional tasks or steps.
In addition, the use of "based on" is meant to be open and inclusive, in that a process, step, calculation, or other action that is "based on" one or more stated conditions or values may in practice be based on additional conditions or on values beyond those stated.
As used herein, "about", "roughly", or "approximately" includes the stated value as well as an average value within an acceptable range of deviation from the particular value, where the acceptable range of deviation is determined by a person of ordinary skill in the art in view of the measurement in question and the error associated with the measurement of the particular quantity (that is, the limitations of the measurement system).
The terms used in the embodiments of the present application are explained below for ease of understanding.
ETL: describes the process of extracting data from a source, transforming it, and loading it into a destination. ETL may refer to extracting the required data from different data sources, cleaning and transforming the extracted data, and then loading the cleaned and transformed data into a database.
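The three stages named in this definition can be illustrated with a minimal Python sketch. It is not taken from the application: the in-memory sources, the `id` and `name` fields, the cleaning rule, and the dictionary standing in for the target database are all assumptions chosen for illustration.

```python
def extract(sources):
    """Extract: pull rows from every data source into one list."""
    return [row for source in sources for row in source]


def transform(rows):
    """Clean and convert: drop rows without an id, normalize the name field."""
    cleaned = []
    for row in rows:
        if row.get("id") is None:
            continue  # cleaning step: discard incomplete rows
        cleaned.append({
            "id": int(row["id"]),                            # conversion step
            "name": str(row.get("name", "")).strip().upper(),
        })
    return cleaned


def load(rows, database):
    """Load: write transformed rows into the target store, keyed by id."""
    for row in rows:
        database[row["id"]] = row


def run_etl(sources, database):
    """Run extract, transform, and load in order and return the target store."""
    load(transform(extract(sources)), database)
    return database
```

For instance, `run_etl([[{"id": "1", "name": " glass "}]], {})` would leave a single cleaned row under key `1` in the returned dictionary.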
In one example, a node or server may be configured with an ETL tool that performs the ETL function. For example, the ETL tool may be an application program or a device capable of performing that function, without limitation.
It should be noted that, in the embodiments of the present application, a node or server configured with an ETL tool may be referred to as an ETL platform/system, or may be part of an ETL platform/system.
Database: may be used to store the data required for the operation of the ETL platform, such as data extracted from different data sources, cleaned data, and transformed data. It may also be used to store ETL-related configuration files. For example, the database may store the job content that ETL submits to the server side, ETL configuration information, ETL scheduling information, and the like. For details, refer to the description of the following embodiments.
For example, the database may be a relational database management system (RDBMS), such as a MySQL database, an Oracle database, or a PostgreSQL database.
It should be noted that, in the embodiments of the present application, high availability is provided among the multiple databases and among the multiple servers configured with ETL tools.
High availability: means that a system is specially designed so that, when one device fails, the other devices can continue to operate, thereby reducing downtime and maintaining the availability of the system.
For example, take a system that has multiple databases with high availability among them. When one of the databases cannot provide the storage service, for instance because it loses power or suffers a hardware or software failure, the databases that are in a normal state can continue to provide the storage service, so the system as a whole can continue to provide the storage service.
As another example, take a system that has multiple servers configured with ETL tools, with high availability among them. When one of the servers suffers a hardware or software failure, the servers that are in a normal state can continue to work, ensuring that the system can still provide the ETL data processing service.
Typically, most ETL platforms/systems run on a single node or a single server. When that node or server fails, the normal operation of the ETL platform/system is affected.
In view of this, an embodiment of the present application provides an ETL system. The system may include multiple servers configured with ETL tools, and these servers may share one virtual IP address to achieve high availability among the servers. Each of the servers may be connected to multiple databases, and the databases also have high availability among them. In this way, high availability among the servers and among the databases avoids the problem of the entire ETL system being unable to operate when a single server or database fails.
The implementation of the embodiments of the present application is described in detail below with reference to the accompanying drawings.
FIG. 1 is a schematic structural diagram of an ETL system provided by an embodiment of the present application. As shown in FIG. 1, the ETL system may include multiple servers (only two are shown in the figure, server 01 and server 02) and a setting tool. The setting tool is communicatively connected to each of the servers.
The setting tool may, in response to a setting operation, determine the scheduling tasks, scheduling period, and working mode of the multiple servers. The setting tool may also be called a setter, a controller, a control tool, and so on. The setting tool may be one device or multiple devices. For example, as shown in FIG. 1, the setting tool may include a scheduler 10 and a preset tool 20. The scheduler 10 may be communicatively connected to the multiple servers through a virtual IP address. The preset tool 20 may be communicatively connected to the multiple servers through a communication link, which may be wired or wireless.
Specifically, the scheduler 10 may be used to determine, in response to a setting operation, the scheduling tasks and scheduling period of the multiple servers. A scheduling task may refer to the data that a server needs to extract, transform, and integrate. The scheduling period may refer to the time interval at which a server executes scheduling tasks. For example, the scheduler 10 may assign scheduling tasks to the multiple servers at a preset time interval.
In one example, the scheduler 10 may have a corresponding setting interface. The setting interface may, in response to user operations, determine the scheduling tasks and scheduling period of the multiple servers.
For example, FIG. 2 and FIG. 3 show setting interfaces for the scheduling period provided by an embodiment of the present application. The interface shown in FIG. 2 may be used to set the duration of the scheduling period. The interface shown in FIG. 3 may be used to set the running frequency within the scheduling period.
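One way to read these two settings together is sketched below in Python. The application does not specify how the duration and the running frequency combine, so the even spacing of runs within the period, the function name `run_times`, and its parameters are assumptions made purely for illustration.

```python
from datetime import datetime, timedelta


def run_times(period_start, period_length, runs_per_period):
    """Return the start time of each run inside one scheduling period.

    Assumption: runs are spaced evenly across the period, with the first
    run at the start of the period.
    """
    if runs_per_period < 1:
        raise ValueError("runs_per_period must be at least 1")
    step = period_length / runs_per_period
    return [period_start + i * step for i in range(runs_per_period)]
```

Under this reading, a 24-hour period with a frequency of four runs would trigger the scheduling task every six hours.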
The preset tool 20 may be used to determine, in response to a setting operation, the working mode of the multiple servers in each scheduling period. For example, the preset tool 20 may be a hardware device with a setting page. The setting page may, in response to a user's setting operation, determine the working mode of the multiple servers in each scheduling period. For example, the working mode of the multiple servers may be set in the area corresponding to resources. The node name, address, and server port fields may respond to a setting operation to determine which server is in the working state and on which port it receives data. In response to a click on "node list", multiple servers may be configured and their status set to the working state. In response to a click on "add", the information of a new server may be added. In response to a click on "cancel", the information of a server may be deleted. It should be noted that the embodiments of the present application are described using the F5 tool as an example; other tools may of course be used in place of the F5 tool, without limitation.
The working modes of the multiple servers may include an active-active mode and an active-standby mode.
The active-active mode and the active-standby mode are described below with reference to server 01 and server 02 in FIG. 1.
1. Active-active mode
In the active-active mode, multiple servers may execute the scheduling tasks of the ETL system at the same time. When one server has a problem and cannot run, the setting tool may assign the task that server was running to another server that is running normally, so that the latter can continue processing the task. The computing resources of all the servers are thus fully used while the normal operation of the ETL system is ensured.
In one example, as shown in FIG. 1, when server 01 and server 02 are in the active-active mode, both are in the working state; for instance, server 01 executes scheduling task 1 and server 02 executes scheduling task 2. When server 01 has a problem and cannot execute scheduling task 1 normally (for example, because of a power failure or hardware fault), the setting tool may assign scheduling task 1 to server 02 so that server 02 can continue executing scheduling task 1.
2. Active-standby mode
In the active-standby mode, the multiple servers include a primary server and a standby server. The primary server executes the scheduling tasks of the ETL system. The standby server continues executing a scheduling task when the primary server cannot.
In one example, with reference to FIG. 1, take server 01 as the primary server and server 02 as the standby server. When the setting tool detects that server 01 has failed while executing scheduling task 1, it may send indication information to server 02 instructing server 02 to continue executing scheduling task 1. The indication information may include the identification information of scheduling task 1 or the data corresponding to scheduling task 1. When the indication information includes the identification information of scheduling task 1, server 02 may obtain the data corresponding to scheduling task 1 according to that identification information. When the indication information includes the data corresponding to scheduling task 1, server 02 may process that data directly.
As can be seen from the above, with these two working modes, when any one of the multiple servers cannot execute a scheduling task normally, the other servers can continue executing it, which ensures the normal operation of the ETL system.
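The two working modes described above can be sketched in Python as follows. This is an illustrative model only, not the F5 tool or the applicant's implementation: the `Server` class, the round-robin assignment, and the rule of moving tasks to the least-loaded healthy server are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    healthy: bool = True
    tasks: list = field(default_factory=list)


def assign_active_active(servers, tasks):
    """Active-active: every server works; tasks are spread round-robin (assumed policy)."""
    for i, task in enumerate(tasks):
        servers[i % len(servers)].tasks.append(task)


def assign_active_standby(primary, tasks):
    """Active-standby: only the primary receives tasks; the standby stays idle."""
    primary.tasks.extend(tasks)


def fail_over(servers, failed):
    """Move the failed server's tasks to the least-loaded healthy server."""
    failed.healthy = False
    healthy = [s for s in servers if s.healthy]
    if not healthy:
        raise RuntimeError("no healthy server left")
    for task in failed.tasks:
        target = min(healthy, key=lambda s: len(s.tasks))
        target.tasks.append(task)
    failed.tasks = []
```

In this sketch the failover step is the same for both modes; what differs is only whether server 02 was already carrying its own tasks before server 01 failed.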
Each of the multiple servers is configured with an ETL tool, and each of them can execute scheduling tasks.
In one example, the setting page of the preset tool may be as shown in FIG. 4. The setting page may, in response to a developer's setting operations, determine the working mode of the multiple servers and the scheduling tasks they execute.
In a possible embodiment, as shown in FIG. 5, the ETL system may further include multiple databases (only database 21 and database 22 are shown in the figure). The databases are communicatively connected to each of the servers.
Each of the databases may be used to store information such as the servers' job content, configuration information, and scheduling plan. The stored information may be synchronized among the databases. The job content of a server may include the scheduling tasks described above and the data processed by those tasks. The scheduling plan may include the duration of the scheduling period, the scheduling frequency within the scheduling period, and the like.
In one example, the multiple databases may share one scan IP address, and the servers may access any one of the databases through that scan IP address.
It should be noted that, in the embodiments of the present application, both the virtual IP address and the scan IP address are used to access a device. For example, the setting tool may access any one of the multiple servers through the virtual IP address, and there is a correspondence between the virtual IP address and the IP addresses of the servers. A server may access any one of the multiple databases through the scan IP address, and there is a correspondence between the scan IP address and the IP addresses of the databases. Each server may also be configured with its own IP address, as may each database.
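The correspondence between one shared address and several real addresses can be modelled with the short Python sketch below. It is a toy model rather than a real network configuration: the `SharedAddress` class, the first-available selection rule, and the example addresses are hypothetical.

```python
class SharedAddress:
    """Maps one shared (virtual or scan) IP address to the real addresses behind it."""

    def __init__(self, shared_ip, member_ips):
        self.shared_ip = shared_ip
        self.member_ips = list(member_ips)
        self.down = set()  # members currently marked as failed

    def mark_down(self, ip):
        self.down.add(ip)

    def mark_up(self, ip):
        self.down.discard(ip)

    def resolve(self):
        """Return the first member that is not marked down (assumed policy)."""
        for ip in self.member_ips:
            if ip not in self.down:
                return ip
        raise ConnectionError(f"no member reachable behind {self.shared_ip}")
```

A caller only ever needs the shared address; which server or database actually answers can change after a failure without the caller being reconfigured.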
It should be noted that, in the embodiments of the present application and with reference to database 21 and database 22 in FIG. 5, high availability among the multiple databases may mean the following: when database 21 suffers a hardware or software failure, database 22 can continue to provide the storage service; or, when database 22 suffers a hardware or software failure, database 21 can continue to provide the storage service.
In yet another possible embodiment, as shown in FIG. 6, the ETL system provided by the embodiments of the present application may also be communicatively connected to a distributed computing (Hadoop) system. The Hadoop system may include multiple databases, for example a Hive database and a distributed (HBase) database.
The Hive database is communicatively connected to multiple source systems and to the ETL system. A source system may be used to provide raw data, for example, product history information and defect code information.
In one example, the multiple source systems may include a yield management system (YMS) and a management data warehouse (MDW) system. The YMS may be used to provide the history information of a product. The MDW system may be used to provide the defect code information detected for a product.
The history information of a product may refer to the basic product information that factory equipment uploads directly to the YMS. For example, it may include one or more of: factory (FACTORY), product lot number (LOT_ID), product (for example, glass) identifier (GLS_ID), event time key (EVENT_TIMEKEY), product type (PRODUCT_TYPE), previous process site (OLD_OPER_CODE), product model (PRODUCT_ID), equipment identifier (EQP_ID), unit identifier (UNIT_ID), subunit identifier (SUB_UNIT_ID), and previous process input time (LAST_PROCESS_IN_TIME).
In one example, the history information of a product may be stored in the YMS in the form of an Oracle table or in the form of an array, without limitation. For example, taking glass as the product, the history information may be as shown in Table 1.
Table 1
Table field             Description
FACTORY                 Factory
LOT_ID                  Lot ID
GLS_ID                  Glass ID
EVENT_TIMEKEY           Event time
PRODUCT_TYPE            Product type
OLD_OPER_CODE           Previous process site
PRODUCT_ID              Product model
EQP_ID                  Equipment name
UNIT_ID                 Unit name
SUB_UNIT_ID             Subunit name
LAST_PROCESS_IN_TIME    Previous process input time
EVENT_TIME              Event time
It should be noted that the table fields in Table 1 are only examples. Other fields may also be included, such as the size and thickness of the product, without limitation.
The defect code information of a product may describe problems that occurred during production of the product. For example, it may include one or more of: factory (FACTORY), site (STEP), product lot number (LOT_ID), product identifier (GLS_ID), previous process site (PRODUCT_ID), product type (PRODUCT_TYPE), product size (PRODUCT_SIZE), product model (NEW_MODEL), code grade (DEFECT_GRADE), defect code (DEFECT_CODE), and code detection time (TXN_TIME).
In one example, the defect code information of a product may be stored in the MDW system in the form of a table or in the form of an array, without limitation. For example, taking glass as the product, the defect code information may be as shown in Table 2.
Table 2
Table field      Description
FACTORY          Factory
STEP             Site
LOT_ID           Lot ID
GLS_ID           Glass ID
PRODUCT_ID       Previous process site
PRODUCT_TYPE     Product type
PRODUCT_SIZE     Product size
NEW_MODEL        Product model 2
DEFECT_GRADE     Code grade
DEFECT_CODE      Defect code
TXN_TIME         Code detection time
It should be noted that the table fields in Table 2 are only examples. Other table fields may also be included, such as the identification information of the detection equipment, without limitation.
It should be pointed out that Table 1 and Table 2 above contain information corresponding to the same product.
With reference to Table 1 and Table 2, the Hive database may store synchronized Hive history information. The Hive history information may include a partition field and fields extracted from Table 1 and Table 2. The partition field may identify the synchronized Hive history information.
In one example, the Hive database may include multiple storage areas, and the synchronized Hive history information may be stored in the corresponding storage area in Parquet format. The Hive history information in different storage areas may have different partition fields. For example, the partition field may be the time field (TIMEDAY) corresponding to the Hive history information.
For example, when the partition field in the synchronized Hive history information is a time field, then with reference to Table 1 and Table 2 the fields of the synchronized Hive history information may also include one or more of: factory (FACTORY), product lot number (LOT_ID), product identifier (GLS_ID), event time key (EVENT_TIMEKEY), product type (PRODUCT_TYPE), previous process site (OLD_OPER_CODE), product model (PRODUCT_ID), equipment identifier (EQP_ID), unit identifier (UNIT_ID), subunit identifier (SUB_UNIT_ID), and previous process input time (LAST_PROCESS_IN_TIME). The synchronized Hive history information may of course include other fields, without limitation. For example, it may be as shown in Table 3.
Table 3
Table field             Description
FACTORY                 Factory
LOT_ID                  Lot ID
GLS_ID                  Glass ID
EVENT_TIMEKEY           Event time
PRODUCT_TYPE            Product type
OLD_OPER_CODE           Previous process site
PRODUCT_ID              Product model
EQP_ID                  Equipment name
UNIT_ID                 Unit name
SUB_UNIT_ID             Subunit name
LAST_PROCESS_IN_TIME    Previous process input time
EVENT_TIME              Event time
TIMEDAY                 Partition field
It should be noted that the fields in Table 3 other than the partition field are obtained by synchronizing the fields in Table 1 and Table 2.
The HBase database may be used to store the data that the ETL system has finished processing. For example, the ETL system may obtain the synchronized Hive history information from the Hive database, process it, and store the processed data in the HBase database for use by a back-end system.
For example, the process by which the ETL system processes the synchronized Hive history information may include:
1. Read the synchronized Hive history information, and screen and filter it to obtain the processed data.
Screening and filtering the synchronized Hive history information may mean filtering according to the site information and time information of the product, and determining information such as the residence time of each product within the same site and its equipment type.
Specifically, the residence time of a product within the same site (PROCESS_TIME) = the event time of the product (EVENT_TIME) - the input time of the previous process (LAST_LOGGED_IN_TIME). The equipment type of the product (represented by the field STEP_ID) consists of some of the characters of the equipment name (EQP_ID), for example the first five characters. For instance, if the EQP_ID of a product is AAEWS07, the corresponding STEP_ID may be AAEWS.
2. Write the processed data into the HBase database in a preset format.
The preset format may be set as required. For example, it may be a format that is convenient for the back-end system.
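The derivation in step 1 above can be sketched in Python as follows. Several details are assumptions rather than statements from the application: timestamps are taken to be ISO 8601 strings, the residence time is expressed in seconds, and the start-time field defaults to `LAST_PROCESS_IN_TIME` as it is named in Table 3 (the formula in the text calls it `LAST_LOGGED_IN_TIME`), so the field name is left as a parameter.

```python
from datetime import datetime


def derive_fields(record, start_field="LAST_PROCESS_IN_TIME", step_id_length=5):
    """Return a copy of the record with PROCESS_TIME and STEP_ID added.

    PROCESS_TIME = EVENT_TIME - previous process input time (in seconds, assumed unit).
    STEP_ID      = leading characters of EQP_ID (first five in the text's example).
    """
    event_time = datetime.fromisoformat(record["EVENT_TIME"])
    input_time = datetime.fromisoformat(record[start_field])
    derived = dict(record)
    derived["PROCESS_TIME"] = (event_time - input_time).total_seconds()
    derived["STEP_ID"] = record["EQP_ID"][:step_id_length]
    return derived
```

With the example from the text, a record whose `EQP_ID` is `AAEWS07` yields a `STEP_ID` of `AAEWS`.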
In one example, the HBase database stores the processed data in table form; other forms of storage are also possible, without limitation. For example, the processed data may be as shown in Table 4.
Table 4
Figure PCTCN2022076973-appb-000001
Figure PCTCN2022076973-appb-000002
It should be noted that, in Table 4, the MD5 field holds the first three characters of the value obtained by encrypting (hashing) the LOT_ID field with the md5 function. For the MD5 function and how it is applied, reference may be made to the prior art; details are not repeated here. SEQ_ID is the concatenation of multiple fields, which may include TRKG, OLD_OPER_CODE, and STEP_ID, where TRKG denotes history information.
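The two derived fields of Table 4 can be sketched in Python as follows. The text does not say how the digest is encoded, which character encoding is used for LOT_ID, or how the SEQ_ID fields are ordered and separated; the hexadecimal digest, UTF-8 encoding, field order, and "_" separator below are therefore assumptions for illustration only.

```python
import hashlib


def md5_prefix(lot_id: str, length: int = 3) -> str:
    """First `length` characters of the MD5 hex digest of LOT_ID (hex and UTF-8 are assumed)."""
    return hashlib.md5(lot_id.encode("utf-8")).hexdigest()[:length]


def seq_id(trkg: str, old_oper_code: str, step_id: str, sep: str = "_") -> str:
    """Concatenate TRKG, OLD_OPER_CODE and STEP_ID (order and separator are assumed)."""
    return sep.join([trkg, old_oper_code, step_id])


# Hypothetical values for illustration.
row = {
    "MD5": md5_prefix("LOT0001"),
    "SEQ_ID": seq_id("TRKG", "1000", "AAEWS"),
}
```

Because MD5 is deterministic, the same LOT_ID always yields the same three-character prefix.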
Based on the ETL system of FIG. 1, FIG. 5, or FIG. 6, embodiments of this application provide a method for constructing an ETL system and a data processing method applied to that ETL system (the data processing method for short). The two methods are described in turn below.
I. Method for constructing the ETL system.
FIG. 7 shows a method for constructing an ETL system provided by an embodiment of this application. The method may include:
S701. Configure multiple servers so that the configured servers have an ETL function.
The multiple servers may be server 01 and server 02 in FIG. 1 or FIG. 2. Configuring the multiple servers may mean installing a preset configuration file on each of them, after which the servers have the ETL function; for example, the configuration file may be Pentaho Server.
Further, to ensure that each server has its own corresponding identification information, the setting tool may also, in response to a setting operation, set a cluster configuration for the multiple servers so that they share the same cluster ID. The multiple servers can then be managed through that cluster ID.
Each server may also be assigned its own ID. For example, the ID of server 01 may be node1, and the ID of server 02 may be node2.
S702. Configure a scan IP for the multiple servers so that the servers can access any of the multiple databases through that scan IP address.
The scan IP may be used to access the multiple databases.
In one possible implementation, each server may store the scan IP in response to a setting operation by a developer.
S703. Use a preset tool to set a virtual IP address for the multiple servers and to set their working mode.
The virtual IP address may be used to establish communication connections with the multiple servers; other devices (such as a scheduler) can access any of the servers through it. The virtual IP address has a correspondence with the IP address of each server. For the working modes of the multiple servers, refer to the description above; details are not repeated here.
In one possible implementation, when the preset tool has an interface as shown in FIG. 4, the working mode of the multiple servers may be determined in response to a setting operation on that interface.
With the technical solution of this embodiment, configuring the multiple servers gives them the ETL function. In addition, because the multiple servers are configured with a scan IP through which they can access any of the multiple databases, when one database fails the other databases continue to work normally, which provides high availability among the databases. Setting a virtual IP address and a working mode for the multiple servers through the preset tool allows any server to be reached through the virtual IP address; that is, high availability among the servers keeps the ETL platform running normally.
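The result of steps S701 to S703 can be pictured as a small configuration model. The sketch below is not a Pentaho or F5 configuration format; the class names, the cluster name, and the addresses (taken from the 192.0.2.0/24 documentation range) are hypothetical and serve only to show which settings the three steps produce and how they relate.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class WorkMode(Enum):
    ACTIVE_ACTIVE = "active-active"
    ACTIVE_STANDBY = "active-standby"


@dataclass
class ServerNode:
    node_id: str  # per-server ID, e.g. "node1"
    ip: str       # the server's own IP address


@dataclass
class EtlCluster:
    cluster_id: str   # shared by all servers (S701)
    scan_ip: str      # address used to reach any database (S702)
    virtual_ip: str   # address other devices use to reach the servers (S703)
    mode: WorkMode    # working mode chosen in the preset tool (S703)
    nodes: List[ServerNode] = field(default_factory=list)

    def validate(self) -> None:
        ids = [n.node_id for n in self.nodes]
        if len(set(ids)) != len(ids):
            raise ValueError("each server needs its own node ID")
        if len(self.nodes) < 2:
            raise ValueError("high availability needs at least two servers")

    def backends(self) -> Dict[str, List[str]]:
        """Correspondence between the virtual IP and each server's IP."""
        return {self.virtual_ip: [n.ip for n in self.nodes]}


# Hypothetical values for illustration.
cluster = EtlCluster(
    cluster_id="etl-cluster",
    scan_ip="192.0.2.10",
    virtual_ip="192.0.2.20",
    mode=WorkMode.ACTIVE_ACTIVE,
    nodes=[ServerNode("node1", "192.0.2.28"), ServerNode("node2", "192.0.2.27")],
)
cluster.validate()
```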
II. Data processing method.
FIG. 8 shows a data processing method provided by an embodiment of this application. The method is applied to the setting tool in the ETL system of FIG. 1 or FIG. 5, or to a component of the setting tool such as a scheduler, and includes:
S801. Obtain the first data to be processed for a first scheduling period.
The first data to be processed may be the synchronized Hive history information.
In one possible implementation, the setting tool may obtain the first data to be processed for the first scheduling period from the Hive database.
It should be noted that, in embodiments of this application, a server configured with the ETL tool may send, through the setting tool, indication information to the Hive database instructing it to associate information from multiple source systems. After receiving the indication information, the Hive database may obtain the product history information and bad CODE information for the first scheduling period from the multiple source systems and integrate them to obtain the first data to be processed.
S802. In response to a first setting operation, determine the working mode of the multiple servers in the first scheduling period.
The first setting operation may be an operation on the setting page of the preset tool. For the working modes of the multiple servers, refer to the descriptions of the active-active mode and the active-standby mode above; details are not repeated here.
S803. Send first indication information to a first server. Correspondingly, the first server receives the first indication information.
The first indication information may be used to instruct the first server to process the first data to be processed. The first server is a server, among the multiple servers, that is in a working state during the first scheduling period.
With the technical solution of FIG. 8, the setting tool obtains the data to be processed for the first scheduling period, determines the working mode of the multiple servers, and sends indication information for processing the data to a server that is in a working state, so that the server can perform ETL processing on the data.
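Steps S802 and S803 can be sketched as choosing a working server for the current mode and building the indication sent to it. The dictionary layout, the role names, the message fields, and the rule of taking the first healthy server in active-active mode are assumptions for this example; the text only requires that the first server be one that is in a working state.

```python
from typing import Dict, List, Optional


def pick_working_server(servers: List[Dict], mode: str) -> Optional[str]:
    """Return the ID of a server that is in a working state for the given mode."""
    healthy = [s for s in servers if s["healthy"]]
    if mode == "active-active":
        # All servers work; any healthy one can take the task.
        return healthy[0]["id"] if healthy else None
    if mode == "active-standby":
        # Prefer the primary; fall back to the standby if the primary is down.
        for role in ("primary", "standby"):
            for s in healthy:
                if s["role"] == role:
                    return s["id"]
        return None
    raise ValueError(f"unknown working mode: {mode}")


def build_first_indication(server_id: str, period: str, data_ref: str) -> Dict[str, str]:
    """First indication information: which server processes which data (fields are assumed)."""
    return {"target": server_id, "period": period, "data_ref": data_ref}


# Hypothetical cluster state: the primary is down, the standby is healthy.
servers = [
    {"id": "node1", "role": "primary", "healthy": False},
    {"id": "node2", "role": "standby", "healthy": True},
]
target = pick_working_server(servers, "active-standby")
message = build_first_indication(target, "20220218", "hive_history/20220218")
```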
In one possible embodiment, as shown in FIG. 9, the method may further include S901 and S902.
S901. The first server obtains first scheduling data according to the first indication information.
In one possible implementation, the first indication information may include the first data to be processed, so the first server can obtain the first data to be processed directly from the first indication information.
In another possible implementation, the first indication information may include an identifier or a storage address of the first data to be processed, which the first server uses to obtain the first data to be processed.
S902. The first server processes the first scheduling data with the ETL tool to obtain processed first scheduling data.
The processed first scheduling data may be the data obtained by processing the synchronized Hive history information. For details, refer to the description above; they are not repeated here.
In one possible embodiment, as shown in FIG. 10, the method may further include:
S1001. Detect the working states of the multiple databases.
Detecting the working states of the multiple databases may mean detecting whether each database can work normally.
In one possible implementation, the working states of the multiple databases may be detected by exchanging information with them. For example, messages may be sent to the databases periodically or at random; if a response is received from a database, that database is determined to be working normally, and if no response is received from a database, that database is determined to be abnormal.
S1002. When it is detected that a first database cannot work, store the data stored in the first database to a second database.
The second database may be a database, among the multiple databases, that has not failed. The first database and the second database are different databases among the multiple databases.
With this embodiment, high availability among the multiple databases keeps the databases working normally.
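The detection in S1001 and the choice of a second database in S1002 can be sketched as below. The `probe` callable stands in for whatever request/response exchange is actually used with each database; it, the database names, and the rule of picking the first healthy database are assumptions for this example.

```python
from typing import Callable, Dict, Iterable, Optional


def check_databases(names: Iterable[str], probe: Callable[[str], bool]) -> Dict[str, bool]:
    """Probe each database; no response (False or an exception) means abnormal."""
    status = {}
    for name in names:
        try:
            status[name] = bool(probe(name))
        except Exception:
            status[name] = False
    return status


def failover_target(status: Dict[str, bool], failed: str) -> Optional[str]:
    """Pick a second database that has not failed and differs from the failed one."""
    for name, ok in status.items():
        if ok and name != failed:
            return name
    return None


def fake_probe(name: str) -> bool:
    # Hypothetical probe: pretend that "db1" does not answer.
    if name == "db1":
        raise TimeoutError("no response")
    return True


status = check_databases(["db1", "db2"], fake_probe)
second = failover_target(status, failed="db1")
```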
In one possible embodiment, as shown in FIG. 11, the method may further include:
S1101. Receive source data for the first scheduling period from multiple systems.
S1102. Merge the multiple pieces of source data to obtain the first data to be processed, and store the first data to be processed in a preset storage area of a preset database.
The preset database may be used to store data to be processed; for example, it may be the Hive database in FIG. 6. The preset database may include multiple storage areas, which may be used to store the data to be processed for different scheduling periods. The data to be processed for each scheduling period may have a unique identifier, for example a partition field.
In S801 above, obtaining the first data to be processed for the first scheduling period may specifically include: extracting the first data to be processed from the preset storage area of the first database.
With this embodiment, data from multiple source databases can be integrated into data to be processed in a unified format, so that the server can process it.
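The idea of one storage area per scheduling period, identified by a partition field, can be sketched with an in-memory store. The `YYYYMMDD` partition key, the record fields, and the function names are assumptions for this example; in the described system the storage areas live in the Hive database rather than in a Python dictionary.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, List


def partition_key(period_start: date) -> str:
    """Unique identifier of a scheduling period's storage area (format is assumed)."""
    return period_start.strftime("%Y%m%d")


def merge_and_store(store: Dict[str, List[dict]], period_start: date,
                    sources: Iterable[List[dict]]) -> str:
    """S1102: merge the source data and store it under the period's partition."""
    key = partition_key(period_start)
    for records in sources:
        store[key].extend(records)
    return key


def extract(store: Dict[str, List[dict]], period_start: date) -> List[dict]:
    """S801: extract the data to be processed for one scheduling period."""
    return list(store.get(partition_key(period_start), []))


store: Dict[str, List[dict]] = defaultdict(list)
history = [{"LOT_ID": "L1", "EQP_ID": "AAEWS07"}]  # hypothetical source system 1
bad_code = [{"LOT_ID": "L1", "CODE": "BLACK_SPOT"}]  # hypothetical source system 2
key = merge_and_store(store, date(2022, 2, 18), [history, bad_code])
pending = extract(store, date(2022, 2, 18))
```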
In one possible embodiment, as shown in FIG. 12, the method may further include:
S1201. Detect the result of the first server's processing of the first data to be processed.
S1202. When the processing result is failure and the first server is in a working state, send second indication information to the first server.
The second indication information instructs the first server to continue processing the first data to be processed.
S1203. When the processing result is failure and the first server cannot process data, send third indication information to a second server.
The third indication information instructs the second server to continue processing the first data to be processed.
With this implementation, the data to be processed is handled in a timely manner and no data is missed.
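The branching in S1201 to S1203 reduces to a small decision function. The result strings, server names, and the tuple returned below are assumptions for this example; the logic itself follows the three steps above.

```python
from typing import Optional, Tuple


def next_indication(result: str, first_server_working: bool,
                    first: str = "server01", second: str = "server02") -> Optional[Tuple[str, str]]:
    """Decide which indication to send after checking the processing result.

    Returns (indication, target server), or None when nothing needs to be resent.
    """
    if result == "success":
        return None
    if first_server_working:
        # S1202: the first server is still working, so it continues with the data.
        return ("second indication", first)
    # S1203: the first server cannot process data, so the second server takes over.
    return ("third indication", second)


retry = next_indication("failure", first_server_working=True)
takeover = next_indication("failure", first_server_working=False)
```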
In one possible embodiment, when the processing result is success, as shown in FIG. 12, the method may further include:
S1204. Obtain second data to be processed for a second scheduling period.
S1205. Send fourth indication information to the first server.
The fourth indication information instructs the first server to process the second data to be processed.
Further, before S1205, the method may also include:
In response to a second setting operation, determine the working mode of the multiple servers in the second scheduling period.
The working mode of the multiple servers in the second scheduling period may be the same as or different from that in the first scheduling period. For example, if the working mode in the first scheduling period is the active-active mode, the working mode in the second scheduling period may be the active-standby mode or the active-active mode, without limitation.
With this embodiment, the ETL system can continue to process data, which ensures its periodic operation.
In one possible embodiment, the method may further include:
In response to a third setting operation, determine the duration and scheduling frequency of the scheduling periods of the multiple servers.
The duration and scheduling frequency may be those of the first scheduling period or those of the second scheduling period, and may be set as required. The durations and scheduling frequencies of different scheduling periods may be the same or different, without limitation.
With this embodiment, the working mode of the multiple servers and the duration and frequency of the scheduling periods can be adjusted manually, which is flexible and convenient.
The data processing method provided by embodiments of this application is described below, taking as an example the case where the data processed by the ETL system is product history information and the preset tool is the F5 tool.
I. Obtain the scheduling task and its working frequency.
The scheduling task is related to a user requirement. The user requirement may be to determine why a bad CODE occurs for a product. The scheduling task may be to perform ETL processing on the product's history information and CODE information. For example, the scheduling task may be to obtain the synchronized Hive history table from the Hive database according to the scheduling period and to process the data in it.
The synchronized Hive history table may be the history table obtained after the glass history information and the bad CODE information are synchronized. For the fields of the synchronized Hive history table, refer to Table 3 above; details are not repeated here.
In one example, taking glass as the product, each glass sheet may develop a bad CODE during production, for example an in-cell scratch, foreign-matter Gap, abnormal lighting, or black-spot CODE. The user requirement may be to determine the cause of a bad CODE on the glass. The scheduling task may be to perform ETL processing on the history information and bad CODE information of each glass sheet.
II. Establish data synchronization and data processing jobs through the ETL system.
In one example, as shown in FIG. 13, the process may include: start → obtain scheduling time → data synchronization (data sync) → data processing → update scheduling time → processing succeeded. The process is described in detail below:
1. Obtain the scheduling time.
Obtaining the scheduling time may mean determining the duration of the first scheduling period.
The duration of the first scheduling period may be set as required. For example, at a granularity of one day, the first scheduling period may run from 0:00 to 24:00, or from 6:00 on the current day to 6:00 on the next day; other durations are of course possible, without limitation.
2. Data synchronization.
In one example, the data synchronization process may be as shown in FIG. 14. In FIG. 14, table input refers to obtaining the source data from the Oracle database, and table output may refer to the Parquet output component writing the data to the storage area of the Hive database.
3. Data processing.
Data processing may include: filtering the obtained GLASS history information table by site information and time information, and computing, for each GLASS, information such as its residence time within a site and its equipment, to obtain the processed data; and writing the processed data to the HBase database.
The residence time of a GLASS within a site (PROCESS_TIME) = the GLASS's event time (EVENT_TIME) - the input time of the previous process (LAST_LOGGED_IN_TIME).
For the processed data, refer to Table 4 above.
In one example, to improve data conversion efficiency, the synchronized Hive history information may be converted in a multi-process manner, as shown in FIG. 15. In FIG. 15, the branch for table input 1 may convert the equipment-related information in the synchronized Hive history information, and the branch for table input 2 may convert the information related to the equipment's subunits.
In FIG. 15, table input 1 may refer to the equipment-related information (such as EQP_ID) in the synchronized Hive history information. Row-to-column 1 may refer to converting each row of equipment-specific history information into columns; the purpose of the row-to-column conversion is to put the data into the input format required by the HBase database. Filter records may refer to filtering out null values (NULL) in the data. Table input 2 may refer to the information related to the equipment's subunits (such as UNIT_ID) in the synchronized Hive history information. Table output may refer to storing (writing) Table 4 into the HBase database.
Processing multiple branches simultaneously in this way can improve the data processing efficiency of the ETL platform.
4. Update the scheduling time.
Updating the scheduling time may mean determining the scheduling period of the next scheduling task.
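The "filter records" and "row-to-column" steps of FIG. 15 described above can be sketched in plain Python. This is not the Pentaho transformation itself; the GLASS_ID field, the sample values, and the choice of PROCESS_TIME as the pivoted value are assumptions for this example.

```python
from typing import Dict, Iterable, List


def filter_records(rows: Iterable[dict], required: List[str]) -> List[dict]:
    """Drop rows in which any required field is NULL (None)."""
    return [r for r in rows if all(r.get(k) is not None for k in required)]


def rows_to_columns(rows: Iterable[dict], key_field: str,
                    name_field: str, value_field: str) -> Dict[str, Dict[str, object]]:
    """Pivot long rows into one wide record per key: {key: {name: value}}."""
    wide: Dict[str, Dict[str, object]] = {}
    for r in rows:
        wide.setdefault(r[key_field], {})[r[name_field]] = r[value_field]
    return wide


# Hypothetical long-format history rows (PROCESS_TIME in minutes).
rows = [
    {"GLASS_ID": "G1", "STEP_ID": "AAEWS", "PROCESS_TIME": 30},
    {"GLASS_ID": "G1", "STEP_ID": "BBXYZ", "PROCESS_TIME": 12},
    {"GLASS_ID": "G2", "STEP_ID": "AAEWS", "PROCESS_TIME": None},
]
clean = filter_records(rows, ["PROCESS_TIME"])
wide = rows_to_columns(clean, "GLASS_ID", "STEP_ID", "PROCESS_TIME")
```

Each key of `wide` then corresponds to one row, with one column per step, which is the shape a wide-column store expects.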
III. Log in to the user console and set the duration of the scheduling period.
The user console may be used to set the scheduling periods of the multiple servers.
In one example, before the pages shown in FIG. 2 and FIG. 3 are displayed, the developer may open the login interface shown in FIG. 16 and enter a user name and password. If the entered user name and its password are correct, the pages shown in FIG. 2 and FIG. 3 may be displayed.
The user name and password being correct may mean that the entered user name and password are identical to the stored user name and its password.
It should be noted that, in embodiments of this application, once the multiple servers have been configured, a client (such as a computer) can establish communication connections with them. The client may be provided with the user console, which may be an application or a page for controlling the multiple servers. In response to a login operation, the client may display the login interface shown in FIG. 16.
IV. In response to a setting operation on the page of the F5 tool, determine the working mode of the multiple servers.
The page of the F5 tool may also be displayed by the client; that is, the client is communicatively connected to the F5 tool. For example, the client may be configured with an application or web page for controlling the F5 tool, through which the F5 tool can be controlled.
With this embodiment, after the scheduling task is determined, data synchronization and data processing jobs are established through the ETL system, and after data processing completes the scheduling time is updated in preparation for the next scheduling task. This ensures the normal operation of the ETL system.
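Obtaining the scheduling time and updating it for the next task can be sketched as computing a day-long window and then shifting it. The function names are assumptions; the two window examples (0:00 to 24:00, and 6:00 to 6:00 on the next day) are the ones given in the description of the first scheduling period.

```python
from datetime import date, datetime, time, timedelta
from typing import Tuple

Window = Tuple[datetime, datetime]


def scheduling_window(day: date, start_of_day: time = time(0, 0)) -> Window:
    """One-day scheduling period starting at `start_of_day` on `day`."""
    start = datetime.combine(day, start_of_day)
    return start, start + timedelta(days=1)


def next_window(window: Window) -> Window:
    """Update the scheduling time: the next period starts where this one ends."""
    start, end = window
    return end, end + (end - start)


midnight = scheduling_window(date(2022, 2, 18))              # 0:00 to 24:00
six_to_six = scheduling_window(date(2022, 2, 18), time(6, 0))  # 6:00 to 6:00 next day
following = next_window(six_to_six)
```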
In one possible embodiment, taking two servers (server 01 and server 02) as an example, embodiments of this application also provide a simulation of data processing after a server goes down. The simulation may include:
1. Create a test job (job_data_sync) and set its scheduling frequency to once per minute.
FIG. 17 shows the running status of server 01 (IP address XX.XX.XX.28), and FIG. 18 shows the running status of server 02 (IP address XX.XX.XX.27).
As FIG. 17 and FIG. 18 show, both servers run normally.
2. In response to an operation to stop server 02, the F5 tool controls server 02 to stop running and controls server 01 to continue running.
When server 02 stops running, the running status of server 01 may be as shown in FIG. 19.
As FIG. 19 shows, when server 01 and server 02 are highly available, the F5 tool can adjust the load of the servers. Both server 01 and server 02 can perform data processing. When one server goes down, the ETL job runs normally on the other server, which meets the ETL high-availability requirement of an actual production environment.
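The simulated outage can be mimicked with a toy dispatcher that stands in for the virtual IP address. This is not the F5 tool's API or its load-balancing algorithm; the class, the round-robin rule, and the server names are assumptions chosen only to show jobs continuing on the remaining server after one stops.

```python
from typing import Dict, List


class VirtualIpDispatcher:
    """Toy stand-in for a virtual IP that forwards each job to a running server."""

    def __init__(self, servers: List[str]) -> None:
        self._up: Dict[str, bool] = {name: True for name in servers}
        self._counter = 0

    def stop(self, name: str) -> None:
        """Simulate a server going down."""
        self._up[name] = False

    def dispatch(self) -> str:
        """Pick a running server in round-robin order (an assumed policy)."""
        alive = [name for name, up in self._up.items() if up]
        if not alive:
            raise RuntimeError("no server available")
        target = alive[self._counter % len(alive)]
        self._counter += 1
        return target


dispatcher = VirtualIpDispatcher(["server01", "server02"])
before = [dispatcher.dispatch() for _ in range(4)]  # one job per minute, both servers up
dispatcher.stop("server02")
after = [dispatcher.dispatch() for _ in range(4)]   # server02 stopped
```

After `stop("server02")`, every job in `after` lands on `server01`, mirroring the behavior reported for FIG. 19.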
It should be pointed out that the embodiments of this application may draw on or refer to one another. For example, identical or similar steps, and the method, system, and apparatus embodiments, may all refer to one another, without limitation.
In embodiments of this application, the apparatus for constructing the ETL system may be divided into functional modules or functional units according to the method examples above. For example, each function may be assigned its own functional module or unit, or two or more functions may be integrated into one processing module. The integrated module may be implemented in hardware or as a software functional module or unit. The division into modules or units in embodiments of this application is illustrative and is merely a division by logical function; other divisions are possible in actual implementations.
FIG. 20 is a schematic structural diagram of an apparatus for constructing an ETL system provided by an embodiment of this application. The apparatus includes a configuration unit 201 and a processing unit 202.
The configuration unit 201 is configured to: configure multiple servers so that the configured servers have an ETL function and each has a corresponding identifier; and configure a scan IP address for the multiple servers so that they can access any of multiple databases through that scan IP address, the multiple databases having high availability. The processing unit 202 is configured to: use a preset tool to set a virtual IP address for the multiple servers and to set their working mode, where the virtual IP address is used to establish communication connections with the multiple servers and the working modes include an active-active mode and an active-standby mode.
In some embodiments, the multiple databases having high availability includes: when a first database cannot work, a second database continues to work, the first database and the second database being different databases among the multiple databases.
In some embodiments, in the active-active mode, all of the multiple servers are in a working state; when a first server cannot process data, a second server continues to process the data, the first server and the second server being different servers among the multiple servers.
In some embodiments, in the active-standby mode, the multiple servers include a primary server and a standby server; the primary server processes data, and the standby server continues to process the data when the primary server cannot.
如图21所示,为本申请实施例提供的一种数据处理装置的结构示意图,应用于ETL系统,该ETL系统包括多个服务器,多个服务器均配置有ETL工具,ETL工具用于处理数据,该装置包括:获取单元211、确定单元212、发送单元213。As shown in Figure 21, it is a schematic structural diagram of a data processing device provided in the embodiment of the present application, which is applied to an ETL system. The ETL system includes multiple servers, and the multiple servers are all equipped with ETL tools. The ETL tools are used to process data. , the apparatus includes: an acquiring unit 211 , a determining unit 212 , and a sending unit 213 .
其中,获取单元210,被配置为获取第一调度周期的第一待处理数据。确定单元212,被配置为响应于第一设置操作,确定在第一调度周期内多个服务器的工作模式;其中,工作模式包括双活模式、主备模式。在双活模式下,多个服务器均处于工作状态;在主备模式下,多个服务器包括主服务器和备服务器,主服务器处于工作状态,备服务器处于休眠状态;当主服务器无法处理数据的情况下,备服务器从休眠状态转为工作状态。发送单元213,被配置为向第一服务器发送第一指示信息,第一指示信息用于指示第一服务器处理第一待处理数据,第一服务器为第一调度周期内处于工作状态的服务器。Wherein, the obtaining unit 210 is configured to obtain the first data to be processed in the first scheduling period. The determining unit 212 is configured to determine the working modes of the multiple servers in the first scheduling period in response to the first setting operation; wherein the working modes include a dual-active mode and an active-standby mode. In the active-active mode, multiple servers are in working state; in the active-standby mode, multiple servers include the primary server and the standby server, the primary server is in the working state, and the standby server is in the dormant state; when the primary server cannot process data , the standby server changes from the dormant state to the working state. The sending unit 213 is configured to send first indication information to the first server, where the first indication information is used to instruct the first server to process the first data to be processed, and the first server is a server in a working state within the first scheduling period.
在一些实施例中,ETL系统还包括多个数据库,多个服务器分别与多个数据库通信连接,多个数据库之间通信连接,多个数据库用于存储服务器的配置信息、数据处理进程、调度任务中一个或多个。如图21所示,该装置还包括检测单元214,被配置为:检测多个数据库的工作状态;当检测到第一数 据库无法工作时,将第一数据库中存储的数据存储至第二数据库,第一数据库与第二数据库为所述多个数据库中不同的数据库。In some embodiments, the ETL system also includes a plurality of databases, and the plurality of servers are respectively connected to the plurality of databases in communication, and the plurality of databases are connected in communication, and the plurality of databases are used to store configuration information, data processing processes, and scheduling tasks of the servers. one or more of. As shown in Figure 21, the device also includes a detection unit 214 configured to: detect the working status of multiple databases; when it is detected that the first database cannot work, store the data stored in the first database to the second database, The first database and the second database are different databases among the plurality of databases.
In some embodiments, as shown in FIG. 21, the apparatus further includes a receiving unit 215 and a processing unit 216. The receiving unit 215 is configured to receive source data of the first scheduling period from a plurality of systems. The processing unit 216 is configured to merge the plurality of pieces of source data to obtain the first to-be-processed data, and to store the first to-be-processed data in a preset storage area of a first database. The first database includes a plurality of storage areas, which store the to-be-processed data of different scheduling periods. The acquiring unit 211 is specifically configured to extract the first to-be-processed data from the preset storage area of the Hive database.
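As a rough illustration of merging source data and keeping one storage area per scheduling period, consider the sketch below. `StagingStore` and `merge_sources` are hypothetical names, the store is an in-memory stand-in rather than a Hive client, and tagging each row with its source system is an assumption made for the example.

```python
from collections import defaultdict


class StagingStore:
    """Hypothetical stand-in for the first database: one storage area per scheduling period."""

    def __init__(self):
        self._areas = defaultdict(list)

    def store(self, period, rows):
        # Append the rows to the storage area that belongs to this scheduling period.
        self._areas[period].extend(rows)

    def extract(self, period):
        # Return a copy so callers cannot mutate the stored data.
        return list(self._areas[period])


def merge_sources(source_batches):
    """Concatenate the batches received from several systems, tagging each row with its origin."""
    merged = []
    for system, rows in source_batches.items():
        for row in rows:
            merged.append({"source": system, **row})
    return merged
```

Keying the storage areas by scheduling period means that extracting the data of one period never touches the data of another period.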
In some embodiments, the detection unit 214 is further configured to detect the result of the first server's processing of the first to-be-processed data. When the processing result is a failure and the first server is in the working state, the detection unit 214 controls the sending unit 213 to send second indication information to the first server; the second indication information instructs the first server to continue processing the first to-be-processed data. When the processing result is a failure and the first server cannot process data, the detection unit 214 controls the sending unit 213 to send third indication information to a second server; the third indication information instructs the second server to continue processing the first to-be-processed data, and the second server is a server in the working state among the plurality of servers.
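The two failure branches above amount to a small decision function. The sketch below is illustrative only; the function name, the string labels, and the argument layout are assumptions rather than part of this application.

```python
def next_instruction(result_ok, first_server_working, first_server, other_working_servers):
    """Decide which indication information to send after a processing attempt.

    Returns None when processing succeeded, otherwise a (kind, target) tuple.
    """
    if result_ok:
        # Nothing to resend for this scheduling period.
        return None
    if first_server_working:
        # Failure, but the first server can still work: it continues processing.
        return ("second", first_server)
    if not other_working_servers:
        raise RuntimeError("no working server can take over the data")
    # Failure and the first server cannot process data: hand over to a second server.
    return ("third", other_working_servers[0])
```

For example, `next_instruction(False, False, "etl-1", ["etl-2"])` selects the third indication information with `"etl-2"` as its target.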
In some embodiments, when the processing result is a success, the acquiring unit 211 is further configured to acquire second to-be-processed data of a second scheduling period, and the sending unit 213 is further configured to send fourth indication information to the first server, the fourth indication information instructing the first server to process the second to-be-processed data.
In some embodiments, before the fourth indication information is sent to the first server, the determining unit 212 is further configured to determine, in response to a second setting operation, the working mode of the plurality of servers in the second scheduling period.
In some embodiments, the processing unit 216 is further configured to adjust, in response to a third setting operation, the duration of the scheduling period and the scheduling frequency of the plurality of servers.
When implemented in hardware, the acquiring unit 211 in the embodiments of the present application may be integrated in a communication interface, and the configuration unit 201 and the processing unit 202 may be integrated in a processor. A specific implementation is shown in FIG. 22.
FIG. 22 is a schematic structural diagram of another possible communication apparatus for the ETL system construction apparatus and the data processing apparatus of the above embodiments. The communication apparatus includes a processor 2202 and a communication interface 2203. The processor 2202 controls and manages the actions of the apparatus, for example by performing the steps performed by the determining unit 212 and the processing unit 216 described above, and/or other processes of the techniques described herein. The communication interface 2203 supports communication between the apparatus and other network entities, for example by performing the steps performed by the acquiring unit 211 described above. The apparatus may further include a memory 2201 and a bus 2204; the memory 2201 stores the program code and data of the apparatus.
The memory 2201 may be a memory in the apparatus. The memory may include volatile memory, such as random access memory; it may also include non-volatile memory, such as read-only memory, flash memory, a hard disk, or a solid-state drive; and it may include a combination of the above types of memory.
The processor 2202 may implement or execute the various exemplary logic blocks, modules, and circuits described in connection with the disclosure of this application. The processor may be a central processing unit, a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The processor may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The bus 2204 may be an Extended Industry Standard Architecture (EISA) bus or the like. The bus 2204 may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, FIG. 22 shows the bus as a single thick line, but this does not mean that there is only one bus or only one type of bus.
The apparatus in FIG. 22 may also be a chip. The chip includes one or more processors 2202 and a communication interface 2203.
Optionally, the chip further includes a memory 2201. The memory 2201 may include read-only memory and random access memory, and provides operation instructions and data to the processor 2202. A part of the memory 2201 may also include non-volatile random access memory (NVRAM).
In some implementations, the memory 2201 stores the following elements: executable modules or data structures, or a subset thereof, or an extended set thereof.
In the embodiments of the present application, the corresponding operations are performed by invoking the operation instructions stored in the memory 2201 (the operation instructions may be stored in an operating system).
From the description of the above implementations, those skilled in the art will clearly understand that, for convenience and brevity, the division into the functional modules described above is only an example. In practice, the functions may be assigned to different functional modules as needed; that is, the internal structure of the apparatus may be divided into different functional modules to perform all or part of the functions described above. For the specific working processes of the system, apparatus, and units described above, refer to the corresponding processes in the foregoing method embodiments; they are not repeated here.
Some embodiments of the present disclosure provide a computer-readable storage medium (for example, a non-transitory computer-readable storage medium) in which computer program instructions are stored. When the computer program instructions run on a computer, they cause the computer to perform the method for constructing an ETL system described in any one of the above embodiments.
By way of example, the computer-readable storage medium may include, but is not limited to: magnetic storage devices (for example, hard disks, floppy disks, or magnetic tapes), optical discs (for example, a CD (Compact Disk) or a DVD (Digital Versatile Disk)), smart cards, and flash memory devices (for example, EPROM (Erasable Programmable Read-Only Memory), cards, sticks, or key drives). The various computer-readable storage media described in this disclosure may represent one or more devices and/or other machine-readable storage media for storing information. The term "machine-readable storage medium" may include, but is not limited to, wireless channels and various other media capable of storing, containing, and/or carrying instructions and/or data.
Some embodiments of the present disclosure further provide a computer program product, which is stored, for example, on a non-transitory computer-readable storage medium. The computer program product includes computer program instructions that, when executed on a computer, cause the computer to perform the methods described in the above embodiments.
Some embodiments of the present disclosure further provide a computer program. When executed on a computer, the computer program causes the computer to perform the methods described in the above embodiments.
The beneficial effects of the above computer-readable storage medium, computer program product, and computer program are the same as those of the methods described in the above embodiments, and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
The above are only specific implementations of the present disclosure, but the protection scope of the present disclosure is not limited thereto. Any change or substitution that a person skilled in the art could conceive within the technical scope disclosed herein shall fall within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (19)

  1. A method for constructing an ETL system, wherein the ETL system comprises a plurality of servers and a plurality of databases, the method comprising:
    configuring the plurality of servers so that the configured servers have an ETL function, each of the plurality of servers having a corresponding identifier;
    configuring a scan Internet Protocol (scan IP) address for the plurality of servers, so that the plurality of servers access any one of the plurality of databases through the scan IP address, the plurality of databases having high availability; and
    setting, by means of a setting tool, a virtual IP address for the plurality of servers and a working mode of the plurality of servers, wherein the virtual IP address is used to establish a communication connection with the plurality of servers, and the working mode comprises an active-active mode and an active-standby mode.
  2. The method according to claim 1, wherein the plurality of databases having high availability comprises: when a first database cannot work, a second database continues to work, the first database and the second database being different databases among the plurality of databases.
  3. The method according to claim 1 or 2, wherein, in the active-active mode, all of the plurality of servers are in a working state, and when a first server cannot process data, a second server continues to process the data, the first server and the second server being different servers among the plurality of servers.
  4. The method according to any one of claims 1-3, wherein, in the active-standby mode, the plurality of servers comprise a primary server and a standby server, the primary server is configured to process data, and the standby server is configured to continue processing the data when the primary server cannot process the data.
  5. A data processing method, applied to an ETL system, wherein the ETL system comprises a plurality of servers, each of the plurality of servers is configured with an ETL tool, and the ETL tool is used to process data, the method comprising:
    acquiring first to-be-processed data of a first scheduling period;
    determining, in response to a first setting operation, a working mode of the plurality of servers in the first scheduling period;
    wherein the working mode comprises an active-active mode and an active-standby mode; in the active-active mode, all of the plurality of servers are in a working state; in the active-standby mode, the plurality of servers comprise a primary server and a standby server, the primary server is in the working state, the standby server is in a dormant state, and when the primary server cannot process data, the standby server switches from the dormant state to the working state; and
    sending first indication information to a first server, wherein the first indication information instructs the first server to process the first to-be-processed data, and the first server is a server in the working state in the first scheduling period.
  6. The method according to claim 5, wherein the ETL system further comprises a plurality of databases, the plurality of servers are each communicatively connected to the plurality of databases, the plurality of databases are communicatively connected to one another, and the plurality of databases are used to store one or more of configuration information of the servers, data processing processes, and scheduling tasks; the method further comprising:
    detecting a working state of the plurality of databases; and
    when it is detected that a first database cannot work, storing the data stored in the first database to a second database, the first database and the second database being different databases among the plurality of databases.
  7. The method according to claim 5 or 6, wherein the method further comprises:
    receiving source data of the first scheduling period from a plurality of systems;
    merging the plurality of pieces of source data to obtain the first to-be-processed data, and storing the first to-be-processed data in a preset storage area of a first database, wherein the first database comprises a plurality of storage areas, and the plurality of storage areas are used to store to-be-processed data of different scheduling periods;
    wherein the acquiring first to-be-processed data of a first scheduling period comprises:
    extracting the first to-be-processed data from the preset storage area of the Hive database.
  8. The method according to any one of claims 5-7, wherein the method further comprises:
    detecting a processing result of the first server processing the first to-be-processed data;
    when the processing result is a failure and the first server is in the working state, sending second indication information to the first server, wherein the second indication information instructs the first server to continue processing the first to-be-processed data; and
    when the processing result is a failure and the first server cannot process data, sending third indication information to a second server, wherein the third indication information instructs the second server to continue processing the first to-be-processed data, and the second server is a server in the working state among the plurality of servers.
  9. The method according to any one of claims 5-8, wherein, when the processing result is a success, the method further comprises:
    acquiring second to-be-processed data of a second scheduling period; and
    sending fourth indication information to the first server, wherein the fourth indication information instructs the first server to process the second to-be-processed data.
  10. The method according to claim 9, wherein, before the sending fourth indication information to the first server, the method further comprises:
    determining, in response to a second setting operation, the working mode of the plurality of servers in the second scheduling period.
  11. The method according to any one of claims 5-10, wherein the method further comprises:
    determining, in response to a third setting operation, a duration of the scheduling period and a scheduling frequency of the plurality of servers.
  12. An ETL system, wherein the ETL system comprises a setting tool and a plurality of servers, and the setting tool is communicatively connected to the plurality of servers through a virtual IP;
    the setting tool is configured to determine, in response to a first setting operation, scheduling tasks and scheduling periods of the plurality of servers;
    the setting tool is further configured to determine, in response to a second setting operation, a working mode of the plurality of servers in each scheduling period, wherein the working mode comprises an active-active mode and an active-standby mode; in the active-active mode, all of the plurality of servers are in a working state; in the active-standby mode, the plurality of servers comprise a primary server and a standby server, the primary server is in the working state, the standby server is in a dormant state, and when the primary server cannot process data, the standby server switches from the dormant state to the working state; and
    the plurality of servers are configured to receive the scheduling tasks from the setting tool, process data according to the scheduling tasks, and store the processed data in an HBase database.
  13. The system according to claim 12, wherein the ETL system further comprises a plurality of databases, the plurality of servers access any one of the plurality of databases through a scan IP address, the plurality of databases are used to store one or more of configuration information of the plurality of servers, data processing processes, and scheduling tasks, and the plurality of databases have high availability.
  14. The system according to claim 12, wherein the plurality of databases having high availability comprises: when a first database cannot work, a second database continues to work, the first database and the second database being different databases among the plurality of databases.
  15. The system according to any one of claims 12-14, wherein the setting tool is further configured to adjust, in response to a third setting operation, the scheduling tasks and/or scheduling periods of the plurality of servers.
  16. An apparatus for constructing an ETL system, wherein the ETL system comprises a plurality of servers and a plurality of databases, the apparatus comprising:
    a configuration unit, configured to configure the plurality of servers so that the configured servers have an ETL function, each of the plurality of servers having a corresponding identifier;
    the configuration unit being further configured to configure a scan Internet Protocol (scan IP) address for the plurality of servers, so that the plurality of servers access any one of the plurality of databases through the scan IP address, the plurality of databases having high availability; and
    a processing unit, configured to set, by means of a setting tool, a virtual IP address for the plurality of servers and a working mode of the plurality of servers, wherein the virtual IP address is used to establish a communication connection with the plurality of servers, and the working mode comprises an active-active mode and an active-standby mode.
  17. A data processing apparatus, applied to an ETL system, wherein the ETL system comprises a plurality of servers, each of the plurality of servers is configured with an ETL tool, the ETL tool is used to process data, and the apparatus comprises an acquiring unit, a determining unit, and a sending unit:
    the acquiring unit is configured to acquire first to-be-processed data of a first scheduling period;
    the determining unit is configured to determine, in response to a first setting operation, a working mode of the plurality of servers in the first scheduling period;
    wherein the working mode comprises an active-active mode and an active-standby mode; in the active-active mode, all of the plurality of servers are in a working state; in the active-standby mode, the plurality of servers comprise a primary server and a standby server, the primary server is in the working state, the standby server is in a dormant state, and when the primary server cannot process data, the standby server switches from the dormant state to the working state; and
    the sending unit is configured to send first indication information to a first server, wherein the first indication information instructs the first server to process the first to-be-processed data, and the first server is a server in the working state in the first scheduling period.
  18. A processing apparatus, comprising a processor and a communication interface, wherein the communication interface is coupled to the processor, and the processor is configured to run a computer program or instructions to implement the construction method according to any one of claims 1-4 and the data processing method according to any one of claims 5-11.
  19. A computer-readable storage medium, wherein instructions are stored in the computer-readable storage medium, and when a computer executes the instructions, the computer performs the construction method according to any one of claims 1-4 and the data processing method according to any one of claims 5-11.
PCT/CN2022/076973 2022-02-18 2022-02-18 Etl system construction method and apparatus, data processing method and apparatus, and etl system WO2023155176A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202280000222.6A CN116917884A (en) 2022-02-18 2022-02-18 ETL system construction method and device, data processing method and device and ETL system
PCT/CN2022/076973 WO2023155176A1 (en) 2022-02-18 2022-02-18 Etl system construction method and apparatus, data processing method and apparatus, and etl system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/076973 WO2023155176A1 (en) 2022-02-18 2022-02-18 Etl system construction method and apparatus, data processing method and apparatus, and etl system

Publications (1)

Publication Number Publication Date
WO2023155176A1 true WO2023155176A1 (en) 2023-08-24

Family

ID=87577278

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/076973 WO2023155176A1 (en) 2022-02-18 2022-02-18 Etl system construction method and apparatus, data processing method and apparatus, and etl system

Country Status (2)

Country Link
CN (1) CN116917884A (en)
WO (1) WO2023155176A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104050042A (en) * 2014-05-30 2014-09-17 北京先进数通信息技术股份公司 Resource allocation method and resource allocation device for ETL (Extraction-Transformation-Loading) jobs
CN106020857A (en) * 2016-04-06 2016-10-12 杭州沃趣科技股份有限公司 Automatic disposition method used for Oracle Real Application Cluster database cluster
CN110909079A (en) * 2019-11-20 2020-03-24 南方电网数字电网研究院有限公司 Data exchange synchronization method, system, device, server and storage medium
US20200334268A1 (en) * 2019-04-18 2020-10-22 Oracle International Corporation System and method for automatic correction/rejection in an analysis applications environment
CN113608932A (en) * 2021-10-09 2021-11-05 深圳市科力锐科技有限公司 Database drilling method, device, equipment and storage medium


Also Published As

Publication number Publication date
CN116917884A (en) 2023-10-20


Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 202280000222.6

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22926503

Country of ref document: EP

Kind code of ref document: A1