WO2023155176A1

WO2023155176A1 - ETL system construction method and apparatus, data processing method and apparatus, and ETL system

Info

Publication number: WO2023155176A1
Application number: PCT/CN2022/076973
Authority: WO
Inventors: 王建宙; 段季芳; 王瑜; 袁菲; 汤玥; 沈国梁; 王萍; 李园园; 吴建波; 何德材; 吴建民; 王洪
Original assignee: 京东方科技集团股份有限公司; 北京中祥英科技有限公司
Priority date: 2022-02-18
Filing date: 2022-02-18
Publication date: 2023-08-24
Also published as: CN116917884A

Abstract

An ETL system construction method and apparatus, a data processing method and apparatus, and an ETL system, which relate to the technical field of data processing and allow rational use of an ETL platform. The ETL system comprises a setting tool and a plurality of servers, wherein the setting tool is in communication connection with the plurality of servers by means of virtual IPs, the setting tool is used for determining scheduling tasks and scheduling cycles of the plurality of servers in response to a first setting operation, and the setting tool is used for determining working modes of the plurality of servers in each scheduling cycle in response to a second setting operation, the working modes comprising an active-active mode and an active-standby mode; and the plurality of servers are used for receiving the scheduling tasks from the setting tool, processing data according to the scheduling tasks, and storing the processed data in an HBase database. The ETL system provided in the embodiments of the present application ensures the normal operation of an ETL platform by means of the high availability of a plurality of servers and a plurality of databases.

Description

Construction method and device of ETL system, data processing method and device, ETL system

Technical field

The present disclosure relates to the technical field of data processing, and in particular to a construction method and device of an ETL system, a data processing method and device, and an ETL system.

Background

With the development of the enterprise, various departments of the enterprise (such as business lines, production lines, and product lines) will undertake the construction of various information systems to process their respective businesses. For example, the information system constructed by the production line can integrate the product data uploaded by the equipment of the production line and the product data uploaded manually for analysis, so that developers can judge the cause of product failure based on the analysis results.

To integrate the data of the various departments and avoid data islands, an extract-transform-load (ETL) platform that meets the data integration requirements is needed. The ETL platform can extract data from a source, transform it, and load it into a destination, so that scattered, messy, and non-uniform data can be integrated to facilitate data analysis by developers. How to use the ETL platform reasonably has therefore become a technical problem that urgently needs to be solved.

Summary of the invention

In one aspect, a method for constructing an ETL system is provided. The ETL system includes multiple servers and multiple databases. The method includes: configuring the multiple servers so that the configured servers have ETL functions, each of the multiple servers having a corresponding identifier; configuring a scan IP address for the multiple servers, so that the multiple servers can access any of the multiple databases through the scan IP address, the multiple databases having high availability; and setting a virtual IP address for the multiple servers and the working modes of the multiple servers. The virtual IP address is used to establish a communication connection with the multiple servers, and the working modes of the multiple servers include an active-active mode and an active-standby mode.

Based on the above technical solution, configuring the multiple servers gives them the ETL function. At the same time, because the multiple servers can access any of the multiple databases through the scan IP address, when one database fails the other databases continue to work normally, which ensures high availability among the databases. The setting tool sets the virtual IP address and the working modes for the multiple servers, so that any server can be reached through the virtual IP address. In other words, high availability among the multiple servers guarantees the normal operation of the ETL platform.

In some embodiments, the multiple databases having high availability includes: when the first database fails to work, the second database continues to work, and the first database and the second database are different databases among the multiple databases.

In some embodiments, in the active-active mode, the multiple servers are all in the working state. When a first server cannot process data, a second server continues to process the data, where the first server and the second server are different servers among the multiple servers.

In some embodiments, in the active-standby mode, the multiple servers include a primary server and a standby server. The primary server is used to process data, and the standby server is used to continue processing data when the primary server cannot process data.
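The two working modes can be illustrated with a minimal sketch. It is not taken from the application: the `Server` class, the `healthy` flag, and the convention that the primary server comes first in the list are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class WorkingMode(Enum):
    ACTIVE_ACTIVE = "active-active"
    ACTIVE_STANDBY = "active-standby"


@dataclass
class Server:
    name: str
    healthy: bool = True  # False once the server can no longer process data


def working_servers(servers, mode):
    """Return the servers that should be processing data right now."""
    healthy = [s for s in servers if s.healthy]
    if mode is WorkingMode.ACTIVE_ACTIVE:
        # Every healthy server works; a failed one is simply skipped.
        return healthy
    # Active-standby: only the first healthy server works. Listing the
    # primary first, followed by standbys, is an assumption of this sketch.
    return healthy[:1]
```

With two healthy servers, active-active returns both and active-standby returns only the primary. Marking the primary as unhealthy makes the standby the single working server in either mode.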

In another aspect, a device for constructing an ETL system is provided, including a configuration unit and a processing unit. The configuration unit is configured to: configure multiple servers, so that the configured servers have ETL functions and each of the multiple servers has a corresponding identifier; and configure a scan IP address for the multiple servers, so that the multiple servers can access any of multiple databases through the scan IP address, the multiple databases having high availability. The processing unit is configured to set, through a setting tool, a virtual IP address for the multiple servers and the working modes of the multiple servers. The virtual IP address is used to establish a communication connection with the multiple servers, and the working modes of the multiple servers include an active-active mode and an active-standby mode.

In another aspect, a data processing method is provided, which is applied to an ETL system. The ETL system includes multiple servers, each equipped with an ETL tool that is used to process data. The method includes: obtaining first data to be processed in a first scheduling period; in response to a first setting operation, determining the working modes of the multiple servers in the first scheduling period, where the working modes include an active-active mode and an active-standby mode; and sending first indication information to a first server, where the first indication information instructs the first server to process the first data to be processed, and the first server is a server in the working state within the first scheduling period. In the active-active mode, the multiple servers are all in the working state. In the active-standby mode, the multiple servers include a primary server and a standby server; the primary server is in the working state and the standby server is in the dormant state, and when the primary server cannot process data, the standby server changes from the dormant state to the working state.

Based on the above technical solution, after the data to be processed in the first scheduling period is obtained, the working modes of the multiple servers are determined and indication information for processing the data is sent to a server in the working state among the multiple servers, so that the server can perform ETL processing on the data to be processed.

In some embodiments, the ETL system also includes multiple databases. The multiple servers are each communicatively connected to the multiple databases, the multiple databases are communicatively connected to one another, and the multiple databases are used to store one or more of configuration information, data processing flows, and scheduling tasks of the servers. The method also includes: detecting the working status of the multiple databases; and, when it is detected that a first database cannot work, storing the data held in the first database in a second database, where the first database and the second database are different databases among the multiple databases.
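The database check described above might look like the following in-memory sketch. The `Database` class and the `fail_over` function are hypothetical names, and the sketch assumes that the records of a failed database remain readable (for example from a replica or shared storage), which the application does not specify.

```python
class Database:
    """Hypothetical stand-in for one of the multiple databases."""

    def __init__(self, name):
        self.name = name
        self.working = True
        self.records = {}  # e.g. configuration, processing flows, scheduling tasks


def fail_over(databases):
    """Copy records from non-working databases into the first working one."""
    targets = [db for db in databases if db.working]
    if not targets:
        raise RuntimeError("no working database available")
    target = targets[0]
    for db in databases:
        if not db.working:
            # Entries already present in the target take precedence.
            for key, value in db.records.items():
                target.records.setdefault(key, value)
    return target
```

Calling `fail_over` after marking one database as not working returns the database that now holds the combined records, so the servers can keep reading their configuration and scheduling tasks from it.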

In some embodiments, the method further includes: receiving source data from multiple systems in the first scheduling period; and combining the multiple pieces of source data to obtain the first data to be processed, and storing the first data to be processed in a preset storage area of the first database. The first database includes multiple storage areas, which are used to store the data to be processed in different scheduling periods. The above "obtaining the first data to be processed in the first scheduling period" may specifically include: extracting the first data to be processed from a preset storage area of the Hive database.
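As a rough illustration of keeping one storage area per scheduling period, the sketch below merges source data and files it under a period key. The `StagingStore` class is hypothetical and uses an in-memory dictionary where a real deployment would use a database; the period labels are likewise assumptions.

```python
from collections import defaultdict


class StagingStore:
    """One storage area per scheduling period, held in memory."""

    def __init__(self):
        self._areas = defaultdict(list)

    def store(self, period, sources):
        # Combine the rows received from every source system into one batch
        merged = [row for rows in sources for row in rows]
        self._areas[period].extend(merged)
        return merged

    def extract(self, period):
        # Return a copy so callers cannot mutate the stored batch
        return list(self._areas.get(period, []))
```

Extracting by period returns only the batch that was stored for that period, which mirrors the separation between the first and second data to be processed.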

In some embodiments, the method further includes: detecting the processing result of the first server processing the first data to be processed; when the processing result is failure and the first server is in the working state, sending second indication information to the first server, where the second indication information instructs the first server to continue processing the first data to be processed; and when the processing result is failure and the first server cannot process data, sending third indication information to a second server, where the third indication information instructs the second server to continue processing the first data to be processed, and the second server is a server in the working state among the multiple servers.

In some embodiments, when the processing result is success, the method further includes: acquiring second data to be processed in a second scheduling period; and sending, to the first server, fourth indication information that instructs the first server to process the second data to be processed.
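The retry and failover behaviour can be sketched as a small dispatch loop. The `Server` class, the `process` callback (returning `True` on success), and the `max_attempts` bound are assumptions introduced for illustration; the application does not state how many times processing is retried.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    working: bool = True


def dispatch(data, first, second, process, max_attempts=3):
    """Return the name of the server that processed `data`, or None."""
    server = first
    for _ in range(max_attempts):
        if process(server, data):
            return server.name
        if not server.working:
            # Third indication: hand the same data to another working server.
            server = second
        # Otherwise (second indication) the same server tries again.
    return None
```

A failure on a server that is still working leads to a retry on the same server, while a failure on a server that can no longer process data moves the batch to the second server.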

In some embodiments, before sending the fourth indication information to the first server, the method further includes: in response to the second setting operation, determining the working modes of the plurality of servers in the second scheduling period.

In some embodiments, the method further includes: in response to the third setting operation, determining the duration and scheduling frequency of the scheduling periods of the plurality of servers.

In yet another aspect, a data processing device is provided, which is applied to an ETL system. The ETL system includes multiple servers, each equipped with an ETL tool that is used to process data. The device includes: an acquisition unit configured to obtain first data to be processed in a first scheduling period; a determining unit configured to determine the working modes of the multiple servers in the first scheduling period in response to a first setting operation, where the working modes include an active-active mode and an active-standby mode; and a sending unit configured to send first indication information to a first server, where the first indication information instructs the first server to process the first data to be processed, and the first server is a server in the working state within the first scheduling period. In the active-active mode, the multiple servers are all in the working state. In the active-standby mode, the multiple servers include a primary server and a standby server; the primary server is in the working state and the standby server is in the dormant state, and when the primary server cannot process data, the standby server changes from the dormant state to the working state.

In some embodiments, the ETL system also includes multiple databases. The multiple servers are each communicatively connected to the multiple databases, the multiple databases are communicatively connected to one another, and the multiple databases are used to store one or more of configuration information, data processing flows, and scheduling tasks of the servers. The device also includes a detection unit configured to: detect the working status of the multiple databases; and, when it is detected that a first database cannot work, store the data held in the first database in a second database, where the first database and the second database are different databases among the multiple databases.

In some embodiments, the device further includes a receiving unit and a processing unit. The receiving unit is configured to receive source data from multiple systems in the first scheduling period. The processing unit is configured to combine the multiple pieces of source data to obtain the first data to be processed, and to store the first data to be processed in a preset storage area of the first database. The first database includes multiple storage areas, which are used to store the data to be processed in different scheduling periods. The acquisition unit is specifically configured to extract the first data to be processed from the preset storage area of the Hive database.

In some embodiments, the detection unit is further configured to: detect the processing result of the first server processing the first data to be processed; when the processing result is failure and the first server is in the working state, control the sending unit to send second indication information to the first server, where the second indication information instructs the first server to continue processing the first data to be processed; and when the processing result is failure and the first server cannot process data, control the sending unit to send third indication information to a second server, where the third indication information instructs the second server to continue processing the first data to be processed, and the second server is a server in the working state among the multiple servers.

In some embodiments, when the processing result is success, the acquisition unit is further configured to obtain second data to be processed in a second scheduling period, and the sending unit is further configured to send, to the first server, fourth indication information that instructs the first server to process the second data to be processed.

In some embodiments, before sending the fourth indication information to the first server, the determining unit is further configured to determine the working modes of the plurality of servers in the second scheduling period in response to the second setting operation.

In some embodiments, the determining unit is further configured to determine the duration and scheduling frequency of the scheduling periods of the multiple servers in response to the third setting operation.

In another aspect, an ETL system is provided. The ETL system includes a setting tool and multiple servers, and the setting tool is communicatively connected to the multiple servers through a virtual IP. The setting tool is used to determine the scheduling tasks and scheduling periods of the multiple servers in response to a first setting operation. The setting tool is also used to determine the working modes of the multiple servers in each scheduling period in response to a second setting operation, where the working modes include an active-active mode and an active-standby mode. In the active-active mode, the multiple servers are all in the working state. In the active-standby mode, the multiple servers include a primary server and a standby server; the primary server is in the working state and the standby server is in the dormant state, and when the primary server cannot process data, the standby server changes from the dormant state to the working state. The multiple servers are used to receive the scheduling tasks from the scheduler, process data according to the scheduling tasks, and store the processed data in the HBase database.

In some embodiments, the ETL system also includes multiple databases. The multiple servers access the multiple databases through scan IP addresses, the multiple databases are used to store one or more of configuration information, data processing flows, and scheduling tasks of the multiple servers, and the multiple databases have high availability.

In some embodiments, the setting tool is further configured to adjust the scheduling tasks and/or scheduling periods of the multiple servers in response to the third setting operation.

In yet another aspect, a computer readable storage medium is provided. The computer-readable storage medium stores computer program instructions. When the computer program instructions run on a computer, the computer executes the ETL system construction method and data processing method as described in any of the above embodiments.

In yet another aspect, a computer readable storage medium is provided. The computer-readable storage medium stores computer program instructions, and when the computer program instructions run on a computer, the computer executes the data processing method as described in any one of the above embodiments.

In yet another aspect, a computer program product is provided. The computer program product includes computer program instructions, and when the computer program instructions are executed on the computer, the computer program instructions cause the computer to execute the ETL system construction method as described in any of the above embodiments.

In yet another aspect, a computer program product is provided. The computer program product includes computer program instructions, and when the computer program instructions are executed on the computer, the computer program instructions cause the computer to execute the data processing method as described in any of the above embodiments.

In yet another aspect, a computer program is provided. When the computer program is executed on the computer, the computer program causes the computer to execute the ETL system construction method described in any of the above embodiments.

In yet another aspect, a computer program is provided. When the computer program is executed on a computer, the computer program causes the computer to execute the data processing method described in any of the above embodiments.

In yet another aspect, a chip is provided, the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is used to run computer programs or instructions, so as to implement the construction method of the ETL system as described in any of the above embodiments.

In yet another aspect, a chip is provided, the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is used to run computer programs or instructions to implement the data processing method as described in any of the above embodiments.

Specifically, the chip provided in this application further includes a memory for storing computer programs or instructions.

It should be noted that all or part of the above computer instructions may be stored on a computer-readable storage medium. The computer-readable storage medium may be packaged together with the processor of the device, or may be packaged separately from the processor of the device, which is not limited in the present application.

In this application, the names of the above-mentioned devices do not limit the devices or functional modules themselves. In actual implementation, these devices or functional modules may appear with other names. As long as the functions of each device or functional module are similar to those of the present application, they fall within the scope of the claims of the present application and their equivalent technologies.

Brief description of the drawings

In order to illustrate the technical solutions in the present disclosure more clearly, the accompanying drawings required in some embodiments of the present disclosure are briefly introduced below. Obviously, the accompanying drawings in the following description are only drawings of some embodiments of the present disclosure, and those of ordinary skill in the art can obtain other drawings based on them. In addition, the drawings in the following description can be regarded as schematic diagrams, and are not limitations on the actual size of the product involved in the embodiments of the present disclosure, the actual process of the method, the actual timing of signals, and the like.

FIG. 1 is a structural diagram of an ETL system provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of an interface for setting a scheduling cycle provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of another interface for setting the scheduling cycle provided by the embodiment of the present application;

FIG. 4 is a schematic interface diagram of an F5 tool provided by an embodiment of the present application;

FIG. 5 is a structural diagram of another ETL system provided by an embodiment of the present application;

FIG. 6 is a structural diagram of another ETL system provided by an embodiment of the present application;

FIG. 7 is a schematic flow diagram of a method for constructing an ETL system provided in an embodiment of the present application;

FIG. 8 is a schematic flowchart of a data processing method provided in an embodiment of the present application;

FIG. 9 is a schematic flowchart of another data processing method provided by the embodiment of the present application;

FIG. 10 is a schematic flowchart of another data processing method provided in the embodiment of the present application;

FIG. 11 is a schematic flowchart of another data processing method provided in the embodiment of the present application;

FIG. 12 is a schematic flowchart of another data processing method provided in the embodiment of the present application;

FIG. 13 is a schematic flow diagram of a data processing method provided in an embodiment of the present application;

FIG. 14 is a schematic flowchart of a data processing method provided in an embodiment of the present application;

FIG. 15 is a schematic flowchart of a data processing method provided in an embodiment of the present application;

FIG. 16 is a schematic diagram of a login interface provided by an embodiment of the present application;

FIG. 17 is a schematic diagram of a data processing result of a server provided by an embodiment of the present application;

FIG. 18 is a schematic diagram of data processing results of another server provided in the embodiment of the present application;

FIG. 19 is a schematic diagram of data processing results of another server provided in the embodiment of the present application;

FIG. 20 is a schematic diagram of a construction device of an ETL system provided by an embodiment of the present application;

FIG. 21 is a schematic diagram of a data processing device provided by an embodiment of the present application;

FIG. 22 is a schematic diagram of a communication device provided by an embodiment of the present application.

Detailed description

The technical solutions in some embodiments of the present disclosure will be clearly and completely described below in conjunction with the accompanying drawings. Apparently, the described embodiments are only some of the embodiments of the present disclosure, not all of them. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments provided in the present disclosure belong to the protection scope of the present disclosure.

Throughout the specification and claims, unless the context requires otherwise, the term "comprise" and its other forms, such as the third person singular "comprises" and the present participle "comprising", are interpreted in an open and inclusive sense, that is, "including, but not limited to". In the description of the specification, the terms "one embodiment", "some embodiments", "exemplary embodiments", "example", "specific example" or "some examples" are intended to indicate that specific features, structures, materials or characteristics related to the embodiment or example are included in at least one embodiment or example of the present disclosure. Schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials or characteristics described may be included in any suitable manner in any one or more embodiments or examples.

Hereinafter, the terms "first" and "second" are used for descriptive purposes only, and cannot be understood as indicating or implying relative importance or implicitly specifying the quantity of indicated technical features. Thus, a feature defined as "first" and "second" may explicitly or implicitly include one or more of these features. In the description of the embodiments of the present disclosure, unless otherwise specified, "plurality" means two or more.

In describing some embodiments, the expressions "coupled" and "connected" and their derivatives may be used. For example, the term "connected" may be used in describing some embodiments to indicate that two or more elements are in direct physical or electrical contact with each other. As another example, the term "coupled" may be used when describing some embodiments to indicate that two or more elements are in direct physical or electrical contact. However, the terms "coupled" or "communicatively coupled" may also mean that two or more elements are not in direct contact with each other, but still co-operate or interact with each other. The embodiments disclosed herein are not necessarily limited by the context herein.

"At least one of A, B and C" has the same meaning as "at least one of A, B or C", and both include the following combinations of A, B and C: A only, B only, C only, a combination of A and B, a combination of A and C, a combination of B and C, and a combination of A, B and C.

"A and/or B" includes the following three combinations: A only, B only, and a combination of A and B.

As used herein, the term "if" is optionally interpreted to mean "when" or "upon" or "in response to determining" or "in response to detecting," depending on the context. Similarly, the phrases "if it is determined that ..." or "if [the stated condition or event] is detected" are optionally construed to mean "upon determining ..." or "in response to determining ..." or "upon detection of [stated condition or event]" or "in response to detection of [stated condition or event]", depending on the context.

The use of "suitable for" or "configured to" herein means open and inclusive language that does not exclude devices that are suitable for or configured to perform additional tasks or steps.

Additionally, the use of "based on" is meant to be open and inclusive, as a process, step, calculation, or other action that is "based on" one or more stated conditions or values may in practice be based on additional conditions or beyond stated values.

As used herein, "about" or "approximately" includes the stated value as well as the average within the acceptable deviation range of the specified value, where the acceptable deviation range is determined by one of ordinary skill in the art taking into account the measurement in question and the errors associated with the measurement of a particular quantity (i.e., limitations of the measurement system).

In the following, nouns involved in the embodiments of the present application are explained for the convenience of readers' understanding.

ETL: Can be used to describe the process of extracting, transforming, and loading data from the source to the destination. ETL can refer to: extracting the required data from different data sources, performing data cleaning and conversion on the extracted data, and then loading the cleaned and converted data into the database.
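The three ETL stages can be shown with a minimal sketch. The row layout (`id` and `value` fields), the cleaning rule, and the dictionary used as the destination table are assumptions chosen only to keep the example self-contained.

```python
def extract(sources):
    """Pull the rows out of every data source into one list."""
    return [row for source in sources for row in source]


def transform(rows):
    """Clean and convert: drop rows without a value, cast values to float."""
    cleaned = []
    for row in rows:
        if row.get("value") is None:
            continue  # cleaning step: skip incomplete rows
        cleaned.append({"id": row["id"], "value": float(row["value"])})
    return cleaned


def load(rows, table):
    """Write the cleaned rows into the destination table, keyed by id."""
    for row in rows:
        table[row["id"]] = row["value"]
    return table
```

Chaining the stages as `load(transform(extract(sources)), {})` yields a destination table that contains only the complete rows, with values normalized to one type.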

In one example, a node or server can be configured with an ETL tool that can perform ETL functions. For example, the ETL tool may be an application program or a device capable of performing the function, without limitation.

It should be noted that, in the embodiment of the present application, a node or a server configured with an ETL tool may be referred to as an ETL platform/system, or may be a part of the ETL platform/system.

Database: It can be used to store the data required for the operation of the ETL platform. For example, data extracted from different data sources, cleaned data, converted data, etc., can also be used to store configuration files related to ETL. For example, the database can be used to store job content submitted by ETL to the server, ETL configuration information, ETL scheduling information, and the like. Specifically, reference may be made to the description of the following embodiments.

For example, the database may be a relational database management system (RDBMS), such as a MySQL database, an Oracle database, or a PostgreSQL database.

It should be noted that in the embodiment of the present application, there is high availability between multiple databases and multiple servers configured with ETL tools.

High availability: It means that a system is specially designed so that when one device fails, other devices can continue to operate, thereby reducing downtime and maintaining the availability of the system.

For example, take a system with multiple databases that are highly available. When one of the multiple databases cannot provide storage services, for instance because of a power outage or a hardware or software failure, a healthy database among the multiple databases can continue to provide storage services. The system as a whole can therefore continue to provide storage services.

As another example, take a system with multiple servers that are configured with ETL tools and are highly available. When a hardware or software failure occurs in one of the multiple servers, the servers that are in a normal state can continue to work, which ensures that the system can still provide ETL data processing services.

Normally, most ETL platforms/systems operate based on a single node/single server. When a single node/single server fails, it will affect the normal operation of the ETL platform/system.

In view of this, an embodiment of the present application provides an ETL system, which may include multiple servers configured with ETL tools, and the multiple servers may share one virtual IP address to achieve high availability of the multiple servers. Each of the multiple servers can be connected to multiple databases, and the multiple databases also have high availability. In this way, through the high availability among multiple servers and multiple databases, the problem that the entire ETL system cannot operate normally when a server or database fails is avoided.

The implementation of the embodiment of the present application will be described in detail below in conjunction with the accompanying drawings.

As shown in FIG. 1 , FIG. 1 is a schematic structural diagram of an ETL system provided by an embodiment of the present application. The ETL system may include: a plurality of servers (only two servers are shown in the figure, server 01 and server 02 respectively) and a configuration tool. The setting tool communicates with a plurality of servers respectively.

Wherein, the setting tool can determine the scheduling tasks, scheduling periods and working modes of multiple servers in response to the setting operation. A setup tool can be a setter, a controller, a control tool, etc. The setting tool can be one device or multiple devices. For example, as shown in FIG. 1 , the setting tool may include a scheduler 10 and a preset tool 20 . The scheduler 10 can communicate with multiple servers through virtual IP addresses. The preset tool 20 may be communicatively coupled to a plurality of servers via communication links. The communication link may be a wired communication link or a wireless communication link.

Specifically, the scheduler 10 may be configured to determine the scheduling tasks and scheduling periods of the multiple servers in response to a setting operation. A scheduling task can refer to the data that a server needs to extract, transform and integrate. The scheduling period may refer to the time interval at which a server executes the scheduling task. For example, the scheduler 10 may assign scheduling tasks to the multiple servers at preset time intervals.

In an example, the scheduler 10 may have a corresponding setting interface. The setting interface can determine the scheduling tasks and scheduling periods of multiple servers in response to user operations.

For example, FIG. 2 and FIG. 3 show setting interfaces for the scheduling period provided by the embodiment of the present application. The setting interface shown in FIG. 2 can be used to set the duration of the scheduling period. The setting interface shown in FIG. 3 can be used to set the running frequency within the scheduling period.

The preset tool 20 can be used to determine the working modes of the multiple servers in each scheduling period in response to a setting operation. For example, the preset tool 20 may be a hardware device with a setting page. The setting page can determine the working modes of the multiple servers in each scheduling period in response to the user's setting operations. For example, the working modes of the multiple servers can be set in the area of the page corresponding to the resource. The node name, address, and server port entries can respond to setting operations to determine which server is in the working state and which port receives data. In response to clicking "node list", multiple servers may be configured and their statuses may be set to the working state. In response to clicking "add", information of a new server can be added. In response to clicking "Cancel", the information of a server may be deleted. It should be noted that the embodiment of the present application is described using the F5 tool as an example of the preset tool; other tools providing the same functions may also be used, without limitation.

Wherein, the working modes of the multiple servers may include active-active mode (Active-Active) and active-standby mode (Active-Standby).

The active-active mode and active-standby mode of multiple servers will be described below in conjunction with server 01 and server 02 in FIG. 1 .

1. Active-active mode

In the active-active mode, multiple servers can execute the scheduling tasks of the ETL system at the same time. When a server fails to run because of a problem, the setting tool can assign the task run by the server to another server that is running normally, so that the server that is running normally can continue to process the task. Therefore, the normal operation of the ETL system is guaranteed under the condition of fully using the computing resources of multiple servers.

In one example, as shown in FIG. 1, when the server 01 and the server 02 are in the active-active mode, both the server 01 and the server 02 are in the working state. For example, the server 01 executes the scheduling task 1, and the server 02 executes the scheduling task 2. When the server 01 has a problem (such as a power failure or a hardware failure) and cannot normally execute the scheduling task 1, the setting tool can assign the scheduling task 1 to the server 02, so that the server 02 can continue to execute the scheduling task 1.
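As a non-limiting illustration, the reassignment in the active-active mode can be sketched as follows. The function name `reassign_tasks`, the dictionary layout, and the choice of the least-loaded healthy server as the target are assumptions made for this sketch and are not specified by the embodiment.

```python
def reassign_tasks(assignments, healthy):
    """Move tasks from failed servers to healthy ones (hypothetical helper).

    assignments: dict mapping server name -> list of task names
    healthy: set of server names that are currently running normally
    """
    if not healthy:
        raise RuntimeError("no healthy server available")
    # Healthy servers keep the tasks they already run.
    result = {s: list(t) for s, t in assignments.items() if s in healthy}
    for server in healthy:
        result.setdefault(server, [])
    for server, tasks in assignments.items():
        if server in healthy:
            continue
        for task in tasks:
            # Assumption: hand the task to the healthy server with the fewest tasks.
            target = min(sorted(result), key=lambda name: len(result[name]))
            result[target].append(task)
    return result
```

With server 01 failed, `reassign_tasks({"server01": ["task1"], "server02": ["task2"]}, {"server02"})` leaves server 02 running both tasks.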

2. Active-standby mode

In the active-standby mode, the multiple servers include a primary server and a standby server. The primary server is used to execute the scheduling tasks of the ETL system. The standby server is used to continue executing a scheduling task when the primary server cannot execute it.

In one example, with reference to FIG. 1, the server 01 is the primary server and the server 02 is the standby server. When the server 01 cannot execute the scheduling task 1, indication information may be sent to the server 02, and the indication information may be used to instruct the server 02 to continue executing the scheduling task 1. The indication information may include identification information of the scheduling task 1 or data corresponding to the scheduling task 1. When the indication information includes the identification information of the scheduling task 1, the server 02 may acquire the data corresponding to the scheduling task 1 according to the identification information. When the indication information includes the data corresponding to the scheduling task 1, the server 02 may directly process the data.
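The two forms of indication information described above can be illustrated with a minimal sketch. The `Indication` layout, the `task_store` lookup table, and the function name are hypothetical and serve only to show the branching between the two cases.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Indication:
    """Indication information sent to the standby server (hypothetical layout)."""
    task_id: Optional[str] = None  # identification information of the task
    data: Optional[list] = None    # or the data corresponding to the task


def resolve_task_data(indication, task_store):
    """Return the data the standby server should process."""
    if indication.data is not None:
        # The indication carries the data: process it directly.
        return indication.data
    if indication.task_id is not None:
        # The indication carries only an identifier: look the data up.
        return task_store[indication.task_id]
    raise ValueError("indication carries neither data nor a task identifier")
```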

As can be seen from the above, in the embodiment of the present application, through the above two working modes, when any one of the multiple servers cannot normally execute the scheduling task, other servers can continue to execute the scheduling task, ensuring the normal operation of the ETL system.

Wherein, the plurality of servers are all configured with ETL tools. All the multiple servers can execute the scheduling task.

In an example, the setting page of the preset tool may be as shown in FIG. 4 . The setting page can respond to the developer's setting operations to determine the working modes of multiple servers and the scheduled tasks to be executed.

In a possible embodiment, as shown in FIG. 5 , the ETL system may further include multiple databases (only database 21 and database 22 are shown in the figure). The multiple databases are communicatively connected to multiple servers respectively.

Wherein, the plurality of databases can be used to store information such as job content, configuration information, and scheduling plan of the server. The stored information can be synchronized between the multiple databases. The job content of the server may include the above-mentioned scheduling task and the data processed by the scheduling task. The scheduling plan may include the duration of the scheduling period, the scheduling frequency of the scheduling period, and the like.

In an example, the multiple databases may have the same scan IP address, and the multiple servers may access any one of the multiple databases through the scan IP address.

It should be noted that, in this embodiment of the application, both the virtual IP address and the scan IP address are used to access a device. For example, the setting tool can access any one of the multiple servers through the virtual IP address. There is a corresponding relationship between the virtual IP address and the IP addresses of the multiple servers. A server can access any one of the multiple databases through the scan IP address. There is a corresponding relationship between the scan IP address and the IP addresses of the multiple databases. The multiple servers and the multiple databases may also each be configured with their own IP addresses.
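The corresponding relationship between a shared address (virtual IP or scan IP) and the individual device addresses can be pictured with the following sketch. The mapping layout, the `is_up` reachability callback, and the first-reachable selection rule are assumptions for illustration; the embodiment does not specify how a device behind the shared address is selected.

```python
def resolve(shared_address, mapping, is_up):
    """Pick a reachable device behind a shared (virtual or scan) address.

    mapping: dict of shared address -> list of real addresses (assumed layout)
    is_up: callable returning True when a real address is reachable
    """
    for real_address in mapping[shared_address]:
        if is_up(real_address):
            return real_address
    raise ConnectionError(f"no device reachable behind {shared_address}")
```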

It should be noted that, in the embodiment of the present application, with reference to the database 21 and the database 22 in FIG. 5, when a hardware or software failure occurs in the database 21, the database 22 can continue to provide storage services; or, when a hardware or software failure occurs in the database 22, the database 21 can continue to provide storage services.

In yet another possible embodiment, as shown in FIG. 6, the ETL system provided in the embodiment of the present application may also communicate with a distributed computing (Hadoop) system. The Hadoop system may include multiple databases, such as a Hive database and a distributed (Hbase) database.

The Hive database communicates with multiple source systems and with the ETL system. A source system can be used to provide raw data; for example, it can provide product history information and bad CODE information.

In an example, the multiple source systems may include a yield management system (YMS) and a management data warehouse (MDW) system. The YMS system can be used to provide product history information. The MDW system can be used to provide error code (CODE) information for inspected products.

The history information of a product may refer to the basic information of the product that factory equipment uploads directly to the YMS system. For example, the history information of a product can include one or more of: factory (FACTORY), product lot number (LOT_ID), product (such as glass) identification (GLS_ID), event time key (EVENT_TIMEKEY), product type (PRODUCT_TYPE), former process site (OLD_OPER_CODE), product model (PRODUCT_ID), equipment identification (EQP_ID), unit identification (UNIT_ID), subunit identification (SUB_UNIT_ID), and previous process input time (LAST_PROCESS_IN_TIME).

In one example, the history information of a product can be stored in the YMS system in the form of an Oracle table, or in the form of an array, without limitation. For example, taking glass as the product, the history information of the product may be as shown in Table 1.

Table 1

Table field	Description
FACTORY	Factory
LOT_ID	Lot ID
GLS_ID	Glass ID
EVENT_TIMEKEY	Event time
PRODUCT_TYPE	Product type
OLD_OPER_CODE	Former process site
PRODUCT_ID	Product model
EQP_ID	Equipment name
UNIT_ID	Unit name
SUB_UNIT_ID	Subunit name
LAST_PROCESS_IN_TIME	Previous process input time
EVENT_TIME	Event time

It should be noted that the table fields in Table 1 are only exemplary, and may also include other fields, such as product size, thickness, and other fields, which are not limited.

The bad CODE information of a product may describe a problem that occurred in the production process of the product. For example, the bad CODE information of a product can include one or more of: factory (FACTORY), site (STEP), product lot number (LOT_ID), product identification (GLS_ID), former process site (PRODUCT_ID), product type (PRODUCT_TYPE), product size (PRODUCT_SIZE), product model (NEW_MODEL), CODE grade (DEFECT_GRADE), bad CODE (DEFECT_CODE), and CODE detection time (TXN_TIME).

In one example, the bad CODE information of a product can be stored in the MDW system in the form of a table, or in the form of an array, without limitation. For example, taking glass as the product, the bad CODE information of the product can be as shown in Table 2.

Table 2

Table field	Description
FACTORY	Factory
STEP	Site
LOT_ID	Lot ID
GLS_ID	Glass ID
PRODUCT_ID	Former process site
PRODUCT_TYPE	Product type
PRODUCT_SIZE	Product size
NEW_MODEL	Product model 2
DEFECT_GRADE	CODE grade
DEFECT_CODE	Bad CODE
TXN_TIME	CODE detection time

It should be noted that the table fields in Table 2 are only exemplary, and may also include other table fields, for example, may also include identification information of the detection device, etc., without limitation.

It should be pointed out that the above Table 1 and Table 2 refer to the information corresponding to the same product.

Combining the above Table 1 and Table 2, the Hive database can store the synchronized Hive history information. The Hive history information may include partition fields and fields extracted from Table 1 and Table 2. This partition field can identify the Hive history information after synchronization.

In one example, the Hive database may include multiple storage areas, and the synchronized Hive history information may be stored in the corresponding storage areas in Parquet format. Hive history information in different storage areas can have different partition field values. For example, the partition field may be the time field (TIMEDAY) corresponding to the Hive history information.

For example, when the partition field in the synchronized Hive history information is a time field, with reference to Table 1 and Table 2, the fields of the synchronized Hive history information can also include one or more of: factory (FACTORY), product lot number (LOT_ID), product identification (GLS_ID), event time key (EVENT_TIMEKEY), product type (PRODUCT_TYPE), former process site (OLD_OPER_CODE), product model (PRODUCT_ID), equipment identification (EQP_ID), unit identification (UNIT_ID), subunit identification (SUB_UNIT_ID), and previous process input time (LAST_PROCESS_IN_TIME). Of course, the synchronized Hive history information may also include other fields, without limitation. For example, the synchronized Hive history information can be as shown in Table 3.

Table 3

Table field	Description
FACTORY	Factory
LOT_ID	Lot ID
GLS_ID	Glass ID
EVENT_TIMEKEY	Event time
PRODUCT_TYPE	Product type
OLD_OPER_CODE	Former process site
PRODUCT_ID	Product model
EQP_ID	Equipment name
UNIT_ID	Unit name
SUB_UNIT_ID	Subunit name
LAST_PROCESS_IN_TIME	Previous process input time
EVENT_TIME	Event time
TIMEDAY	Partition field

It should be noted that, among the fields in Table 3, other fields except the partition field are fields synchronized based on the fields in Table 1 and Table 2.
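As an illustration of how a partition field can accompany a synchronized record, the following sketch adds a TIMEDAY value to a record. It assumes that EVENT_TIME is available as a datetime and that TIMEDAY is the calendar day of EVENT_TIME written as YYYYMMDD; the embodiment only states that the partition field may be a time field, so both the derivation and the format are assumptions.

```python
from datetime import datetime


def add_partition_field(record):
    """Return a copy of a synchronized record with a TIMEDAY partition value.

    Assumption: EVENT_TIME is a datetime and TIMEDAY is its calendar day.
    """
    out = dict(record)
    out["TIMEDAY"] = record["EVENT_TIME"].strftime("%Y%m%d")
    return out
```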

The Hbase database can be used to store the data processed by the ETL system. For example, the ETL system can obtain the synchronized Hive history information from the Hive database, process the synchronized Hive history information, and store the processed data in the Hbase database for use by the background system.

For example, the process for the ETL system to process the synchronized Hive history information may include:

1. Read the synchronized Hive history information, and filter the synchronized Hive history information to obtain the processed data.

Filtering the synchronized Hive history information may refer to filtering according to the site information and time information of the product, and determining information such as the residence time and equipment type of each product at the same site.

Specifically, the residence time of the product in the same site (PROCESS_TIME) = the event time of the product (EVENT_TIME) - the input time of the previous process (LAST_LOGGED_IN_TIME). The equipment type of the product (represented by the field STEP_ID) is part of the equipment name (EQP_ID). For example, it can be the first five characters of EQP_ID. For example, if the EQP_ID of the product is AAEWS07, the corresponding STEP_ID can be AAEWS.
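The two derived fields can be illustrated with a short sketch. The function names and the use of datetime values are assumptions; the subtraction and the five-character prefix follow the description above.

```python
from datetime import datetime


def process_time(event_time, last_in_time):
    """Residence time at a site: event time minus previous-process input time."""
    return event_time - last_in_time


def step_id(eqp_id, length=5):
    """Equipment type taken as the leading characters of EQP_ID (five in the text)."""
    return eqp_id[:length]
```

For the example in the text, `step_id("AAEWS07")` yields `"AAEWS"`.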

2. Write the processed data into the Hbase database according to the preset format.

The preset format can be set as required. For example, the preset format may be a format that is convenient for the background system to use.

In one example, the processed data may be stored in the Hbase database in the form of a table, or in other forms, without limitation. For example, the processed data can be as shown in Table 4.

Table 4

It should be noted that, in Table 4, the MD5 field is the first three characters of the value obtained by applying the MD5 function to the LOT_ID field. For the MD5 function and its use, reference may be made to the prior art, and details are not repeated here. SEQ_ID is a value spliced from multiple fields, which may include TRKG, OLD_OPER_CODE, and STEP_ID, where TRKG represents history information.
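A minimal sketch of the two key fields follows. It assumes that the MD5 result is taken as a hexadecimal string, that TRKG is a literal marker placed first in SEQ_ID, and that the spliced fields are joined with an underscore; none of these details is specified in the description.

```python
import hashlib


def md5_prefix(lot_id, length=3):
    """First characters of the hexadecimal MD5 digest of LOT_ID."""
    return hashlib.md5(lot_id.encode("utf-8")).hexdigest()[:length]


def seq_id(old_oper_code, step_id, sep="_"):
    """Spliced value of TRKG, OLD_OPER_CODE and STEP_ID (separator is assumed)."""
    return sep.join(["TRKG", old_oper_code, step_id])
```

A short hash prefix of this kind is commonly used to spread row keys evenly across storage regions, which may be the purpose of the MD5 field here.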

Based on the ETL system shown in FIG. 1 or FIG. 5 or FIG. 6 , the embodiment of the present application provides a method for constructing an ETL system and a data processing method (abbreviated as a data processing method) applied to the ETL system. The following describes the construction method and data processing method of the ETL system respectively:

1. Construction method of the ETL system.

As shown in Figure 7, a method for constructing an ETL system provided by the embodiment of the present application may include:

S701. Configure multiple servers, so that the configured multiple servers have an ETL function.

The multiple servers may be server 01 and server 02 in FIG. 1 or FIG. 2. Configuring the multiple servers may refer to installing a preset configuration file on each of the multiple servers; a server on which the preset configuration file has been installed may have an ETL function. For example, the configuration file may be Pentaho Server.

Further, in order to ensure that each server has its own corresponding identification information, the setting tool can also, in response to a setting operation, set a cluster configuration for the multiple servers, so that the multiple servers have the same cluster ID. In this way, the multiple servers can be managed through the cluster ID.

Wherein, each server may also be set with a corresponding ID. For example, the ID of server 01 may be node 1 (node1), and the ID of server 02 may be node2.

S702. Configure a scan IP address for the multiple servers, so that the multiple servers can access any database in the multiple databases through the scan IP address.

The scan IP address can be used to access the multiple databases.

In a possible implementation, each server can store the scan IP in response to the developer's setting operation.

S703. Set the virtual IP addresses for the multiple servers and the working modes of the multiple servers through the preset tool.

The virtual IP address can be used to establish a communication connection with the multiple servers, and other devices (such as the scheduler) can access any server in the multiple servers through the virtual IP address. There is a corresponding relationship between the virtual IP address and the IP address of each server. For the working modes of the multiple servers, reference may be made to the above description, and details are not repeated here.

In a possible implementation manner, when the preset tool has an interface as shown in FIG. 4 , the interface can determine working modes of multiple servers in response to a setting operation.

Based on the technical solution provided by this embodiment, the multiple servers are configured so that they have the ETL function. At the same time, because the multiple servers can access any database in the multiple databases through the scan IP address, when one database fails, the other databases can continue to work normally, ensuring high availability among the databases. The virtual IP address and working modes are set for the multiple servers through the preset tool, so that any server can be accessed through the virtual IP address; that is, the high availability among the multiple servers guarantees the normal operation of the ETL platform.

2. Data processing method.

As shown in FIG. 8, an embodiment of the present application provides a data processing method. The method is applied to the setting tool, or to some devices of the setting tool such as the scheduler, in the ETL system of FIG. 1 or FIG. 5 above. The method includes:

S801. Acquire first data to be processed in a first scheduling period.

Wherein, the first data to be processed may be Hive history information after synchronization.

In a possible implementation manner, the setting tool may acquire the first data to be processed in the first scheduling period from the Hive database.

It should be noted that, in this embodiment of the application, a server configured with the ETL tool can send, through the setting tool, indication information to the Hive database indicating the information associated with the multiple source systems. After receiving the indication information, the Hive database may obtain the history information and bad CODE information of the products in the first scheduling period from the multiple source systems and integrate them to obtain the first data to be processed.

S802. In response to the first setting operation, determine working modes of multiple servers in the first scheduling period.

Wherein, the first setting operation may refer to an operation on a setting page corresponding to the preset tool. For the working modes of multiple servers, reference may be made to the above-mentioned descriptions of the active-active mode and the active-standby mode, and details are not repeated here.

S803. Send the first indication information to the first server. Correspondingly, the first server receives the first indication information.

The first indication information may be used to instruct the first server to process the first data to be processed. The first server is a server, among the plurality of servers, that is in a working state within the first scheduling period.

Based on the technical solution in FIG. 8, the setting tool can obtain the data to be processed in the first scheduling period, determine the working modes of the multiple servers, and send indication information for processing the data to a server in the working state among the multiple servers, enabling that server to perform ETL processing on the data to be processed.
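Steps S801 to S803 can be sketched as follows. The callables `fetch_data` and `send`, the dictionary used as indication information, and the choice of the first working server in the list are hypothetical and only illustrate the order of the steps.

```python
def run_scheduling_period(fetch_data, working_servers, send):
    """One scheduling period as seen by the setting tool (hypothetical helper).

    fetch_data: callable returning the data to be processed (S801)
    working_servers: servers in the working state for this period (S802)
    send: callable(server, indication) delivering the indication (S803)
    """
    data = fetch_data()
    if not working_servers:
        raise RuntimeError("no server is in the working state")
    first_server = working_servers[0]
    indication = {"action": "process", "data": data}
    send(first_server, indication)
    return first_server
```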

In a possible embodiment, as shown in FIG. 9, the method may further include S901 and S902.

S901. The first server acquires first scheduling data according to the first indication information.

In a possible implementation manner, the first indication information may include first data to be processed. In this way, the first server can directly acquire the first data to be processed from the first indication information.

In yet another possible implementation manner, the first indication information may include an identifier or a storage address of the first data to be processed. In this way, the first server can acquire the first data to be processed according to the identifier or storage address of the first data to be processed.

S902. The first server uses an ETL tool to process the first scheduling data to obtain processed first scheduling data.

Wherein, the processed first scheduling data may refer to data after processing the synchronized Hive history information. Specifically, reference may be made to the above description, and details are not repeated here.

In a possible embodiment, as shown in FIG. 10, in the embodiment of the present application, the method may further include:

S1001. Detect working states of multiple databases.

Wherein, detecting the working status of multiple databases may refer to detecting whether the multiple databases can work normally.

In a possible implementation manner, the working status among multiple databases may be detected through information interaction with multiple databases. For example, information can be sent to multiple databases periodically or randomly, and when the response information of the database is received, it is determined that the database is in a normal working state; when the response information of a certain database is not received, it is determined that the database is abnormal.

S1002. When it is detected that the first database cannot work, store the data stored in the first database to the second database.

The second database may be a database without failure among the multiple databases. The first database and the second database are different databases among the plurality of databases.

Based on this possible embodiment, through the high availability among multiple databases, the normal operation of the databases can be guaranteed.
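The detection in S1001 and the selection of a second database in S1002 can be sketched as follows. The `ping` callable stands in for the information interaction with each database and is hypothetical, as are the function names.

```python
def detect_states(databases, ping):
    """Map each database name to True (responded) or False (no response)."""
    return {name: bool(ping(name)) for name in databases}


def pick_second_database(states, first):
    """Choose a working database other than `first`, or None if none is left."""
    for name, ok in states.items():
        if ok and name != first:
            return name
    return None
```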

In a possible embodiment, as shown in FIG. 11, the method may further include:

S1101. Receive source data of a first scheduling period from multiple systems.

S1102. Merge multiple source data to obtain first data to be processed, and store the first data to be processed in a preset storage area of a preset database.

Among them, the preset database can be used to store the data to be processed. For example, the preset database may be the Hive database in FIG. 6 . A preset database can include multiple storage areas. The multiple storage areas can be used to store data to be processed in different scheduling periods. Data to be processed in different scheduling cycles can have unique identifiers. For example, the identifier can be a partition field.

In S801 above, acquiring the first data to be processed in the first scheduling period may specifically include: extracting the first data to be processed from the preset storage area of the preset database.

Based on this possible embodiment, the data of multiple source databases can be integrated to obtain data to be processed in a unified format, so that the server can process the data to be processed.
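A minimal sketch of S1101 and S1102 is given below. It assumes that the source data are lists of dictionaries, that history rows and bad CODE rows are matched on GLS_ID, and that the merged result carries a hypothetical DEFECT_CODES field; the embodiment does not specify how the merge is performed.

```python
from collections import defaultdict


def merge_sources(history_rows, defect_rows, key="GLS_ID"):
    """Join history rows with defect rows on a shared key (assumed: GLS_ID)."""
    defects = defaultdict(list)
    for row in defect_rows:
        defects[row[key]].append(row)
    merged = []
    for row in history_rows:
        combined = dict(row)
        # Attach the defect codes found for the same product, if any.
        combined["DEFECT_CODES"] = [d["DEFECT_CODE"] for d in defects[row[key]]]
        merged.append(combined)
    return merged


def store_by_period(storage, period_id, rows):
    """Keep each scheduling period's data in its own storage area."""
    storage.setdefault(period_id, []).extend(rows)
    return storage
```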

In a possible embodiment, as shown in FIG. 12, the method may further include:

S1201. Detect a processing result of processing the first data to be processed by the first server.

S1202. When the processing result is failure and the first server is in a working state, send second indication information to the first server.

The second indication information is used to instruct the first server to continue processing the first data to be processed.

S1203. When the processing result is failure and the first server cannot process the data, send third indication information to the second server.

The third indication information is used to instruct the second server to continue processing the first data to be processed.

Based on this possible implementation manner, the data to be processed can be processed in a timely manner to avoid data omission.
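The decision made in S1201 to S1203 can be expressed as a small function. The return values are placeholder labels chosen for this sketch.

```python
def next_indication(result_ok, first_server_working):
    """Decide which indication to send after checking the processing result.

    Returns (target, indication) or None when nothing needs to be resent.
    """
    if result_ok:
        return None
    if first_server_working:
        # S1202: the first server failed the job but is still working.
        return ("first_server", "second_indication")
    # S1203: the first server cannot process data; hand over to the second.
    return ("second_server", "third_indication")
```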

In a possible embodiment, when the processing result is successful, as shown in FIG. 12 , the method may further include:

S1204. Acquire second data to be processed in the second scheduling period.

S1205. Send fourth indication information to the first server.

The fourth indication information is used to instruct the first server to process the second data to be processed.

Further, before S1205, the method may also include:

In response to the second setting operation, the working modes of the plurality of servers in the second scheduling period are determined.

The working modes of the multiple servers in the second scheduling period may be the same as or different from the working modes of the multiple servers in the first scheduling period. For example, if the working mode of the multiple servers in the first scheduling period is the active-active mode, the working mode of the multiple servers in the second scheduling period can be the active-standby mode or the active-active mode, without limitation.

Based on this possible embodiment, the ETL system can continue to process data, which ensures the periodic work of the ETL system.

In a possible embodiment, the method may also include:

In response to the third setting operation, the duration and scheduling frequency of the scheduling periods of the plurality of servers are determined.

The duration and scheduling frequency of the scheduling period may refer to the duration and scheduling frequency of the first scheduling period, or may refer to the duration and scheduling frequency of the second scheduling period. The duration of the scheduling cycle and the scheduling frequency can be set as required. The duration and scheduling frequency of multiple scheduling cycles can be the same or different, and are not limited.

Based on this possible implementation, the working mode of multiple servers, the duration of the scheduling cycle, and the scheduling frequency can be adjusted manually, which is flexible and convenient.

Taking the data processed by the ETL system as product history information and the preset tool as the F5 tool as an example, the data processing method provided by the embodiment of the present application will be described below.

1. Obtain the scheduling task and the working frequency of the scheduling task.

The scheduling task is related to user requirements. A user requirement may be to determine why a product has a bad CODE. The scheduling task may refer to performing ETL processing on product history information and bad CODE information. For example, the scheduling task may refer to obtaining the synchronized Hive history table from the Hive database according to the scheduling period and processing the data in the synchronized Hive history table.

The synchronized Hive history table may refer to the history table obtained after synchronizing the glass history information and the bad CODE information. For the fields in the synchronized Hive history table, refer to Table 3 above; details are not repeated here.

In one example, taking glass as the product, a bad CODE may be generated during the production process of each glass, for example, scratches in the box, foreign body Gap, abnormal lighting, or black spot CODE. The user requirement may be to determine the cause of the bad CODE of the glass. The scheduling task may refer to performing ETL processing on the history information and bad CODE information of each glass.

2. Establish data synchronization and data processing operations through the ETL system.

In one example, as shown in FIG. 13, the process of establishing data synchronization and data processing jobs through the ETL system may include: start → obtain scheduling time → data synchronization (data sync) → data processing → update scheduling time → processing succeeded. The process is described in detail below:

1. Obtain the scheduling time.

Wherein, obtaining the scheduling time may refer to determining the duration of the first scheduling period.

The duration of the first scheduling period can be set as required. For example, with a day as the granularity, the first scheduling period can run from 0:00 to 24:00, or from 6:00 of the current day to 6:00 of the next day; other time periods are also possible, without limitation.
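The day-granularity windows mentioned above can be computed with a short sketch. The function name and the `start_hour` parameter are assumptions introduced for illustration.

```python
from datetime import date, datetime, time, timedelta


def scheduling_window(day, start_hour=0):
    """Return (start, end) of a one-day scheduling period beginning at start_hour."""
    start = datetime.combine(day, time(hour=start_hour))
    return start, start + timedelta(days=1)
```

Calling `scheduling_window(date(2022, 2, 18), 6)` gives the window from 6:00 on that day to 6:00 on the next day.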

2. Data synchronization.

In an example, the data synchronization process may be as shown in FIG. 14. In FIG. 14, table input refers to obtaining source data from the Oracle database, and table output refers to the Parquet output component writing data to the storage area of the Hive database.

3. Data processing.

The data processing may include: filtering the obtained glass history information table according to the site information and time information, and calculating information such as the residence time and equipment of each glass at the same site, to obtain the processed data; and writing the processed data into the Hbase database.

Among them, the residence time of GLASS in the same site (PROCESS_TIME) = the event time of GLASS (EVENT_TIME) - the input time of the previous process (LAST_LOGGED_IN_TIME).

For the processed data, refer to Table 4 above.

In one example, in order to improve data conversion efficiency, as shown in FIG. 15, the synchronized Hive history information may be converted in a multi-branch manner. The branch corresponding to table input 1 in FIG. 15 converts the device-related information in the synchronized Hive history information. The branch corresponding to table input 2 converts the information related to the subunits of the device in the synchronized Hive history information.

In FIG. 15, table input 1 may refer to reading the device-related information (such as EQP_ID) from the synchronized Hive history information, and row to column 1 may refer to converting the device-related history information of each row into columns. The purpose of converting rows to columns is to convert the data into the input format required by the Hbase database. Filtering records may refer to filtering out null values (NULL) in the data. Table input 2 may refer to reading the information related to the subunits of the device (such as UNIT_ID) from the synchronized Hive history information. Table output may refer to writing Table 4 into the Hbase database.

Based on the multi-branch synchronous processing method, the data processing efficiency of the ETL platform can be improved.
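A minimal Python sketch of the two branches described for FIG. 15 is given below. It is not the ETL tool's own implementation: the row layout (GLASS_ID, NAME, VALUE), the function name rows_to_columns, and the use of a thread pool as a stand-in for the multi-process manner are all assumptions made for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def rows_to_columns(rows, key_field, name_field, value_field):
    """Pivot (key, name, value) rows into one column dict per key, skipping NULLs."""
    pivoted = defaultdict(dict)
    for row in rows:
        if row[value_field] is None:  # "filtering records": drop NULL values
            continue
        pivoted[row[key_field]][row[name_field]] = row[value_field]
    return dict(pivoted)


# Hypothetical rows standing in for the synchronized Hive history information.
device_rows = [
    {"GLASS_ID": "G001", "NAME": "EQP_ID", "VALUE": "EQP_A"},
    {"GLASS_ID": "G002", "NAME": "EQP_ID", "VALUE": None},
]
unit_rows = [
    {"GLASS_ID": "G001", "NAME": "UNIT_ID", "VALUE": "UNIT_1"},
]

# Run the device branch and the subunit branch concurrently.
with ThreadPoolExecutor(max_workers=2) as pool:
    device_future = pool.submit(rows_to_columns, device_rows, "GLASS_ID", "NAME", "VALUE")
    unit_future = pool.submit(rows_to_columns, unit_rows, "GLASS_ID", "NAME", "VALUE")

# Merge the branch results per GLASS before the (hypothetical) table output step.
merged = defaultdict(dict)
for branch in (device_future.result(), unit_future.result()):
    for glass_id, columns in branch.items():
        merged[glass_id].update(columns)
merged = dict(merged)
```

In this sketch the row for G002 carries only a NULL value, so it is filtered out and does not appear in the merged result.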

4. Update the scheduling time.

Wherein, updating the scheduling time may refer to determining the scheduling period of the next scheduled task.
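One possible way to determine the next scheduling period is sketched below. It assumes fixed-length periods that follow one another without gaps, which the description does not state; the function name next_schedule_window is hypothetical.

```python
from datetime import datetime, timedelta


def next_schedule_window(last_end: datetime, period: timedelta):
    """Return the (start, end) of the next scheduling period.

    Assumption: the next period starts where the previous one ended.
    """
    return last_end, last_end + period


start, end = next_schedule_window(datetime(2022, 2, 18, 10, 0), timedelta(minutes=1))
```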

3. Log in to the user console and set the duration of the scheduling cycle.

Among them, the user console can be used to set the scheduling period of multiple servers.

In one example, before the pages shown in FIG. 2 and FIG. 3 are displayed, the developer may open the login interface shown in FIG. 16 and enter a user name and password. If the entered user name and corresponding password are correct, the pages shown in FIG. 2 and FIG. 3 can be displayed.

Here, a correct user name and password may mean that the entered user name and corresponding password are the same as the stored user name and corresponding password.
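The comparison against stored credentials can be sketched as follows. The description only requires that the entered values match the stored ones; storing a salted hash rather than the plain password, and the names hash_password and check_login, are assumptions of this sketch.

```python
import hashlib
import hmac
import os


def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a salted hash (assumption: the console stores hashes, not plain text)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


def check_login(store, user_name: str, password: str) -> bool:
    """Return True only if the user exists and the password matches the stored one."""
    entry = store.get(user_name)
    if entry is None:
        return False
    salt, stored_hash = entry
    # Constant-time comparison of the derived hash with the stored hash.
    return hmac.compare_digest(hash_password(password, salt), stored_hash)


# Hypothetical credential store with a single developer account.
salt = os.urandom(16)
store = {"developer": (salt, hash_password("s3cret", salt))}
```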

It should be noted that, in the embodiment of the present application, after multiple servers are configured, communication connections can be established with the multiple servers through a client (such as a computer). The client may be provided with a user console, and the user console may be an application program or a page for controlling multiple servers. In response to the login operation, the client may display a login interface as shown in FIG. 16 .

4. Determine the working modes of multiple servers in response to the setting operations performed on the page corresponding to the F5 tool.

Wherein, the page corresponding to the F5 tool can also be displayed through the client. That is, the client communicates with the F5 tool. For example, the client may be configured with an application or a webpage for controlling the F5 tool, through which the F5 tool may be controlled.

Based on this embodiment, after the scheduling task is determined, data synchronization and data processing jobs are established through the ETL system, and after the data processing is completed, the scheduling time is updated to prepare for the next scheduling task, thereby ensuring the normal operation of the ETL system.

In a possible embodiment, taking two servers (server 01 and server 02) as an example, this embodiment of the present application also provides a simulation of the data processing effect after a server goes down. The specific simulation process may include:

1. Create a test job (job_data_sync), and set the scheduling frequency to once every minute.

FIG. 17 shows the operation status of server 01 (IP address XX.XX.XX.28), and FIG. 18 shows the operation status of server 02 (IP address XX.XX.XX.27).

As shown in Figure 17 and Figure 18, both servers can operate normally.

2. In response to the operation of stopping server 02, the F5 tool controls server 02 to stop running and controls server 01 to continue running.

When server 02 stops running, the running status of server 01 may be as shown in FIG. 19.

As shown in FIG. 19, when server 01 and server 02 are configured for high availability, the F5 tool can adjust the server load, and both server 01 and server 02 can perform data processing operations. When one server goes down, the ETL job is executed normally on the other server, which meets the high availability requirements of ETL in the actual production environment.
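The simulated behavior can be approximated by the following Python sketch. It does not use or model the F5 tool's actual interface; the Server class, the dispatch function, and the round-robin policy are assumptions chosen only to show a job continuing on the remaining server.

```python
from itertools import cycle


class Server:
    """Hypothetical stand-in for one ETL server behind the load balancer."""

    def __init__(self, name):
        self.name = name
        self.running = True


def dispatch(job, servers, rotation):
    """Send the job to the next running server in the rotation."""
    for _ in range(len(servers)):
        server = next(rotation)
        if server.running:
            return server.name
    raise RuntimeError(f"no server available for {job}")


servers = [Server("server01"), Server("server02")]
rotation = cycle(servers)

# Both servers running: the job alternates between them.
before = [dispatch("job_data_sync", servers, rotation) for _ in range(4)]

# Server 02 stops: every run of the job lands on server 01.
servers[1].running = False
after = [dispatch("job_data_sync", servers, rotation) for _ in range(4)]
```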

It should be pointed out that the various embodiments of the present application may refer to each other, for example, the same or similar steps, method embodiments, system embodiments and device embodiments may refer to each other without limitation.

The embodiment of the present application can divide the construction device of the ETL system into functional modules or functional units according to the above method example. For example, each functional module or functional unit can be divided corresponding to each function, or two or more functions can be integrated in one processing module. The above-mentioned integrated modules can be implemented in the form of hardware, or in the form of software function modules or functional units. The division of modules or units in the embodiment of the present application is schematic and is only a logical function division; there may be another division manner in actual implementation.

FIG. 20 is a schematic structural diagram of an ETL system construction device provided by the embodiment of the present application. The device includes a configuration unit 201 and a processing unit 202.

The configuration unit 201 is configured to: configure multiple servers so that the configured multiple servers have ETL functions, where each of the multiple servers has a corresponding identifier; and configure a scan IP address for the multiple servers, so that the multiple servers can access any database among multiple databases through the scan IP address, where the multiple databases have high availability. The processing unit 202 is configured to set, through a preset tool, a virtual IP address for the multiple servers and a working mode of the multiple servers. The virtual IP address is used to establish a communication connection with the multiple servers, and the working mode of the multiple servers includes an active-active mode and an active-standby mode.

FIG. 21 is a schematic structural diagram of a data processing device provided in the embodiment of the present application. The device is applied to an ETL system; the ETL system includes multiple servers, the multiple servers are all equipped with ETL tools, and the ETL tools are used to process data. The device includes an obtaining unit 211, a determining unit 212, and a sending unit 213.

The obtaining unit 211 is configured to obtain the first data to be processed in the first scheduling period. The determining unit 212 is configured to determine the working modes of the multiple servers in the first scheduling period in response to the first setting operation, where the working modes include an active-active mode and an active-standby mode. In the active-active mode, the multiple servers are all in the working state. In the active-standby mode, the multiple servers include a primary server and a standby server; the primary server is in the working state and the standby server is in the dormant state, and when the primary server cannot process data, the standby server changes from the dormant state to the working state. The sending unit 213 is configured to send first indication information to the first server, where the first indication information is used to instruct the first server to process the first data to be processed, and the first server is a server in a working state within the first scheduling period.
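The two working modes can be illustrated with the sketch below, which returns the servers that are in the working state for a given mode. The Mode enum, the working_servers function, and the convention that the first server in the list is the primary server are assumptions of this sketch.

```python
from enum import Enum


class Mode(Enum):
    ACTIVE_ACTIVE = "active-active"
    ACTIVE_STANDBY = "active-standby"


def working_servers(mode, servers, healthy):
    """Return the servers in the working state.

    servers: ordered names; in active-standby mode the first one is the primary.
    healthy: set of names that are currently able to process data.
    """
    if mode is Mode.ACTIVE_ACTIVE:
        # All servers that can process data are working.
        return [name for name in servers if name in healthy]
    primary, *standbys = servers
    if primary in healthy:
        return [primary]  # standby servers stay dormant
    for name in standbys:
        if name in healthy:
            return [name]  # a standby changes from dormant to working
    return []


servers = ["server01", "server02"]
```

For example, in the active-standby mode with only server02 healthy, the sketch returns server02 as the single working server.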

In some embodiments, the ETL system also includes a plurality of databases. The plurality of servers are each in communication connection with the plurality of databases, the plurality of databases are in communication connection with one another, and the plurality of databases are used to store one or more of the configuration information of the servers, the data processing processes, and the scheduling tasks. As shown in FIG. 21, the device also includes a detection unit 214 configured to: detect the working status of the plurality of databases; and when it is detected that the first database cannot work, store the data stored in the first database in the second database, where the first database and the second database are different databases among the plurality of databases.
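The behavior of the detection unit can be sketched as follows, with plain dictionaries standing in for the databases. The function name select_database and the is_working callback are hypothetical, and the sketch assumes that the contents of the failed database remain readable (for example from a replica or shared storage), which the description does not specify.

```python
def select_database(databases, is_working, first, second):
    """Return the name of the database to use, copying data over on failure.

    databases: dict mapping a database name to a dict of stored items.
    is_working: callable returning True if the named database can work.
    """
    if is_working(first):
        return first
    # Assumption: the data of the failed database is still readable,
    # for example from a replica or shared storage.
    databases[second].update(databases[first])
    return second


databases = {
    "db1": {"schedule": "job_data_sync every minute"},
    "db2": {},
}
active = select_database(databases, lambda name: name != "db1", "db1", "db2")
```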

In some embodiments, as shown in FIG. 21, the device further includes a receiving unit 215 and a processing unit 216. The receiving unit 215 is configured to receive source data from multiple systems in the first scheduling period. The processing unit 216 is configured to combine the multiple source data to obtain the first data to be processed, and store the first data to be processed in a preset storage area of the first database. The first database includes multiple storage areas, and the multiple storage areas are used to store data to be processed in different scheduling periods. The obtaining unit 211 is specifically configured to extract the first data to be processed from the preset storage area of the Hive database.
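The combine, store, and extract steps can be sketched as below. A dictionary keyed by scheduling period stands in for the storage areas of the database; the function names and the period identifier format are hypothetical.

```python
def stage_source_data(storage_areas, period_id, sources):
    """Combine source data from several systems and store it for one scheduling period."""
    combined = [record for source in sources for record in source]
    storage_areas[period_id] = combined
    return combined


def extract_pending(storage_areas, period_id):
    """Extract the data to be processed for the given scheduling period."""
    return list(storage_areas.get(period_id, []))


# Hypothetical source data uploaded by two systems in the first scheduling period.
storage_areas = {}
stage_source_data(
    storage_areas,
    "2022-02-18T10:00",
    [[{"GLASS_ID": "G001"}], [{"GLASS_ID": "G002"}]],
)
first_pending = extract_pending(storage_areas, "2022-02-18T10:00")
```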

In some embodiments, the detection unit 214 is further configured to: detect the processing result of the first server processing the first data to be processed; when the processing result is failure and the first server is in the working state, control the sending unit 213 to send second instruction information to the first server, where the second instruction information is used to instruct the first server to continue processing the first data to be processed; and when the processing result is failure and the first server cannot process the data, control the sending unit 213 to send third instruction information to the second server, where the third instruction information is used to instruct the second server to continue processing the first data to be processed, and the second server is a server in a working state among the plurality of servers.

In some embodiments, when the processing result is success, the obtaining unit 211 is further configured to obtain the second data to be processed in the second scheduling period, and the sending unit 213 is further configured to send fourth indication information to the first server, where the fourth indication information is used to instruct the first server to process the second data to be processed.
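The choice among the second, third, and fourth instruction information described in these embodiments can be summarized by the following sketch. The Instruction enum and the choose_instruction function are hypothetical names; the sketch returns which instruction is sent and to which server.

```python
from enum import Enum


class Instruction(Enum):
    SECOND = "first server continues processing the first data"
    THIRD = "second server continues processing the first data"
    FOURTH = "first server processes the second data"


def choose_instruction(success, first_can_process, first, second):
    """Return (instruction, target server) for a detected processing result."""
    if success:
        return Instruction.FOURTH, first
    if first_can_process:
        return Instruction.SECOND, first
    return Instruction.THIRD, second


result = choose_instruction(False, False, "server01", "server02")
```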

In some embodiments, before the fourth indication information is sent to the first server, the determining unit 212 is further configured to determine the working modes of the multiple servers in the second scheduling period in response to the second setting operation.

In some embodiments, the processing unit 216 is further configured to adjust the duration and scheduling frequency of the scheduling periods of the multiple servers in response to the third setting operation.

When implemented by hardware, the obtaining unit 211 in the embodiment of the present application may be integrated on a communication interface, and the configuration unit 201 and the processing unit 202 may be integrated on a processor. The specific implementation is shown in FIG. 22.

FIG. 22 shows a schematic structural diagram of another possible communication device for the construction device of the ETL system and the data processing device involved in the above embodiments. The communication device includes a processor 2202 and a communication interface 2203. The processor 2202 is used to control and manage the actions of the device, for example, to execute the steps executed by the determining unit 212 and the processing unit 216 above, and/or to execute other processes of the technologies described herein. The communication interface 2203 is used to support communication between the device and other network entities, for example, to perform the steps performed by the above-mentioned obtaining unit 211. The device may also include a memory 2201 and a bus 2204, and the memory 2201 is used to store program codes and data of the device.

The memory 2201 may be a memory in the device or the like. The memory may include a volatile memory, such as a random access memory; the memory may also include a non-volatile memory, such as a read-only memory, a flash memory, a hard disk, or a solid-state drive; the memory may also include a combination of the above-mentioned types of memory.

The above-mentioned processor 2202 may implement or execute the various exemplary logic blocks, modules and circuits described in conjunction with the disclosure of this application. The processor may be a central processing unit, a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit, a field programmable gate array or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The processor may also be a combination that implements computing functions, for example, a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.

The bus 2204 may be an Extended Industry Standard Architecture (EISA) bus or the like. The bus 2204 can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in FIG. 22, but this does not mean that there is only one bus or one type of bus.

The device in FIG. 22 may also be a chip. The chip includes one or more processors 2202 and a communication interface 2203.

Optionally, the chip further includes a memory 2201. The memory 2201 may include a read-only memory and a random access memory, and provides operation instructions and data to the processor 2202. A part of the memory 2201 may also include a non-volatile random access memory (NVRAM).

In some implementations, the memory 2201 stores the following elements, execution modules or data structures, or their subsets, or their extended sets.

In the embodiment of the present application, the corresponding operation is executed by calling the operation instruction stored in the memory 2201 (the operation instruction may be stored in the operating system).

Through the description of the above embodiments, those skilled in the art can clearly understand that, for convenience and brevity of description, only the division of the above-mentioned functional modules is used as an example for illustration. In practical applications, the above-mentioned functions can be allocated to different functional modules as needed; that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above. For the specific working process of the above-described system, device, and unit, reference may be made to the corresponding process in the foregoing method embodiments, and details are not repeated here.

Some embodiments of the present disclosure provide a computer-readable storage medium (for example, a non-transitory computer-readable storage medium), where computer program instructions are stored in the computer-readable storage medium. When the computer program instructions are run on a computer, the computer is caused to execute the construction method of the ETL system as described in any one of the above embodiments.

Exemplarily, the above-mentioned computer-readable storage medium may include, but is not limited to: a magnetic storage device (for example, a hard disk, a floppy disk, or a magnetic tape), an optical disk (for example, a CD (Compact Disk) or a DVD (Digital Versatile Disk)), a smart card, and a flash memory device (for example, an EPROM (Erasable Programmable Read-Only Memory), a card, a stick, or a key drive). Various computer-readable storage media described in this disclosure can represent one or more devices and/or other machine-readable storage media for storing information. The term "machine-readable storage medium" may include, but is not limited to, wireless channels and various other media capable of storing, containing and/or carrying instructions and/or data.

Some embodiments of the present disclosure also provide a computer program product, for example, the computer program product is stored on a non-transitory computer-readable storage medium. The computer program product includes computer program instructions, and when the computer program instructions are executed on the computer, the computer program instructions cause the computer to execute the methods described in the above-mentioned embodiments.

Some embodiments of the present disclosure also provide a computer program. When the computer program is executed on the computer, the computer program causes the computer to execute the methods described in the above-mentioned embodiments.

The beneficial effects of the above computer-readable storage medium, computer program product, and computer program are the same as those of the methods described in some of the above embodiments, and will not be repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed systems, devices and methods may be implemented in other ways. For example, the device embodiments described above are only illustrative. For example, the division of the units is only a logical function division, and there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.

The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.

The above is only a specific embodiment of the present disclosure, but the scope of protection of the present disclosure is not limited thereto. Any change or substitution that a person familiar with the technical field can conceive of within the technical scope of the present disclosure shall fall within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure should be determined by the protection scope of the claims.

Claims

A method for building an ETL system, wherein the ETL system includes a plurality of servers and a plurality of databases, the method comprising:

Configuring the multiple servers so that the configured multiple servers have ETL functions, and each of the multiple servers has a corresponding identifier;

Configuring a scan IP address for the plurality of servers, so that the plurality of servers access any database in the plurality of databases through the scan IP address, and the plurality of databases have high availability;

Setting, through a setting tool, a virtual IP address for the plurality of servers and the working mode of the plurality of servers, wherein the virtual IP address is used to establish a communication connection with the plurality of servers, and the working mode includes an active-active mode and an active-standby mode.
The method according to claim 1, wherein the plurality of databases having high availability comprises: when a first database fails to work, a second database continues to work, and the first database and the second database are different databases among the plurality of databases.
The method according to claim 1 or 2, wherein, in the active-active mode, the plurality of servers are all in a working state, and when a first server cannot process the data, a second server continues to process the data, the first server and the second server being different servers among the plurality of servers.
The method according to any one of claims 1-3, wherein, in the active-standby mode, the plurality of servers include a primary server and a standby server, the primary server is used to process data, and the standby server is used to continue processing the data when the primary server is unable to process the data.
A data processing method, applied to an ETL system, wherein the ETL system comprises a plurality of servers, the plurality of servers are equipped with an ETL tool, and the ETL tool is used for processing data, the method comprising:

Obtaining first data to be processed in a first scheduling period;

In response to a first setting operation, determining the working modes of the plurality of servers within the first scheduling period;

Wherein the working modes include an active-active mode and an active-standby mode; in the active-active mode, the plurality of servers are all in a working state; in the active-standby mode, the plurality of servers include a primary server and a standby server, the primary server is in a working state, the standby server is in a dormant state, and when the primary server cannot process data, the standby server changes from the dormant state to the working state;

Sending first instruction information to a first server, where the first instruction information is used to instruct the first server to process the first data to be processed, and the first server is a server in a working state within the first scheduling period.
The method according to claim 5, wherein the ETL system further comprises a plurality of databases, the plurality of servers are in communication connection with the plurality of databases, the plurality of databases are in communication connection with one another, and the plurality of databases are used to store one or more of the configuration information of the servers, the data processing process, and the scheduling tasks; the method further comprises:

detecting the working status of the plurality of databases;

When it is detected that the first database cannot work, the data stored in the first database is stored in a second database, and the first database and the second database are different databases among the plurality of databases.
The method according to claim 5 or 6, wherein the method further comprises:

receiving source data from a plurality of systems for the first scheduling period;

combining a plurality of the source data to obtain the first data to be processed, and storing the first data to be processed in a preset storage area of a first database, wherein the first database includes a plurality of storage areas, and the plurality of storage areas are used to store data to be processed in different scheduling periods;

The acquisition of the first data to be processed in the first scheduling cycle includes:

Extracting the first data to be processed from the preset storage area of the Hive database.
The method according to any one of claims 5-7, wherein the method further comprises:

Detecting a processing result of processing the first data to be processed by the first server;

When the processing result is failure and the first server is in a working state, sending second instruction information to the first server, where the second instruction information is used to instruct the first server to continue processing the first data to be processed;

When the processing result is failure and the first server cannot process the data, sending third instruction information to a second server, where the third instruction information is used to instruct the second server to continue processing the first data to be processed, and the second server is a server in a working state among the plurality of servers.
The method according to any one of claims 5-8, wherein, when the processing result is successful, the method further comprises:

Acquiring the second data to be processed in the second scheduling cycle;

Sending fourth instruction information to the first server, where the fourth instruction information is used to instruct the first server to process the second data to be processed.
The method according to claim 9, wherein, before sending the fourth indication information to the first server, the method further comprises:

In response to the second setting operation, the working modes of the plurality of servers in the second scheduling period are determined.
The method according to any one of claims 5-10, wherein the method further comprises:

In response to the third setting operation, the duration and scheduling frequency of the scheduling periods of the plurality of servers are determined.
An ETL system, wherein the ETL system includes a setting tool and a plurality of servers, and the setting tool is in communication connection with the plurality of servers through a virtual IP;

The setting tool is used to determine the scheduling tasks and scheduling periods of the multiple servers in response to the first setting operation;

The setting tool is also used to determine the working modes of the plurality of servers in each scheduling period in response to the second setting operation, wherein the working modes include an active-active mode and an active-standby mode; in the active-active mode, the plurality of servers are all in a working state; in the active-standby mode, the plurality of servers include a primary server and a standby server, the primary server is in a working state, and the standby server is in a dormant state; when the primary server cannot process data, the standby server changes from the dormant state to the working state;

The multiple servers are used to receive scheduling tasks from the setting tool, process data according to the scheduling tasks, and store the processed data in the Hbase database.
The system according to claim 12, wherein the ETL system further includes a plurality of databases, the plurality of servers access any one of the plurality of databases through a scan IP address, the plurality of databases are used to store one or more of the configuration information of the plurality of servers, the data processing process, and the scheduling tasks, and the plurality of databases have high availability.
The system according to claim 12, wherein the plurality of databases having high availability comprises: when a first database fails to work, a second database continues to work, and the first database and the second database are different databases among the plurality of databases.
The system according to any one of claims 12-14, wherein the setting tool is further configured to adjust scheduling tasks and/or scheduling periods of the plurality of servers in response to a third setting operation.
A construction device of an ETL system, wherein the ETL system comprises a plurality of servers and a plurality of databases, the device comprising:

A configuration unit configured to configure the multiple servers so that the configured multiple servers have an ETL function, and each of the multiple servers has a corresponding identifier;

The configuration unit is also configured to configure a scan IP address for the plurality of servers, so that the plurality of servers access any database in the plurality of databases through the scan IP address, and the plurality of databases have high availability;

The processing unit is configured to set, through a setting tool, a virtual IP address for the plurality of servers and the working mode of the plurality of servers, wherein the virtual IP address is used to establish a communication connection with the plurality of servers, and the working mode includes an active-active mode and an active-standby mode.
A data processing device, applied to an ETL system, wherein the ETL system includes a plurality of servers, the plurality of servers are equipped with an ETL tool, and the ETL tool is used to process data, the device comprising an obtaining unit, a determining unit, and a sending unit:

The obtaining unit is configured to obtain the first data to be processed in the first scheduling period;

The determining unit is configured to determine the working modes of the plurality of servers within the first scheduling period in response to a first setting operation;

Wherein the working modes include an active-active mode and an active-standby mode; in the active-active mode, the plurality of servers are all in a working state; in the active-standby mode, the plurality of servers include a primary server and a standby server, the primary server is in a working state, the standby server is in a dormant state, and when the primary server cannot process data, the standby server changes from the dormant state to the working state;

The sending unit is configured to send first instruction information to a first server, where the first instruction information is used to instruct the first server to process the first data to be processed, and the first server is a server in a working state within the first scheduling period.
A processing device, comprising: a processor and a communication interface, wherein the communication interface is coupled to the processor, and the processor is used to run a computer program or instructions, so as to implement the construction method according to any one of claims 1-4 and the data processing method according to any one of claims 5-11.
A computer-readable storage medium, wherein instructions are stored in the computer-readable storage medium, and when a computer executes the instructions, the computer executes the construction method according to any one of claims 1-4 and the data processing method according to any one of claims 5-11.