Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments. It is obvious that the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments that a person skilled in the art can derive from the embodiments given herein without creative effort shall fall within the protection scope of the present invention. In addition, the technical features of the various embodiments provided by the invention can be combined with one another to form feasible technical solutions, provided such combinations can be realized by a person skilled in the art; where a combination is contradictory or cannot be realized, it is deemed not to exist and does not fall within the protection scope of the present invention.
The traditional mode of informatization system construction builds the ETL and the distributed application platform (specifically, WebService) as two independent subsystems. Because the data processing and publishing processes are separated in the traditional method, information resources obtained after processing cannot be published in time, so the time and funds invested in those information resources are wasted. A topological graph of the two system modes is shown in fig. 2 to illustrate their difference. As can be seen from fig. 2, in the prior art the functions of ETL and WebService are separated: the ETL processes source data into target data and then loads the target data, while the WebService publishes related services over the network (web) according to the requested services to meet the corresponding requirements of the client.
Based on the above situation, the present patent aims to publish the data processing flow configured in the ETL as a distributed application platform (specifically, WebService) service. An external application can subscribe to and access the platform service of the distributed application program, complete the data processing flow call, and acquire and display data. This function can also be integrated into a third-party access service for joint invocation. To achieve this object, an embodiment of the present invention provides an ETL-based data service method, referring to fig. 1, the method including:
101. sending the acquired data to an ETL data exchange platform to obtain a processed data result;
102. providing the data result with a data service in the form of ETL WebService on the ETL data exchange platform, and publishing the WebService.
The acquired data is sent to the ETL data exchange platform, the data result is provided with a data service in the form of ETL WebService on the platform, and the WebService is published; these steps are integrated together to complete the data exchange in a one-stop manner. Obtaining the processed data result includes: extracting, cleaning, converting, filtering, connecting, searching, replacing, sorting, aggregating, desensitizing and/or combining the data.
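As a hedged illustration only, the two integrated steps 101 and 102 can be sketched as a pair of methods; the class name `EtlDataService` and both method names are invented here and are not part of the claimed platform:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class EtlDataService {
    // Step 101 (sketch): send acquired rows to the exchange platform and
    // obtain a processed result. The body stands in for the extract/clean/
    // convert/filter/... operations named in the text.
    public static List<Map<String, String>> process(List<Map<String, String>> acquired) {
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> row : acquired) {
            if (!row.isEmpty()) {   // placeholder filtering step
                result.add(row);
            }
        }
        return result;
    }

    // Step 102 (sketch): publish the processed result as a WebService.
    // Real publishing is done by the platform; this stub only reports it.
    public static String publish(List<Map<String, String>> result, String serviceName) {
        return "published " + serviceName + " with " + result.size() + " rows";
    }
}
```

The point of the sketch is only that both steps live behind one integrated entry, rather than in two separate subsystems as in fig. 2.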
On the basis of the above embodiment, in the ETL-based data service method provided in the embodiment of the present invention, the manner of acquiring the data includes: acquiring data from a WebService distributed application platform, a database, JMS and/or a general file.
On the basis of the above embodiment, in the ETL-based data service method provided in the embodiment of the present invention, the general file includes: text files, Excel files, XML files, and/or dataset files.
Specifically, the database data currently supports 24 database types: Access (.mdb, .accdb), DB2V5, DB2V9, DB2V9.7, DM5, DM6, DM7, FoxPro (.dbf), Greenplum, Informix10, Informix7.3, MySQL3, MySQL4, MySQL5, Oracle10, Oracle11, Oracle8, Oracle9, SQLServer2000, SQLServer2005, SQLServer2008, Sybase11, Sybase12, and Sybase15. Access (.mdb, .accdb) and FoxPro (.dbf) must provide their ODBC data source names. The database names of Oracle8, Oracle9, Oracle10, and Oracle11 are actually their service names. The MySQL3, MySQL4, and MySQL5 databases do not have the notion of a schema, so the schema name does not appear within the tree display structure of a newly created data source.
The text file data source mainly provides an access function for text file data: a text file with a fixed format can be parsed in tabular form and then provided to the ETL engine for processing. Various setting options, such as character set, line separator, column separator, and text delimiter, are provided so that the text file can be conveniently split, and detection functions for the file-encoding character set and the line separator are also provided.
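The splitting behavior described above (a column separator combined with a text delimiter) can be sketched as follows; `TextLineSplitter` and its method are illustrative names, and a real parser would also handle separators embedded inside delimited fields:

```java
public class TextLineSplitter {
    // Split one text line by a literal column separator, then strip the
    // text delimiter from any field that is fully enclosed by it.
    public static String[] split(String line, String columnSeparator, String textDelimiter) {
        String[] parts = line.split(java.util.regex.Pattern.quote(columnSeparator), -1);
        for (int i = 0; i < parts.length; i++) {
            String p = parts[i];
            boolean delimited = p.startsWith(textDelimiter)
                    && p.endsWith(textDelimiter)
                    && p.length() >= 2 * textDelimiter.length();
            if (delimited) {
                parts[i] = p.substring(textDelimiter.length(),
                                       p.length() - textDelimiter.length());
            }
        }
        return parts;
    }
}
```

For example, the line `"a"|b|` with column separator `|` and text delimiter `"` yields the three fields `a`, `b`, and an empty string.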
The CSV file data source mainly provides an access function for CSV file data: a CSV file with a fixed format can be parsed in tabular form and then provided to the ETL engine for processing. Setting options such as the character set are provided, together with a character-set detection function for the file encoding.
The Excel file data source provides an access function for Excel file data and can parse an Excel file in a fixed format or in an arbitrary format. A fixed format means the Excel sheet is a simple table, similar to a table in a relational database, so that column information can be obtained from the Excel file directly. When an Excel file in an arbitrary format is parsed, the column information is specified by the user: the system reads each row of the Excel file and fills the data into the user-defined columns; if a row has more cells than the user-defined columns, the excess part is discarded, and if it has fewer, the missing columns are filled with null values.
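The discard-or-pad rule for arbitrary-format Excel rows can be sketched in a few lines; `ExcelRowFitter` is a hypothetical helper, not part of the ETL product:

```java
import java.util.Arrays;

public class ExcelRowFitter {
    // Fit a raw row to the user-defined column count: Arrays.copyOf
    // truncates rows that are too long and pads short rows with null,
    // matching the behavior described for arbitrary-format Excel files.
    public static String[] fit(String[] rawRow, int userColumnCount) {
        return Arrays.copyOf(rawRow, userColumnCount);
    }
}
```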
The XML document data may use a specified XML document as a data source in the ETL process, and the XML data source may be used as a data set of an XML data reading component.
Dataset file data is also known as DDS file data, where DDS is an abbreviation of Damon Data Set. DDS is a file format unique to the ETL that supports data compression. A DDS file stores the complete column information and message record information obtained in the conversion process.
The DBF file data source provides an access function for DBF file data and can parse a DBF file with a fixed format, reading the column information and data in the DBF file.
The WebService data uses a WebService site as a data source of the ETL.
The JMS data defines information for connecting the JMS server.
The Mail data defines information for connecting to the mail server.
The LDAP data defines information for connecting to the LDAP server.
The HBase data defines information for connecting to the HBase database.
The JSON data source provides an access function for JSON data: it can parse a JSON file and read the column information and data in the JSON file.
The MongoDB data defines information that connects to a MongoDB data server.
The Elasticsearch data defines information for connecting to an Elasticsearch data server.
On the basis of the above embodiment, the ETL-based data service method provided in the embodiment of the present invention, where the data service in the form of ETL WebService is provided to the data result on an ETL data exchange platform, includes: providing data source management, data node conversion, data node operation, function and variable calling, scheduling, monitoring and warning, authority management and/or version management service for received data.
Specifically, data source management stores the external data sources to which the ETL needs to connect when data is read or written. The ETL supports the management of database data sources, JMS data sources, file data sets (text files, Excel files, XML files, dataset files, etc.), and WebService data sources. Operations such as creating, modifying, and deleting data sources and data sets are supported, as are whole import and export of data source and data set metadata, and import and export of individual data source metadata.
The data node conversion represents a flow associated with data processing and is composed of data reading nodes, data loading nodes, data conversion nodes, correct lines, and error lines. A conversion that can be executed must contain at least one node. The start and end points of the conversion may be any nodes.
The connection lines in a conversion connect different nodes, and their direction represents the flow direction of the data. Connection lines are divided into correct lines and error lines. A correct line represents the flow of data that the node can process correctly, while an error line represents the flow of data that the component cannot process correctly. The data on an error line should be the raw input data, whose column information includes all input columns; columns indicating the error type and the error message may be added.
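A minimal sketch of routing rows to the correct and error lines, with error-type and error-message columns appended to the raw input columns as described above; `LineRouter` and the column names `ERROR_TYPE`/`ERROR_MESSAGE` are illustrative assumptions, not the platform's actual names:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LineRouter {
    public static final List<Map<String, String>> CORRECT = new ArrayList<>();
    public static final List<Map<String, String>> ERROR = new ArrayList<>();

    // Route one row: rows the (example) processing step handles go to the
    // correct line; failing rows keep all original input columns and gain
    // two extra columns describing the error.
    public static void route(Map<String, String> row) {
        try {
            Integer.parseInt(row.get("amount")); // example processing step
            CORRECT.add(row);
        } catch (NumberFormatException e) {
            Map<String, String> errRow = new LinkedHashMap<>(row); // raw input preserved
            errRow.put("ERROR_TYPE", "NumberFormatException");
            errRow.put("ERROR_MESSAGE", String.valueOf(e.getMessage()));
            ERROR.add(errRow);
        }
    }
}
```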
The nodes in a conversion are the functional entities for data processing. A user can open a node's attribute configuration dialog at any time to modify and save its attributes; that is, reading and displaying the configuration information of one node is independent of other nodes (the node configuration dialog can be opened without a connected input node). The configuration information may be saved at any time; if the configuration is incorrect or incomplete, the user is prompted but not prevented from saving. When nodes are configured, database-related information is obtained from the ETL metadata base, so no data source needs to be connected. A conversion is also referred to as a data flow because, once started, its nodes execute simultaneously: data flows continuously from one node to another, and the conversion stops after all data has been processed.
A data node job is a flow that controls the execution sequence and process of conversions and other job nodes. A job consists of nodes and connection lines; through a job, a user can control the precedence of, and dependencies between, conversions and other job nodes, so a job is also referred to as a control flow.
A job is composed of job nodes and connection lines and may start or end with any job node. A job must contain at least one job node; if it contains several job nodes, connections between them are optional. A job node may have any number of input and output connections. Jobs may be nested, i.e., one job may be executed as a node in another job.
The connection lines in a job represent the execution sequence of the job nodes and are divided into success lines, failure lines, completion lines, and condition lines. A success line indicates that the subsequent node executes only if the job node executed successfully; a failure line indicates that the subsequent node executes only after the job node has failed; a completion line indicates that the subsequent node executes regardless of whether the job node succeeded or failed; and a condition line indicates that the subsequent node executes when a certain condition is met.
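The four connection-line semantics can be captured in one small decision function; `JobLine` and its names are illustrative, not the platform's API:

```java
public class JobLine {
    public enum LineType { SUCCESS, FAILURE, COMPLETION, CONDITION }

    // Decide whether a connection line fires, given the predecessor node's
    // outcome and (for condition lines) whether the condition holds.
    public static boolean fires(LineType line, boolean nodeSucceeded, boolean conditionMet) {
        switch (line) {
            case SUCCESS:    return nodeSucceeded;   // fire only on success
            case FAILURE:    return !nodeSucceeded;  // fire only on failure
            case COMPLETION: return true;            // fire either way
            case CONDITION:  return conditionMet;    // fire when condition holds
            default:         return false;
        }
    }
}
```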
Function and variable calling uses functions to process data and extend system functionality. In addition to system functions, the ETL also supports user-defined functions.
Scheduling is divided into two categories: execute once and repeated execution. A created schedule may be set on a job or a conversion node.
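The two scheduling categories map naturally onto one-shot and fixed-rate scheduling; the `FlowScheduler` class below is a hedged sketch built on the standard library, not the ETL's own scheduler:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class FlowScheduler {
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();

    // Execute-once schedule: run the flow a single time after a delay.
    public ScheduledFuture<?> scheduleOnce(Runnable flow, long delayMs) {
        return executor.schedule(flow, delayMs, TimeUnit.MILLISECONDS);
    }

    // Repeated-execution schedule: run the flow at a fixed period.
    public ScheduledFuture<?> scheduleRepeated(Runnable flow, long periodMs) {
        return executor.scheduleAtFixedRate(flow, 0, periodMs, TimeUnit.MILLISECONDS);
    }

    public void shutdown() {
        executor.shutdownNow();
    }
}
```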
Monitoring and alerting is based on the consideration that not all flows can be watched running in the foreground, for example those under scheduled execution. The running process of such a flow, i.e., a background flow, can then be viewed through the monitored historical running instances.
ETL monitoring is a module for viewing the conversion or job run logs created by the currently logged-in user. Current running instances and historical running instances can be viewed separately: under each conversion or job, a current running instance is one that is still running and has not yet finished, while a historical running instance is one that has finished running. The historical running instances display at most 100 records.
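The 100-record cap on historical running instances can be sketched as a bounded history; `RunHistory` is a hypothetical illustration of the eviction behavior, not the monitoring module itself:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class RunHistory {
    private static final int MAX_HISTORY = 100; // cap described for historical instances
    private final Deque<String> history = new ArrayDeque<>();

    // Record a finished run; once the cap is exceeded, evict the oldest entry
    // so the newest 100 instances remain visible.
    public void record(String runInstanceId) {
        history.addFirst(runInstanceId);
        if (history.size() > MAX_HISTORY) {
            history.removeLast();
        }
    }

    public int size() {
        return history.size();
    }

    public String newest() {
        return history.peekFirst();
    }
}
```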
The running-process monitoring tree interface shows the running process. If a newly running conversion or job exists, it is monitored in real time, and the running-process monitoring tree interface displays the running conversion or job synchronously.
Rights management enables the management of the ETL by creating users and roles and assigning different rights to them. A permission is the ability to perform a certain predefined operation in the system. A role is a solution for rights management and is a collection of a set of permissions. A user is a member that can access the ETL. Permissions can be divided into two categories: function permissions and object permissions.
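The user/role/permission model described above can be sketched as follows; `RightsManager` and the permission strings are invented for illustration and carry no claim about the product's actual API:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class RightsManager {
    // The two permission categories named in the text.
    public enum PermissionKind { FUNCTION, OBJECT }

    private final Map<String, Set<String>> rolePermissions = new HashMap<>();
    private final Map<String, Set<String>> userRoles = new HashMap<>();

    // A role is a collection of permissions.
    public void grantToRole(String role, String permission) {
        rolePermissions.computeIfAbsent(role, r -> new HashSet<>()).add(permission);
    }

    // A user is assigned one or more roles.
    public void assignRole(String user, String role) {
        userRoles.computeIfAbsent(user, u -> new HashSet<>()).add(role);
    }

    // A user holds a permission if any of its roles carries it.
    public boolean hasPermission(String user, String permission) {
        for (String role : userRoles.getOrDefault(user, Set.of())) {
            if (rolePermissions.getOrDefault(role, Set.of()).contains(permission)) {
                return true;
            }
        }
        return false;
    }
}
```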
The version management operation objects mainly include entire metadata, a single project, a single conversion, a single job, a single function, a single variable, a single global user function, and a single global user variable. The main functions of version management include backing up the current version, restoring the historical version, deleting the historical version and restoring the deleted object.
On the basis of the above embodiment, in the ETL-based data service method provided in the embodiment of the present invention, providing the data result with the data service in the form of ETL WebService on the ETL data exchange platform further includes: providing the received data with the data service in the form of ETL WebService by means of a visualized ETL data exchange platform, which specifically includes: service configuration, service deployment, flow design, publishing design, user creation, user authorization, and service verification.
Specifically, taking the case in which the ETL corresponds to a WebService as an example, the file content of the service configuration is as follows:
HOST: the ETL WebService host name or IP address; PORT: the ETL WebService service port number; USERNAME: the ETL user account; PASSWORD: the ETL user password.
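Under the assumption that this configuration is a plain key-value file (the key names come from the description above; the host, port, and credential values below are placeholders, not real values), it might look like:

```properties
# Illustrative ETL WebService connection configuration
HOST=etl.example.internal
PORT=8080
USERNAME=admin
PASSWORD=secret
```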
Service deployment deploys ETLWebService.war to the application directory of the application server, for example the /tomcat/webapps directory, and executes web_monitor_start.
Flow design uses the ETL visual data-processing flow designer to design the data processing flow online. Referring specifically to fig. 3: in fig. 3, by clicking on the table/view, the text file may be output by default.
Publishing design executes the ETL WebService publishing wizard, and a data publishing plan is created in an online visual manner. The related configuration contents are entered according to the wizard; the corresponding conversion is selected under the publishing flow conversion, and the corresponding node and output are selected. For the node output, if a flow end node is selected and there is no output, save can be clicked without configuring an output. Referring specifically to fig. 4: in fig. 4 the flow end node is selected, so there is no output, i.e., no output is configured. As can be seen from fig. 4, the conversion name, the conversion node, and the node output are blank; after the selection is completed, the save-and-publish button 401 is clicked to save and publish. On this basis, after right-clicking to publish the flow WebService, the new ETL WebService is clicked to create a new flow. Referring specifically to fig. 5: in the interface of fig. 5, flow WebService settings is clicked in the selection page on the left side (the specific position is the list in which the attribute category to be viewed is selected), and the flow WebService publish name (S) is displayed in the flow WebService settings box on the right side, where the publish name is webTest; shown in the publish conversion flow (T) is "webTest"."conversion".test; browse (B) is used to select the relevant computer path; the table/view is selected at the select conversion flow node (N), and the default output is selected at the select node output (O). The check box for whether to count the called flow execution information (A) can be selected if the WebService call flow execution information needs to be tracked. Publish as independent WebService method (I) is used to set whether the publishing method is an independent WebService method.
If call whole flow (E) is selected, the ETL WebService does not return the node data result, and the whole flow is executed. If it is not selected, the configured flow node is executed, the relevant flow data of that node is returned, and the flow stops at the node without being completely executed.
User creation is used to create a management user, which performs identity authentication when the service is called. Referring specifically to fig. 6: in fig. 6, clicking the create user button 601 creates a user in the lower interface bar; specifically, the user name is u1.
User authorization is used to authorize a user. Referring specifically to fig. 7: the user name u1 is entered in user name (N), a password is entered in password (P), the configured flow webTest is checked, and the save and the restart of the ETL and WebService services are confirmed.
Service authentication is used to authenticate the service results.
On the basis of the foregoing embodiment, the ETL-based data service method provided in the embodiment of the present invention, after providing the data result with the data service in the form of ETL WebService on the ETL data exchange platform, further includes: normalizing the data result after the ETL WebService-form data service.
On the basis of the above embodiment, the method for data service based on ETL provided in the embodiment of the present invention, wherein normalizing the data result after the data service in the form of ETL WebService includes: an array specification, a JSON specification, and/or an XML specification.
The specification result of the array specification is a two-dimensional string array, which is specifically shown as follows (taking WebService as an example):
interface:
public String[][] getFlowArrayResult(String webServiceFlowName, String username, String password, String[] paramNames, String[] paramValues, int pageStart, int pageSize);
the meaning of the parameters is as follows:
webServiceFlowName: the name of the flow WebService configured above.
username: the created management account.
password: the created management password.
paramNames: an array of the parameter names of the incoming parameters; if there are no parameters, nothing is input.
paramValues: an array of the parameter values of the incoming parameters; if there are no parameters, nothing is input.
pageStart: the starting number of the node data to acquire.
pageSize: the total number of node data records to acquire.
When pageStart and pageSize are both 0 or-1, all data of the flow node is returned, and the flow is terminated at the node.
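The pageStart/pageSize rule can be illustrated with a small paging helper; `ArrayResultPager` only mimics the documented behavior (both 0 or both -1 means "return all node data") and is not the actual service implementation:

```java
import java.util.Arrays;

public class ArrayResultPager {
    // Apply the documented paging rule to a two-dimensional string result:
    // when pageStart and pageSize are both 0 or both -1, all rows are
    // returned; otherwise a window of pageSize rows starting at pageStart.
    public static String[][] page(String[][] allRows, int pageStart, int pageSize) {
        boolean all = (pageStart == 0 && pageSize == 0)
                   || (pageStart == -1 && pageSize == -1);
        if (all) {
            return allRows;
        }
        int from = Math.min(Math.max(0, pageStart), allRows.length);
        int to = Math.min(allRows.length, from + Math.max(0, pageSize));
        return Arrays.copyOfRange(allRows, from, to);
    }
}
```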
The returned result of the JSON specification is a string in JSON format. Specifically, the following is shown (taking WebService as an example):
interface:
public String getFlowJsonResult(String webServiceFlowName, String username, String password, String jsonParams, int pageStart, int pageSize);
the meaning of the parameters is as follows:
webServiceFlowName: the name of the flow WebService configured above.
username: the created management account.
password: the created management password.
jsonParams: the input parameters, passed in JSON format; with no parameters, {} is input. If variables are configured, the JSON-format input is
{"V_BEGIN":"\"begin\"","V_END":"\"end\""}.
pageStart: the starting number of nodes is obtained.
pageSize: and acquiring the total number of the node data.
When pageStart and pageSize are both 0 or-1, all data of the flow node is returned, and the flow is terminated at the node.
The XML specification is similar to the JSON specification in its parameters, and the returned result is XML, as shown below (taking the WebService as an example):
interface:
public String getFlowXMLResult(String webServiceFlowName, String username, String password, String jsonParams, int pageStart, int pageSize);
On the basis of the services provided by the foregoing embodiments, the embodiment of the invention also provides a customized service for the user: the user can publish a customized method, input a customized ETL WebService service name "testMethod", select the return type json or xml, and input the parameters of the corresponding methods through the following three check boxes. The ETL and the ETL WebService check service methods are likewise restarted. Referring specifically to fig. 8: a testMethod method is added under publish as independent WebService method (I) (the rest of fig. 8 is the same as fig. 5 and is not described again), and the call is performed using SoapUI, as shown in fig. 9. As can be seen in fig. 9, args0 and args1 correspond to username and password; args2 corresponds to jsonParams; args3 and args4 correspond to pageStart and pageSize.
The ETL-based data service method provided by the embodiment of the invention can organically combine data acquisition, cleaning and conversion, and data publishing services in a one-stop manner, so that the whole data acquisition and publishing process is convenient to operate and simple to deploy, achieving the effect of maximally utilizing information resources.
The various embodiments of the present invention are implemented by programmed processing performed by a device having a processor function. Therefore, in engineering practice, the technical solutions of the embodiments of the present invention and their functions can be packaged into various modules. On this basis, building on the above embodiments, an embodiment of the present invention provides an ETL-based data service apparatus for executing the ETL-based data service method in the above method embodiments. Referring to fig. 10, the apparatus includes:
a data acquisition module 1001, configured to send acquired data to an ETL data exchange platform to obtain a processed data result;
the data service module 1002, configured to provide a data service in the form of ETL WebService for the data result on the ETL data exchange platform, and perform WebService publishing;
The acquired data is sent to the ETL data exchange platform, the data result is provided with a data service in the form of ETL WebService on the platform, and the WebService is published; these steps are integrated together to complete the data exchange in a one-stop manner.
The ETL-based data service device provided by the embodiment of the invention adopts the data acquisition module and the data service module, and can organically combine data acquisition, cleaning conversion and data release service in a one-stop manner, so that the whole data acquisition and release process is convenient to operate and simple to deploy, and the effect of maximally utilizing information resources is achieved.
The method of the embodiment of the invention is realized by depending on the electronic equipment, so that the related electronic equipment is necessarily introduced. To this end, an embodiment of the present invention provides an electronic apparatus, as shown in fig. 11, including: the system comprises at least one processor (processor)1101, a communication Interface (Communications Interface)1104, at least one memory (memory)1102 and a communication bus 1103, wherein the at least one processor 1101, the communication Interface 1104 and the at least one memory 1102 are in communication with each other through the communication bus 1103. The at least one processor 1101 may invoke logic instructions in the at least one memory 1102 to perform the following method: sending the acquired data to an ETL data exchange platform to obtain a processed data result; providing data service in an ETL WebService form for the data result on an ETL data exchange platform, and issuing the WebService; the acquired data are sent to an ETL data exchange platform, data services in an ETL WebService form are provided for the data results on the ETL data exchange platform, WebService issuing is carried out, and the steps are integrated together to complete the data exchange in a one-stop mode.
Furthermore, the logic instructions in the at least one memory 1102 may be implemented in software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. Examples include: sending the acquired data to an ETL data exchange platform to obtain a processed data result; providing data service in an ETL WebService form for the data result on an ETL data exchange platform, and issuing the WebService; the acquired data are sent to an ETL data exchange platform, data services in an ETL WebService form are provided for the data results on the ETL data exchange platform, WebService issuing is carried out, and the steps are integrated together to complete the data exchange in a one-stop mode. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.