CN116932575B - Spark-based cross-data source operation method, device and storage medium - Google Patents

Spark-based cross-data source operation method, device and storage medium Download PDF

Info

Publication number
CN116932575B
CN116932575B (application CN202311171389.1A)
Authority
CN
China
Prior art keywords
data source
external data
sql
spark
result set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311171389.1A
Other languages
Chinese (zh)
Other versions
CN116932575A (en)
Inventor
古乐乐
徐楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Great Wall Securities Co ltd
Original Assignee
Great Wall Securities Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Great Wall Securities Co ltd filed Critical Great Wall Securities Co ltd
Priority to CN202311171389.1A priority Critical patent/CN116932575B/en
Publication of CN116932575A publication Critical patent/CN116932575A/en
Application granted granted Critical
Publication of CN116932575B publication Critical patent/CN116932575B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/2433Query languages
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2282Tablespace storage structures; Management thereof

Abstract

The application relates to the technical field of database operation and discloses a Spark-based cross-data-source operation method, device, and storage medium. The method comprises the following steps: creating a Spark session, obtaining a plurality of pre-written SQL scripts, and pre-parsing them; if a data table to be executed in an SQL script is a mapping library table, registering that mapping library table as a temporary view of the Spark session; replacing the mapping library table to be executed in the SQL script with the temporary view to obtain a new SQL script, and submitting the new SQL script to the SparkSQL engine for execution; acquiring the first result set output by the SparkSQL engine; and, according to the cross-data-source operation type, calling the corresponding operation interface of the external data source to process the first result set. The application reduces the code development amount and development complexity of Spark-based cross-data-source operations and improves their performance.

Description

Spark-based cross-data source operation method, device and storage medium
Technical Field
The present application relates to the field of database operations, and in particular, to a Spark-based cross-data source operation method, device, and storage medium.
Background
Spark is a top-level project of the Apache open-source community: a unified analysis engine that processes massive data with a distributed model. Spark SQL is an important module of Spark; it decomposes standard SQL into Spark's high-level DataFrame interface to perform distributed computation with a unified format and interface, opening it up to multiple programming languages and common algorithm libraries. SparkSQL has its own data warehouse, but with simple configuration it can also access a Hive data warehouse, letting users conveniently upgrade the engine on an existing Hive warehouse model for higher performance and a flexible computing interface. This is currently the most common data-warehouse architecture: Hive manages the warehouse metadata, and Spark serves as the unified analysis engine.
Practical data processing with Spark involves many operations across data sources, such as the columnar database HBase, the key-value database Redis, the message middleware Kafka, and relational databases that provide standard JDBC interfaces, such as MySQL, SQL Server, and Oracle. Sometimes large volumes of user data must be written into HBase and Oracle so that clients can run high-concurrency queries (for example, each user's daily stock earnings); sometimes dimension-table data must be read from Oracle and MySQL and joined against tables in the Hive data warehouse; and so on. This particularly concerns the three operations of querying, inserting into, and deleting from different external data sources, yet all three currently require users to write large amounts of complex code, so operating efficiency is low and the achieved performance is poor.
Disclosure of Invention
The main purpose of the present application is to provide a Spark-based cross-data-source operation method, device, and storage medium, aiming to solve the technical problem that existing Spark-based cross-data-source operations require writing large amounts of complex code, resulting in low operating efficiency and poor performance.
The first aspect of the present application provides a Spark-based cross-data source operation method, which includes:
creating a Spark session, obtaining a plurality of pre-written SQL scripts, and pre-parsing each of them;
judging whether a data table to be executed in the pre-parsed SQL script is a mapping library table of an external data source or not;
if the data table to be executed in the SQL script is a mapping library table, registering the mapping library table to be executed as a temporary view of the Spark session;
replacing a mapping library table to be executed in the SQL script with the temporary view to obtain a new SQL script, and submitting the new SQL script to a sparkSQL engine for execution;
acquiring a first result set output by the sparkSQL engine;
and according to the operation type of the cross-data source, calling a corresponding operation interface of the external data source to process the first result set until each SQL script is executed and then closing the Spark session.
Optionally, in a first implementation manner of the first aspect of the present application, after the determining whether the data table to be executed in the pre-parsed SQL script is a mapping library table of an external data source, the method further includes:
if the data table to be executed in the SQL script is a non-mapping library table, determining the grammar type of the SQL sentence in the SQL script;
if the SQL statement is a deleting operation and a preset SQL dialect is adopted, submitting a SELECT clause in the SQL dialect to the sparkSQL engine for execution and acquiring a returned second result set;
sequentially judging the matching degree between the fields in the second result set and the primary key of the external data source;
and determining the primary key to be deleted in the external data source according to the matching degree of the field in the second result set and the primary key of the external data source, and deleting the primary key.
Optionally, in a second implementation manner of the first aspect of the present application, the determining, according to a matching degree between the field in the second result set and the primary key of the external data source, the primary key to be deleted in the external data source and performing deletion processing includes:
if the fields in the second result set completely match the primary key of the external data source, splicing those fields into a spliced primary key, and calling the deletion interface of the external data source to delete it;
if the fields in the second result set match a prefix of the primary key of the external data source, iterating over the fields in the second result set to splice the matching fields into a spliced primary-key prefix; scanning the external data source according to the spliced primary-key prefix to obtain a set of primary keys in the external data source, and calling the deletion interface of the external data source to delete that primary-key set.
Optionally, in a third implementation manner of the first aspect of the present application, before the acquiring the written several SQL scripts, the method further includes:
and creating a mapping library table of the external data source, wherein the mapping library table is used for mapping the data structure in the external data source into a custom table structure and storing the relevant service configuration of the external data source.
Optionally, in a fourth implementation manner of the first aspect of the present application, the related service configuration includes: one or more of IP port, field mapping, and condition filtering.
Optionally, in a fifth implementation manner of the first aspect of the present application, the SQL script is developed jointly against a mapping library table of the external data source and a data table of the SparkSQL internal data source.
Optionally, in a sixth implementation manner of the first aspect of the present application, the calling, according to a cross-data source operation type, a corresponding operation interface of the external data source to process the first result set includes:
if the cross-data source operation type is an insert operation, calling an insert interface of the external data source to store the first result set into the external data source;
if the cross-data source operation type is query operation, a query interface of the external data source is called to return the first result set;
and if the cross-data source operation type is a deleting operation, calling a deleting interface of the external data source to delete the data conforming to the first result set in the external data source.
Optionally, in a seventh implementation manner of the first aspect of the present application, a sentence pattern used by the SQL dialect is:
DELETE FROM T0 [ SELECT clause ], where T0 is a data table in an external data source.
A second aspect of the present application provides a computer device comprising: a memory and at least one processor, the memory having instructions stored therein; the at least one processor invokes the instructions in the memory to cause the computer device to perform the steps of the Spark-based cross-data source operation method described above.
A third aspect of the present application provides a computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the steps of the Spark-based cross-data source operation method described above.
Compared with the prior art, the application has the following beneficial effects: the data structure in the external data source is automatically converted, via SQL and a defined mapping rule, into a mapping library table; the mapping library table is registered as a temporary view in SparkSQL; the mapping library table to be executed in the SQL script is replaced with that temporary view; and the script is submitted to the SparkSQL engine for execution. Query, insert, and delete operations can therefore be performed on the mapping library table through SQL syntax without writing complex code, which reduces the code development amount and development complexity of Spark-based cross-data-source operations and improves cross-data-source operation performance.
Drawings
FIG. 1 is a flowchart of a Spark-based cross-data source operation method according to an embodiment of the present application;
FIG. 2 is a flowchart of another embodiment of a Spark-based cross-data source operation method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a computer device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element. In the present application, "at least one (item)" means one or more, and "a plurality" means two or more; "at least two (items)" means two, three, or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: only A, only B, or both A and B, where A and B may be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship. "At least one of" or similar expressions mean any combination of the listed items; for example, "at least one of a, b, or c" may represent: a; b; c; a and b; a and c; b and c; or a, b, and c.
Referring to fig. 1, fig. 1 is a flowchart of an embodiment of a Spark-based cross-data source operation method according to an embodiment of the present application. In this embodiment, the Spark-based cross-data source operation method includes:
101. creating Spark session, obtaining a plurality of written SQL scripts and respectively pre-analyzing the SQL scripts;
In this embodiment, the SQL scripts are written in advance by the user according to actual service requirements; the amount of code is small, and the scripts implement the cross-data-source operations. Different SQL scripts may use different data-table formats, and scripts using different formats are processed differently. Therefore, this embodiment first pre-parses each SQL script before submitting it to the SparkSQL execution engine, so that the format of the data table to be executed is known in advance and the corresponding processing can be applied according to that format.
In one embodiment, to reduce code development and complexity, this embodiment introduces a new data-table format: the mapping library table, which must be created before the SQL script is written. A mapping library table maps tables or data structures in unstructured or weakly structured external data sources (e.g., an HBase library or a Redis library) into a custom table structure, forming a new database table. The custom table structure may be completely identical to, partially identical to, or completely different from the data structure of the external data source, and the library name and table name can be chosen according to service requirements. In addition, the mapping library table stores the relevant service configuration of the external data source, including one or more of: IP port, field mapping, and condition filtering. Applying the projection (field mapping) and the filter (condition filtering) improves the performance of external data-source access. The mapping library table thus not only normalizes the data structure of the external data source, but also further reduces the volume of data to be read, improving operational performance.
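As a rough illustration of what such a mapping library table definition might contain, the sketch below models one in plain Python. The class and all field names (endpoint, field_mapping, filter_expr, etc.) are assumptions for demonstration only, not structures from the patent itself.

```python
from dataclasses import dataclass, field

# Hypothetical model of a "mapping library table": a custom table structure
# mapped from an external data source, plus the service configuration the
# text mentions (IP port, field mapping, condition filtering).
@dataclass
class MappingTable:
    db_name: str
    table_name: str
    source_type: str                                  # e.g. "hbase", "redis", "jdbc"
    endpoint: str                                     # IP:port of the external source
    field_mapping: dict = field(default_factory=dict) # external column -> view column
    filter_expr: str = ""                             # pushed-down condition filtering

# Example instance mapping an HBase table into a custom schema.
hbase_user = MappingTable(
    db_name="ext_hbase",
    table_name="user_profit",
    source_type="hbase",
    endpoint="10.0.0.5:2181",
    field_mapping={"cf:custid": "custid", "cf:busi_date": "busi_date"},
    filter_expr="busi_date >= '20200101'",
)
```

Only the columns listed in `field_mapping`, after applying `filter_expr`, would need to be read from the external source, which is the data-reduction effect described above.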
In addition, other parameters are configured according to the characteristics of different types of external data sources. For example, the mapping library table of an HBase database specifies the corresponding namespace, table name, column family, and the splicing rule for the rowkey (HBase's row key); the mapping library table of Redis specifies the hash mode, the Redis key prefix, the primary key, the combination of fields stored in the hash, the splicing rule for field names within the hash (the values of different fields are concatenated and used as the hash field name), and so on. The Redis key prefix and the primary key can be spliced into a unique Redis key, so that multiple records can be stored under one Redis key, improving query efficiency.
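The splicing rules above can be sketched as simple concatenation functions. The separators and field order below are assumptions chosen for demonstration; the patent leaves the concrete splicing rule configurable per mapping library table.

```python
# Illustrative splicing rules: concatenate primary-key field values into a
# single HBase rowkey, and splice a Redis key prefix with the primary key
# into a unique Redis key, as described in the text.
def splice_rowkey(record, key_fields, sep="_"):
    """Join the values of the primary-key fields, in order, into one rowkey."""
    return sep.join(str(record[f]) for f in key_fields)

def splice_redis_key(prefix, record, key_fields, sep=":"):
    """Prefix + spliced primary key = one unique Redis key."""
    return prefix + sep + splice_rowkey(record, key_fields, sep)

row = {"custid": "10086", "busi_date": "20201220"}
rk = splice_rowkey(row, ["custid", "busi_date"])   # "10086_20201220"
```

Storing several records under one spliced Redis key is what allows the single-key lookup described above instead of one round trip per record.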
In one embodiment, to reduce the code development amount and complexity, the SQL scripts are written jointly against the mapping library tables of external data sources and the data tables of the SparkSQL internal data source. When writing an SQL script, federated queries, inserts, deletes, and other operations can be performed across the mapping library tables of external data sources and the data tables of the SparkSQL internal data source, which reduces the amount of code and improves operational performance.
102. Judging whether a data table to be executed in the pre-parsed SQL script is a mapping library table of an external data source or not;
103. if the data table to be executed in the SQL script is a mapping library table, registering the mapping library table to be executed as a temporary view of Spark session;
104. replacing a mapping library table to be executed in the SQL script with a temporary view to obtain a new SQL script, and submitting the new SQL script to a sparkSQL engine for execution;
In this embodiment, after each SQL script is pre-parsed, it is further determined whether the data table to be executed in the script is a mapping library table of an external data source; for example, this may be judged from the table's naming rule. If the data table to be executed is a mapping library table, it is registered as a temporary view of the Spark session. Like a real data table, a view contains a set of named columns and rows that can be queried, but a view holds only metadata and does not exist in the database as a dataset that stores the data itself. A temporary view is usable only in the current session, cannot be used across sessions, and is automatically deleted after the current Spark session is closed.
After the mapping library tables are registered as temporary views of the Spark session, all mapping library tables of external data sources to be executed in the SQL script are replaced with their temporary views to obtain a new SQL script, which is then submitted to the SparkSQL engine for execution. Registering a data table of an external data source as a temporary view in SparkSQL lets it participate directly in SQL computations, including queries, inserts, and deletes, without writing complex code. Meanwhile, because the mapping library table of the external data source is created and registered as a SparkSQL temporary view, at execution time the external data source can be fully scanned and a result set quickly output according to the field mapping and condition filtering configured in the mapping library table.
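The table-to-view replacement step might look like the sketch below. This is an assumed simplification: a production implementation would rewrite the parsed SQL plan rather than the raw text, and the table and view names here are hypothetical.

```python
import re

# Illustrative sketch: replace each mapping-library-table name in the SQL
# text with its registered temporary-view name. Whole-word regex substitution
# is used for demonstration only; a real implementation would use a SQL parser.
def rewrite_sql(sql, view_of):
    for table, view in view_of.items():
        sql = re.sub(r"\b" + re.escape(table) + r"\b", view, sql)
    return sql

views = {"hbase_user_profit": "tmp_view_user_profit"}
new_sql = rewrite_sql(
    "SELECT custid, profit FROM hbase_user_profit WHERE busi_date = '20201220'",
    views,
)
```

The rewritten script references only names the SparkSQL engine already knows (internal tables and registered temporary views), so it can be submitted unchanged.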
105. Acquiring a first result set output by a sparkSQL engine;
106. and according to the operation type of the cross-data source, calling a corresponding operation interface of the external data source to process the first result set until each SQL script is executed and then closing the Spark session.
In this embodiment, after the execution of the SparkSQL engine is submitted by the SQL script, the first result set output by the SparkSQL engine is waited, and then the corresponding operation interface of the external data source can be called to process the first result set according to the operation type of the cross-data source, for example, data is queried from the external data source, data is inserted into the external data source, or data is deleted from the external data source.
In one embodiment, the step 106 further includes:
if the cross-data source operation type is an insert operation, an insert interface of an external data source is called to store a first result set in the external data source;
if the cross-data source operation type is query operation, a query interface of an external data source is called to return a first result set;
and if the cross-data source operation type is a deleting operation, calling a deleting interface of the external data source to delete the data which accords with the first result set in the external data source.
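The three branches above amount to a dispatch on the operation type. The sketch below illustrates this with an in-memory stand-in for the external data source; the interface object and its method names are assumptions, not the patent's actual API.

```python
# Hypothetical dispatch of the first result set to the external data source's
# operation interface, following the three branches described in the text.
def handle_result(op_type, result_set, ext):
    if op_type == "insert":
        return ext.insert(result_set)   # store the result set externally
    if op_type == "query":
        return result_set               # hand the result set back to the caller
    if op_type == "delete":
        return ext.delete(result_set)   # delete matching external data
    raise ValueError(f"unsupported cross-data-source operation: {op_type}")

# Tiny in-memory stand-in for an external data source, for demonstration.
class FakeSource:
    def __init__(self):
        self.rows = {"k1": 1, "k2": 2}
    def insert(self, rs):
        self.rows.update(rs)
        return len(rs)
    def delete(self, keys):
        for k in keys:
            self.rows.pop(k, None)
        return len(keys)

src = FakeSource()
handle_result("delete", ["k1"], src)   # removes k1 from the fake source
```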
In this embodiment, the data structure in the external data source is automatically converted, via SQL and a defined mapping rule, into a mapping library table; the mapping library table is registered as a temporary view in SparkSQL, the mapping library table to be executed in the SQL script is replaced with that temporary view, and the script is submitted to the SparkSQL engine for execution. Query, insert, and delete operations can therefore be performed on the mapping library table through SQL syntax without writing complex code, which reduces the code development amount and development complexity of Spark-based cross-data-source operations and improves cross-data-source operation performance.
Referring to fig. 2, fig. 2 is a flowchart of another embodiment of a Spark-based cross-data source operation method according to an embodiment of the present application. In this embodiment, the Spark-based cross-data source operation method includes:
201. creating Spark session, obtaining a plurality of written SQL scripts and respectively pre-analyzing the SQL scripts;
202. judging whether a data table to be executed in the pre-parsed SQL script is a mapping library table of an external data source or not;
203. if the data table to be executed in the SQL script is a non-mapping library table, determining the grammar type of the SQL sentence in the SQL script;
in this embodiment, for the case that the external data source needs to change or delete the error data such as the dirty data, if the data table stored in the external data source is at TB level or above, for example, stock exchange data of securities companies, hundreds of billions are recorded in daily transactions, if the data is rewritten after being deleted in a certain day, it may take tens of minutes to acquire the primary key by filtering the data after full table scanning and then delete the primary key, and the processing speed needs to be improved.
For deletion from external data sources below TB level, the method of the foregoing embodiment may be used. For a data-deletion operation on an HBase table, a mapping library table for the HBase table can first be created and registered as a SparkSQL temporary view; at execution time the HBase table is fully scanned, filtering and other operations are performed in SQL, the SparkSQL engine returns a result set (DataFrame) containing the HBase rowkey field (HBase's row key) by default, and the HBase deletion interface is then called directly to delete by rowkey.
In this embodiment, for deletion from external data sources at TB level and above, a custom SQL dialect is used to perform fast deletion.
204. If the SQL statement is a deleting operation and the preset SQL dialect is adopted, submitting the SELECT clause in the SQL dialect to a spark SQL engine for execution and acquiring a returned second result set;
in one embodiment, the SQL dialect uses the following sentence patterns: DELETE FROM T0 [ SELECT clause ], where T0 is a data table in an external data source. The SELECT clause is not particularly limited in syntax, and the SELECT clause may be a complex key query, and the query table may be a data table in an internal data source of SparkSQL or a mapping library table of an external data source.
205. Sequentially judging the matching degree of the field in the second result set and the main key of the external data source;
206. and determining the main key to be deleted in the external data source according to the matching degree of the field in the second result set and the main key of the external data source, and deleting the main key.
In this embodiment, before performing the deletion, the matching degree between the fields in the second result set and the primary key of the external data source must first be judged. Three situations exist: full match, partial (prefix) match, and no match, each handled differently in this embodiment. Full matches and prefix matches allow the deletion to proceed; no match indicates that there is no corresponding data, so the deletion fails.
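The three-way classification can be sketched as follows, under the assumption (consistent with the composite-rowkey examples later in the text) that the primary key is an ordered list of columns and field order is significant.

```python
# Illustrative judgment of the matching degree between result-set fields and
# the external source's ordered composite primary key: full match, prefix
# match, or no match.
def match_degree(result_fields, primary_key):
    if list(result_fields) == list(primary_key):
        return "full"
    n = len(result_fields)
    if 0 < n < len(primary_key) and list(result_fields) == list(primary_key[:n]):
        return "prefix"
    return "none"

degree = match_degree(["custid"], ["custid", "busi_date"])  # "prefix"
```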
In one embodiment, the step 206 further includes:
(1) If the fields in the second result set are completely matched with the main key of the external data source, splicing the fields in the second result set, which are completely matched with the main key of the external data source, into a spliced main key, and calling a deleting interface of the external data source to delete the spliced main key;
in this embodiment, for the case where the primary keys are completely matched, the matched fields may be spliced into one spliced primary key, and then deleted. The columns of primary keys of the external data source may together form a primary key in sequence, like the joint primary key of MySQL. For example, the columns of the primary key of the HBase library table may be concatenated to appear as one RowKey, such that each column of the HBase library table may be considered as actually part of an integral primary key.
For example, suppose the HBase library table is a history table holding massive TB/PB-level data, the data of a certain day needs to be deleted, and the rowkey of the HBase library table is known to be formed by splicing the fields custid and busi_date. In addition, the value of custid+busi_date, i.e., the rowkey value, can be constructed independently from the Hive user table in the SparkSQL internal data source.
the SQL dialect is adopted for deleting operation, and the method is concretely as follows:
delete from hbase library table selected tested, '20201220' as busi_date from hive user table;
The SQL dialect is executed by the SparkSQL engine to obtain the rowkey of every record matching the SQL query conditions, and the HBase deletion interface is then called to batch-delete the column-family data in the corresponding library table according to each rowkey value.
(2) If the fields in the second result set are matched with the main key prefixes of the external data sources, splicing the fields matched with the main key prefixes of the external data sources in the second result set into a spliced main key prefix by iterating the fields in the second result set; and scanning the external data source according to the spliced main key prefix, obtaining a main key set in the external data source, and calling a deletion interface of the external data source to delete the main key set.
In this embodiment, for a partial primary-key match, i.e., a primary-key prefix match, the matched fields may be spliced into a spliced primary-key prefix; the set of primary keys in the external data source is then obtained from that prefix, and finally the deletion is performed.
For example, every column in an HBase library table is assigned to a column family. The column family is part of the table schema, while the individual columns are not. Column names are prefixed with their column family: for example, courses:history and courses:math both belong to the courses column family. The rowkey set of the HBase library table can be obtained quickly via the primary-key prefix, after which the deletion interface is called to perform fast deletion.
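Because HBase stores rowkeys in lexicographic order, a prefix lookup reduces to a range scan. The sketch below imitates that with a sorted Python list; it is a stand-in for demonstration, not HBase's actual scan API.

```python
import bisect

# Illustrative prefix-match deletion step: rowkeys sorted lexicographically
# (as in HBase) can be range-scanned by a spliced primary-key prefix, and the
# resulting key set deleted in one batch.
def scan_by_prefix(sorted_keys, prefix):
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, prefix + "\xff")  # upper bound of prefix range
    return sorted_keys[lo:hi]

keys = sorted(["10086_20201219", "10086_20201220", "10087_20201220"])
hits = scan_by_prefix(keys, "10086_")  # every rowkey for custid 10086
```

The returned key set is then handed to the external source's batch-deletion interface, avoiding a full-table scan.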
This embodiment achieves rapid deletion of data within a specified range from the massive data of an external data source, and introduces a custom SQL dialect so that the deletion can be implemented without writing code, greatly improving cross-data-source operation performance.
Fig. 3 is a schematic diagram of a computer device according to an embodiment of the present application. The computer device 500 may vary considerably in configuration and performance, and may include one or more processors (central processing units, CPU) 510 and a memory 520, together with one or more storage media 530 (e.g., one or more mass storage devices) storing application programs 533 or data 532. The memory 520 and the storage medium 530 may be transitory or persistent storage. The program stored in the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations on the computer device 500. Further, the processor 510 may be arranged to communicate with the storage medium 530 to execute the series of instruction operations in the storage medium 530 on the computer device 500.
The computer device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. Those skilled in the art will appreciate that the computer-device architecture shown in Fig. 3 does not limit the computer device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The present application also provides a computer device, which includes a memory and a processor; the memory stores computer readable instructions that, when executed by the processor, cause the processor to perform the steps of the Spark-based cross-data source operation method in the foregoing embodiments. The present application also provides a computer readable storage medium, which may be a non-volatile or a volatile computer readable storage medium, having stored therein instructions that, when run on a computer, cause the computer to perform the steps of the Spark-based cross-data source operation method.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them; although the application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (8)

1. A Spark-based cross-data source operation method, characterized by comprising the following steps:
creating a Spark session, obtaining a plurality of written SQL scripts, and pre-parsing each SQL script;
judging whether a data table to be executed in the pre-parsed SQL script is a mapping library table of an external data source;
if the data table to be executed in the SQL script is a mapping library table, registering the mapping library table to be executed as a temporary view of the Spark session;
replacing a mapping library table to be executed in the SQL script with the temporary view to obtain a new SQL script, and submitting the new SQL script to a sparkSQL engine for execution;
acquiring a first result set output by the sparkSQL engine;
invoking, according to the cross-data source operation type, a corresponding operation interface of the external data source to process the first result set, until each SQL script has been executed, and then closing the Spark session;
if the data table to be executed in the SQL script is a non-mapping library table, determining the syntax type of the SQL statement in the SQL script;
if the SQL statement is a delete operation and adopts a preset SQL dialect, submitting the SELECT clause in the SQL dialect to the sparkSQL engine for execution and acquiring a returned second result set;
sequentially judging the degree of matching between the fields in the second result set and the primary key of the external data source;
if the fields in the second result set completely match the primary key of the external data source, splicing the fields in the second result set that completely match the primary key of the external data source into a spliced primary key, and calling a deletion interface of the external data source to delete by the spliced primary key;
if the fields in the second result set match a prefix of the primary key of the external data source, splicing, by iterating over the fields in the second result set, the fields that match the primary key prefix of the external data source into a spliced primary key prefix; and scanning the external data source according to the spliced primary key prefix, obtaining a primary key set in the external data source, and calling the deletion interface of the external data source to delete the primary key set.
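The full-match and prefix-match branches of claim 1 can be sketched in plain Python. The field names, the `|` separator, and the in-memory key list below are assumptions for illustration; a real implementation would call the external data source's scan and deletion interfaces instead:

```python
def splice_and_match(result_rows, pk_fields, external_keys, sep="|"):
    """For each row of the second result set, splice either a full
    primary key (all pk fields present) or a primary-key prefix
    (leading pk fields present), then resolve prefixes against the
    external source's key set via a prefix scan."""
    keys_to_delete = set()
    for row in result_rows:
        spliced = []
        for field in pk_fields:            # iterate fields in primary-key order
            if field not in row:
                break
            spliced.append(str(row[field]))
        if not spliced:
            continue                       # no leading pk field matched
        key = sep.join(spliced)
        if len(spliced) == len(pk_fields):
            keys_to_delete.add(key)        # full match: delete directly
        else:                              # prefix match: scan, then delete the set
            keys_to_delete.update(
                k for k in external_keys
                if k.startswith(key + sep) or k == key
            )
    return sorted(keys_to_delete)
```

For example, with a composite key of (uid, dt), a row carrying both fields yields one spliced key, while a row carrying only uid yields a prefix that expands to every matching key in the external source.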
2. The Spark-based cross-data source operation method according to claim 1, further comprising, before the acquiring the written SQL scripts:
and creating a mapping library table of the external data source, wherein the mapping library table is used for mapping the data structure in the external data source into a custom table structure and storing the relevant service configuration of the external data source.
3. The Spark-based cross-data source operation method of claim 2, wherein the related service configuration comprises one or more of: IP address and port, field mapping, and condition filtering.
4. The Spark-based cross-data source operation method according to claim 1, wherein the SQL script is developed jointly against the mapping library table of the external data source and the data tables of the Spark SQL internal data source.
5. The Spark-based cross-data source operation method according to claim 1, wherein the invoking the corresponding operation interface of the external data source to process the first result set according to the cross-data source operation type comprises:
if the cross-data source operation type is an insert operation, calling an insert interface of the external data source to store the first result set into the external data source;
if the cross-data source operation type is query operation, a query interface of the external data source is called to return the first result set;
and if the cross-data source operation type is a deleting operation, calling a deleting interface of the external data source to delete the data conforming to the first result set in the external data source.
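A minimal sketch of this three-way dispatch (the `ExternalSource` stub and its method names are illustrative assumptions, not the patent's actual interfaces):

```python
class ExternalSource:
    """Stub external data source that records calls to its interfaces."""
    def __init__(self):
        self.stored, self.deleted = [], []

    def insert(self, rows):
        self.stored.extend(rows)

    def delete(self, rows):
        self.deleted.extend(rows)

def process_result_set(op_type, result_set, source):
    """Route the first result set to the matching external-source interface."""
    if op_type == "insert":
        source.insert(result_set)      # store the result set externally
    elif op_type == "query":
        return result_set              # query simply returns the rows
    elif op_type == "delete":
        source.delete(result_set)      # delete matching external data
    else:
        raise ValueError(f"unsupported cross-data source operation: {op_type}")
```

The design point is that the Spark SQL engine only ever produces a result set; which side effect that result set triggers on the external source is decided afterwards by the operation type.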
6. The Spark-based cross-data source operation method according to claim 1, wherein the SQL dialect adopts the following sentence pattern:
DELETE FROM T0 [ SELECT clause ], where T0 is a data table in the external data source.
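One way to recognize this sentence pattern is a simple regular expression; the regex and function name here are assumptions for illustration, and a production parser would also handle comments, quoting, and an omitted SELECT clause:

```python
import re

# Matches "DELETE FROM <table> SELECT ..."; DOTALL lets the SELECT
# clause span multiple lines, IGNORECASE accepts lowercase keywords.
DIALECT = re.compile(
    r"^\s*DELETE\s+FROM\s+(\w+)\s+(SELECT\b.*)$",
    re.IGNORECASE | re.DOTALL,
)

def parse_delete_dialect(sql):
    """Split the custom dialect into (target_table, select_clause).
    Returns None when the statement does not use the dialect, so the
    caller can fall back to ordinary SQL handling."""
    m = DIALECT.match(sql.strip())
    return (m.group(1), m.group(2).strip()) if m else None
```

The extracted SELECT clause is what gets submitted to the sparkSQL engine, while the table name identifies which external data source's deletion interface to call.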
7. A computer device, the computer device comprising: a memory and at least one processor, the memory having instructions stored therein;
the at least one processor invoking the instructions in the memory to cause the computer device to perform the Spark-based cross-data source operation method of any of claims 1-6.
8. A computer readable storage medium having instructions stored thereon, which when executed by a processor implement the Spark-based cross-data source operation method of any of claims 1-6.
CN202311171389.1A 2023-09-12 2023-09-12 Spark-based cross-data source operation method, device and storage medium Active CN116932575B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311171389.1A CN116932575B (en) 2023-09-12 2023-09-12 Spark-based cross-data source operation method, device and storage medium


Publications (2)

Publication Number Publication Date
CN116932575A (en) 2023-10-24
CN116932575B (en) 2023-12-15

Family

ID=88375603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311171389.1A Active CN116932575B (en) 2023-09-12 2023-09-12 Spark-based cross-data source operation method, device and storage medium

Country Status (1)

Country Link
CN (1) CN116932575B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102081661A (en) * 2011-01-19 2011-06-01 吉林大学 Data integration method and system of heterogeneous relational database based on XML (Extensive Makeup Language)
CN110377621A (en) * 2019-07-17 2019-10-25 深圳前海微众银行股份有限公司 A kind of interface processing method and device based on computing engines
CN111198863A (en) * 2019-12-27 2020-05-26 天阳宏业科技股份有限公司 Rule engine and implementation method thereof
CN111324628A (en) * 2020-02-20 2020-06-23 山东爱城市网信息技术有限公司 Unified SQL query method based on Spark SQL
CN113821565A (en) * 2021-09-10 2021-12-21 上海得帆信息技术有限公司 Method for synchronizing data of multiple data sources
CN114461723A (en) * 2021-12-20 2022-05-10 中盈优创资讯科技有限公司 Spark calculation engine-based data difference comparison method and device
CN115809294A (en) * 2022-12-09 2023-03-17 武汉达梦数据技术有限公司 Rapid ETL method based on Spark SQL temporary view
CN115827777A (en) * 2022-11-21 2023-03-21 中国人民财产保险股份有限公司 Self-adaptive synchronization and difference identification method, device and equipment for multiple data sources
CN116701346A (en) * 2023-05-04 2023-09-05 深圳市雁联计算系统有限公司 Cross-data-source data migration method and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299131B (en) * 2018-11-14 2020-05-29 百度在线网络技术(北京)有限公司 Spark query method and system supporting trusted computing
US11461328B2 (en) * 2020-09-21 2022-10-04 Oracle International Corporation Method for using a sematic model to transform SQL against a relational table to enable performance improvements




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant