CN116204575A - Method, device, equipment and computer storage medium for importing data into database

Info

Publication number
CN116204575A
Authority
CN
China
Prior art keywords
data
task
database
source
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310257630.6A
Other languages
Chinese (zh)
Inventor
蒋松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
CCB Finetech Co Ltd
Original Assignee
China Construction Bank Corp
CCB Finetech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp, CCB Finetech Co Ltd filed Critical China Construction Bank Corp
Priority to CN202310257630.6A priority Critical patent/CN116204575A/en
Publication of CN116204575A publication Critical patent/CN116204575A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application discloses a method, a device, equipment and a computer storage medium for importing data into a database. A target data file to be transmitted is subjected to slicing processing, and each fragment file is treated as one transmission task, which increases the degree of parallelism of data transmission. When a transmission task is executed, the data reading position in the target fragment file is located according to the task offset of the task, the database import instructions of the batch are generated based on the data read from that position, and the generated instructions are sent to the target database to import the data; at the same time, the task offset is updated so that subsequent executions can know which data has already been transmitted. A breakpoint-resume function for the data is thereby realized, the need to delete the data and re-execute the transmission when a task fails is avoided, and the efficiency of data import is improved.

Description

Method, device, equipment and computer storage medium for importing data into database
Technical Field
The present invention relates to the field of computer technologies, and in particular, to the field of database technologies, and provides a method, an apparatus, a device, and a computer storage medium for importing data into a database.
Background
With the popularization of big data technology, more and more enterprises have introduced big data technology stacks, using big data technology to process and mine massive data and then transmitting the calculation results to downstream systems for use.
In the related art, calculation results are provided to downstream systems mainly in two ways. One is to unload the calculation result data to local storage after it is obtained and distribute it to the downstream system through a file transmission tool, so that the downstream system can load it into a database with a data loading tool; however, this approach has a long processing flow, is error-prone, and incurs high latency. The other is to write the results directly into the downstream database through the big data technology stack; however, if writing fails partway through, all the data must first be deleted and then imported again from the beginning for the sake of data integrity, which wastes a great deal of time and makes the import efficiency extremely low.
Disclosure of Invention
The embodiments of the application provide a method, a device, equipment and a computer storage medium for importing data into a database, which are used to realize a breakpoint-resume function when data is imported into a database and to improve the efficiency of data import.
In one aspect, there is provided a method of importing data into a database, the method comprising:
reading a target data file to be imported based on a preset source data storage path;
performing slicing processing on the target data file, and updating a transmission task list based on the plurality of fragment files thus obtained; in the transmission task list, each transmission task corresponds to one fragment file;
cyclically traversing the transmission task list until the transmission task list is empty, reading one transmission task per traversal, and importing the target fragment file corresponding to the transmission task into a target database in batches, wherein the importing process of each batch comprises the following steps:
locating the data reading position in the target fragment file according to the corresponding task offset, and generating the database import instructions of the batch based on the data read from that position; the task offset is used for indicating the data that has already been imported in the target fragment file;
sending the generated database import instructions to the target database, and updating the task offset.
In one aspect, there is provided an apparatus for importing data into a database, the apparatus comprising:
the data reading unit is used for reading a target data file to be imported based on a preset source data storage path;
the slicing processing unit is used for performing slicing processing on the target data file and updating a transmission task list based on the plurality of fragment files thus obtained; in the transmission task list, each transmission task corresponds to one fragment file;
the parallel transmission unit is used for cyclically traversing the transmission task list until the transmission task list is empty, reading one transmission task per traversal, and importing the target fragment file corresponding to the transmission task into a target database in batches, wherein the importing process of each batch comprises the following steps:
locating the data reading position in the target fragment file according to the corresponding task offset, and generating the database import instructions of the batch based on the data read from that position; the task offset is used for indicating the data that has already been imported in the target fragment file;
sending the generated database import instructions to the target database, and updating the task offset.
In a possible embodiment, the data reading unit is further configured to:
in response to a database import operation, obtaining the task operation parameters required by the database import task, wherein the task operation parameters comprise data source information and source data operation information;
determining the data source type of the data source to be imported based on the data source information;
if the data source type is an internal data source and it is determined, based on the source data operation information, that no database operation is required for the source data, updating the storage path of the internal data source to be the source data storage path;
if the data source type is an external data source and it is determined, based on the source data operation information, that no database operation is required for the source data, reading the source data from the external data source and storing the source data into the source data storage path.
In a possible embodiment, the data reading unit is further configured to:
if the data source type is an internal data source and it is determined, based on the source data operation information, that a database operation needs to be performed on the source data, after the source data is read from the internal data source and the database operation is executed, storing the operated source data into the internal data source, and updating the storage path of the internal data source to be the source data storage path;
if the data source type is an external data source and it is determined, based on the source data operation information, that a database operation needs to be performed on the source data, reading the source data from the external data source, executing the database operation, and then storing the operated source data into the source data storage path.
In one possible implementation, the source data storage path indicates a storage path of a distributed file system; the data reading unit is specifically configured to:
and reading source data from the external data source through a big data calculation engine and storing the source data into the distributed file system.
In a possible implementation manner, the slicing processing unit is specifically configured to:
if a reference field required by the slicing processing is specified, determining a slicing reference value based on the value of the reference field in the target data file;
if no reference field required by the slicing processing is specified, determining the slicing reference value based on the value of the primary key in the target data file;
determining the fragment dividing points for the slicing processing of the target data file based on the slicing reference value and the designated number of fragments;
performing slicing processing on the target data file based on the fragment dividing points, to correspondingly obtain the plurality of fragment files.
In a possible embodiment, the parallel transmission unit is specifically configured to:
reading one transmission task at each traversal, and adding the transmission task to a task execution pool;
reading, through an idle target thread among the plurality of parallel transmission threads, a transmission task to be executed from the task execution pool, and importing the target fragment file corresponding to the transmission task into the target database in batches.
In a possible embodiment, the parallel transmission unit is specifically configured to:
locating the data reading position in the target fragment file according to the corresponding task offset;
determining whether the data reading position is the end of the target fragment file;
if not, sequentially reading data starting from the data reading position, and each time data is read, generating a database import instruction of the batch based on the read data;
wherein sending the generated database import instructions to the target database comprises:
if the number of generated database import instructions reaches the upper limit for the batch, sending the generated database import instructions to the target database.
In a possible embodiment, the parallel transmission unit is specifically configured to:
if so, determining whether the updated transmission task list still contains incomplete transmission tasks;
if not, updating the task state of the database import task to a successful state;
if so, continuing to read a new transmission task, and adding the new transmission task to the task execution pool.
In a possible embodiment, the parallel transmission unit is further configured to:
if the database import task fails, outputting a task failure indication;
in response to a retransmission instruction initiated for the task failure indication, continuing to read a transmission task from the transmission task list;
locating the data reading position in the target fragment file based on the task offset corresponding to the transmission task, and continuing to execute the transmission task based on the data reading position.
In one aspect, a computer device is provided comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the methods described above when the computer program is executed.
In one aspect, a computer storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of any of the methods described above.
In one aspect, a computer program product is provided that includes a computer program stored in a computer readable storage medium. The processor of the computer device reads the computer program from the computer readable storage medium, and the processor executes the computer program so that the computer device performs the steps of any of the methods described above.
In the embodiment of the present application, a target data file to be transmitted is subjected to slicing processing, and each fragment file is treated as one transmission task. When a transmission task is executed, the data reading position in the target fragment file is located according to the task offset of the task, the database import instructions of the batch are generated based on the data read from that position, the generated instructions are sent to the target database to import the data, and the task offset is updated at the same time so that subsequent executions can know which data has already been transmitted and thus locate the next data reading position. By recording the task offset, when a single transmission task fails or the whole database import task fails, the data reading position for retransmission can be located based on the task offset. This realizes a breakpoint-resume function for the data, avoids having to delete the data and re-execute the transmission when a task fails, and improves the efficiency of data import.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the related art, the drawings required for describing the embodiments or the related art are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained from the provided drawings without inventive effort by a person of ordinary skill in the art.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;
FIG. 2 is a diagram of a system architecture for importing data into a database according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating a method for importing data into a database according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of a data reading process according to an embodiment of the present application;
fig. 5 is a schematic flow chart of a slicing process according to an embodiment of the present application;
fig. 6 is a flow chart of a data transmission process according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an apparatus for importing data into a database according to an embodiment of the present application;
fig. 8 is a schematic diagram of a composition structure of a computer device according to an embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure. Embodiments and features of embodiments in this application may be combined with each other arbitrarily without conflict. Also, while a logical order is depicted in the flowchart, in some cases, the steps depicted or described may be performed in a different order than presented herein.
In order to facilitate understanding of the technical solutions provided in the embodiments of the present application, some key terms used in the embodiments of the present application are explained here:
Derivative: in this application, the process of importing data into a database, referred to as a 'derivative' for short.
Distributed file system (Distributed File System, DFS): a file system in which the managed physical storage resources are not necessarily attached directly to the local node but may be connected to it through a computer network, or a complete hierarchical file system formed by combining several different logical disk partitions or volume labels. A DFS provides a logical tree-structured file system for resources distributed anywhere on the network, making it convenient for users to access shared files distributed across the network. Each DFS shared folder serves as an access point relative to the other shared folders on the network. A typical DFS is HDFS (Hadoop Distributed File System).
Spark: a general-purpose in-memory parallel computing framework developed by the AMP (Algorithms, Machines, People) Lab at the University of California, Berkeley.
Structured query language (Structured Query Language, SQL): a special-purpose programming language, a database query and programming language, used for accessing data and for querying, updating and managing relational database systems.
Hive: a Hadoop-based data warehouse tool for data extraction, transformation and loading; it is an SQL component that can store, query and analyze large-scale data stored in Hadoop.
Sqoop: an open-source tool mainly used for data transfer between Hadoop and traditional databases.
Parquet file: a columnar storage file format commonly used in big data components such as Pig, Spark and Hive; the file suffix is .parquet.
Hash function (Hash): also called hashing; it transforms an input of arbitrary length into a fixed-length output through a hashing algorithm, the output being the hash value. This conversion is a compressive mapping, i.e., the space of hash values is typically much smaller than the space of inputs; different inputs may hash to the same output, so a unique input value cannot be determined from a hash value. In short, it is a function that compresses a message of arbitrary length into a message digest of fixed length, by which strings or other types of data that are difficult to compare can be mapped to integers.
Breakpoint resume: during downloading or uploading, the download or upload task (a file or a compressed package) is divided into several parts, and each part is uploaded or downloaded by its own thread; if a network fault is encountered, the parts that have already been uploaded or downloaded are retained, and transfer continues only for the unfinished parts instead of starting again from the beginning.
JSON (JavaScript Object Notation): a lightweight data exchange format.
The following is a brief description of the concept of the technical solution of the embodiments of the present application.
At present, when data results calculated by a big data technology stack such as Hive or Spark are provided to a downstream system, the derivative method adopted does not support breakpoint resume: when a derivative fails, all the data must first be deleted and then imported again from the beginning for the sake of data integrity, which wastes a great deal of time and makes the import efficiency extremely low.
Based on this, the embodiment of the application provides a method for importing data into a database. In the method, a target data file to be transmitted is subjected to slicing processing, and each fragment file is treated as one transmission task. When a transmission task is executed, the data reading position in the target fragment file is located according to the task offset of the task, the database import instructions of the batch are generated based on the data read from that position, the generated instructions are sent to the target database to import the data, and the task offset is updated at the same time so that subsequent executions can know which data has already been transmitted and thus locate the next data reading position. By recording the task offset, when a single transmission task fails or the whole database import task fails, the data reading position for retransmission can be located based on the task offset, realizing a breakpoint-resume function for the data, avoiding deleting the data and re-executing the transmission when a task fails, and improving the efficiency of data import.
In addition, the derivative process in the embodiment of the application is developed on the Spark big data technology stack and can import data into the target database in parallel and rapidly according to the user's configuration, improving data import efficiency and reducing the time required for data import.
In the following, some simple descriptions are provided for application scenarios applicable to the technical solutions of the embodiments of the present application, and it should be noted that the application scenarios described below are only used to illustrate the embodiments of the present application and are not limiting. In the specific implementation process, the technical scheme provided by the embodiment of the application can be flexibly applied according to actual needs.
The scheme provided by the embodiment of the application can be suitable for derivative scenes. As shown in fig. 1, an application scenario is schematically provided in an embodiment of the present application, where a data source device 101, a derivative device 102, and a target database 103 may be included in the scenario.
The data source device 101 is a device that provides the data source. It may be a database adopting any possible storage mode, used for storing the data results calculated by big data technology stacks such as Hive or Spark; for example, it may be the local storage database of a big data computing cluster, or a dedicated database for storing data results, or the like.
The derivative device 102 is the device that implements the derivative process in the embodiment of the present application. It may be a terminal device capable of implementing the derivative process, or a server device, for example an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and big data and artificial intelligence platforms, but is not limited thereto.
In one possible implementation, the data source device 101 and the derivative device 102 may be completely different devices, and the data source device 101 belongs to an external data source of the derivative device 102.
In a possible embodiment, the data source device 101 and the derivative device 102 may be implemented as the same device, i.e., the derivative device 102 itself is a distributed cluster capable of distributed big data calculation, and the calculation results it produces are stored on the device itself, so that the derivative device is its own data source; in this case the data source device 101 belongs to an internal data source of the derivative device 102.
The target database 103 is a target database for data import, and may be a database adopting any possible data structure, and is used for storing the calculation result of each big data component, and supplying the calculation result to a downstream system for use.
The derivative device 102 may include one or more processors, a memory, an I/O interface for interaction, and the like. In addition, the derivative device 102 may be configured with a database, which may be used to store, among other things, the task operation parameters required in the derivative process as well as the state data during task operation. The memory of the derivative device 102 may further store the program instructions required by the methods for importing data into a database provided in the embodiments of the present application; when executed by a processor, these program instructions can implement the process of importing data into a database described in the embodiments of the present application.
In practical application, when a database import instruction is initiated, the derivative device 102 may acquire the target data file to be transmitted from the data source device 101 and perform slicing processing on it, treating each fragment file as one transmission task. When a transmission task is executed, the data reading position in the target fragment file is located according to the task offset of the task, the database import instructions of the batch are generated based on the data read from that position, the generated instructions are sent to the target database to import the data, and the task offset is updated so that subsequent executions can know which data has already been transmitted and locate the next data reading position. By recording the task offset, when a single transmission task fails or the whole database import task fails, the data reading position for retransmission can be located based on the task offset, realizing breakpoint resume for the data, avoiding deleting the data and re-executing the transmission when a task fails, and improving the efficiency of data import.
In the embodiment of the present application, the data source device 101, the derivative device 102 and the target database 103 may be directly or indirectly connected through one or more networks. The network may be a wired network, or may be a Wireless network, for example, a mobile cellular network, or may be a Wireless-Fidelity (WIFI) network, or may be other possible networks, which is not limited in this embodiment of the present application.
As shown in fig. 2, a system architecture diagram for importing data into a database according to an embodiment of the present application may include a data source, a derivative tool, and a target database, where the derivative tool mainly includes a task data module, a data source reading module, a data slicing module, and a data parallel transmission module.
(1) Task data module
The task data module is used for storing the task operation parameters and the task state data during task operation. The task operation parameters are used for running the database import task and may include data source information, target database information, parallelism, the slicing algorithm, the fragment data directory, and the like; the task state data may include the processing state of each stage, the task offset of each fragment file, and the like.
The task data module runs through the whole life cycle of the database import task, recording which stage the task is in and the state and progress of each stage. In particular, it records the task offset of each fragment file in the data parallel transmission stage, which is the basis for realizing breakpoint resume: when a failed task is retried, execution can resume according to the recorded task data of each transmission task.
(2) Data source reading module
The data source reading module reads data from the data source through a big data technology stack such as Spark and stores the data on the HDFS. It supports internal data sources (i.e., data sources located in the big data cluster that performs the data transmission task) and external data sources. The internal data sources include Hive tables and HDFS files in Parquet format with a schema; the external data sources include the data tables of various relational databases. The data source reading module also supports executing custom SQL operations for related data processing.
The data source reading module sits upstream of the data slicing module and passes the data it reads to the data slicing module.
(3) Data slicing module
The data slicing module is used for receiving the data read by the data source reading module, dividing the data into individual fragment files according to the slicing algorithm, and supplying the fragment files to the data parallel transmission module for parallel transmission.
Specifically, taking ds_hdfs_path as the source data storage path, the module reads data from ds_hdfs_path and distributes the data onto different fragments through a fragmenter, thereby dividing the data into the specified number of fragment files.
(4) Data parallel transmission module
The data parallel transmission module is used for reading the fragment data files according to parameters such as parallelism and file format, writing the data into the downstream target database, and recording the offset or completion state of each fragment file during writing.
Specifically, the data parallel transmission module transmits the fragment data to the designated target database table. A fragment-file transmission task pool is generated according to the designated parallelism, so that several fragment files can be transmitted in batches (batch_size) and imported into the target database table simultaneously; in general, the parallelism should not be larger than the number of fragment files. After each batch is successfully committed, the task offset of the fragment file in the task data module is updated first, so that when a task fails, execution can resume from the task offset. After a fragment file has been fully transmitted, its transmission state is updated to successful, and after all fragment files have been transmitted successfully, the task state is updated to successful.
The method for importing data into a database according to the exemplary embodiment of the present application will be described below with reference to the accompanying drawings in conjunction with the application scenario and system architecture described above, and it should be noted that the application scenario is only shown for the convenience of understanding the spirit and principles of the present application, and embodiments of the present application are not limited in this respect.
Referring to fig. 3, a flow chart of a method for importing data into a database according to an embodiment of the present application is shown, and a specific implementation flow of the method is as follows:
step 301: and reading the target data file to be imported based on a preset source data storage path.
In this embodiment, the process of step 301 may be performed by the data source reading module shown in fig. 2. When the process of importing data into the database is performed, the source data storage path information may be read from the task data module to obtain the source data storage path, and each data file to be imported may then be read from the storage space corresponding to that path.
The process of importing data into the database may be performed after the big data calculation: once the calculation finishes, the calculation results may be stored into the preset source data storage path, so that the import process can read the data from that path. Alternatively, it may be executed alongside the big data calculation: the source data storage path may serve directly as the storage path of the big data calculation results, with the results stored into that path, or the results may be migrated to the source data storage path as they are produced during the calculation.
In one possible implementation, the process of data reading may be performed by a flow shown in fig. 4, which may be performed before step 301 or may be performed simultaneously with step 301, which is not limited in this embodiment of the present application. Referring to fig. 4, a flow chart of a data reading process according to an embodiment of the present application is shown.
Step 401: in response to a database import operation, obtaining the task operation parameters required by the database import task, wherein the task operation parameters comprise data source information and source data operation information.
In one possible implementation, after the big data calculation is completed, the user is notified that the calculation process has finished. When, based on this notification, the user confirms that the calculation results need to be imported into the target database, the user initiates a database import operation for these results and can configure the relevant task operation parameters at the same time.
In one possible implementation, the database import operation may also be performed before the big data calculation, that is, when configuring the big data calculation the user specifies that the calculation results should be imported into the target database once the calculation completes; in this case the data import operation can also be regarded as the operation that initiates the big data calculation.
Of course, the database import operation may be initiated in other cases, which is not limited in the embodiments of the present application.
In this embodiment of the present application, the task operation parameters are configured for the current database import task, and are used to indicate relevant task parameters of the current database import task, including but not limited to the following parameters:
(1) Task information, such as the task name and task status.
(2) Data source information, indicating information related to the data source, such as the data source type and the source data storage path; for example, the ds_hdfs_path field is used to indicate the source data storage path.
(3) Target database information, indicating information related to the target database into which the data is to be imported, such as the database type, data type, database storage path and database import instruction; the attribute information differs according to the type concerned.
(4) Parallelism, indicating the degree of concurrency of the transmission tasks when the database import task is executed; for example, when the parallelism is set to 10, transmission tasks can be executed by 10 data transmission threads at the same time, improving execution efficiency.
(5) Slicing algorithm, indicating the slicing algorithm adopted in the slicing processing.
(6) Fragment data directory, indicating the storage directory of the fragment files obtained by the slicing processing.
The task data in the embodiments of the present application may be described using any possible data structure, such as a Json structure; a sketch is given below.
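For intuition, the following is a minimal, illustrative sketch of what such a Json-structured task data record might look like. All field names and values here are assumptions for illustration (only ds_hdfs_path, batch_size and p_file_tks are names that appear elsewhere in this description), not the actual structure used by the scheme.

```json
{
  "task_name": "orders_daily_import",
  "task_status": "RUNNING",
  "data_source": { "type": "internal", "ds_hdfs_path": "/tmp/derivative/ds_hdfs_path" },
  "target_db": { "type": "mysql", "table": "t_orders" },
  "parallelism": 3,
  "batch_size": 1000,
  "slicing_algorithm": "hash",
  "fragment_dir": "/tmp/derivative/shards",
  "p_file_tks": [
    { "file": "shard_id=0", "offset": 52000, "status": "RUNNING" },
    { "file": "shard_id=1", "offset": 0, "status": "PENDING" }
  ]
}
```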
Step 402: based on the data source information, a data source type of the data source to be imported is determined.
Step 403: and if the data source type is the internal data source, loading the internal data source information.
In this embodiment, when the data source is of a different type, the execution logic may differ and the parameter configuration may differ correspondingly. Therefore, in practical application, in order to avoid loading too many parameters, the corresponding data source information may be loaded selectively according to the data source type once the type has been determined.
Step 404: based on the source data operation information, it is determined whether a database operation is required for the source data.
In the embodiment of the present application, database operations are supported in the process of importing data into the database; for example, an SQL statement may be predefined, and when the data is imported, this SQL may first be executed on the data to implement the corresponding data processing. Therefore, before placing the data in the specified source data storage path, it is also necessary to determine whether an SQL statement has been specified.
Step 405: if the determination in step 404 is no, that is, it is determined that no database operation is required for the source data, the storage path of the internal data source is updated to the source data storage path.
Specifically, if no SQL is specified, no SQL needs to be executed. Since the data source is an internal data source, the data can be read directly; to reduce data migration, the storage path of the internal data source can simply be taken as the source data storage path, so that the data is read straight from the internal data source. For example, the ds_hdfs_path field is updated to the storage path of the internal data source.
Step 406: if the determination in step 404 is yes, that is, it is determined that the database operation needs to be performed on the source data, after the source data is read from the internal data source and the database operation is performed, the source data after the operation is stored in the internal data source, and the storage path of the internal data source is updated to be the source data storage path.
Specifically, if it is determined that an SQL statement is specified, the SQL is executed first, the result is saved in the internal data source, and the storage path of the internal data source is then updated to be the source data storage path. Alternatively, a new storage path is created in internal storage for the data produced by executing the SQL, and that path is taken as the source data storage path; for example, the data produced by executing the SQL is saved to the path indicated by ds_hdfs_path, as in the sketch below.
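As a concrete illustration of this branch, the following is a minimal Spark (Scala) sketch, assuming the internal source is a Hive table in the same cluster; the SQL text, paths and application name are illustrative assumptions rather than part of the original scheme.

```scala
import org.apache.spark.sql.SparkSession

object InternalSourceWithSql {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("derivative-internal-source")
      .enableHiveSupport() // internal source: a Hive table in the same cluster
      .getOrCreate()

    // Hypothetical predefined SQL taken from the task operation parameters.
    val customSql = "SELECT id, amount FROM dw.orders WHERE dt = '2023-03-01'"

    // Execute the specified SQL, then save the result under the new path that
    // becomes the source data storage path (the ds_hdfs_path field).
    val dsHdfsPath = "hdfs:///tmp/derivative/ds_hdfs_path"
    spark.sql(customSql).write.mode("overwrite").parquet(dsHdfsPath)

    spark.stop()
  }
}
```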
Step 407: and if the data source type is an external data source, loading external data source information.
Step 408: based on the source data operation information, it is determined whether a database operation is required for the source data.
Step 409: if the determination in step 408 is no, that is, it is determined that no database operation is required for the source data, the source data is read from the external data source and stored in the source data storage path.
Similarly, if the specified sql operation is not required, the read data may be directly saved to the path indicated by ds_hdfs_path.
Step 410: if the determination in step 408 is yes, that is, it is determined that a database operation needs to be performed on the source data, the source data is read from the external data source, the database operation is executed, and the operated source data is then stored into the source data storage path.
Similarly, if the specified sql operation needs to be performed, the sql is performed and then the data is saved to the path indicated by ds_hdfs_path.
In this embodiment of the present application, the storage of source data may be implemented in conjunction with a distributed file system such as HDFS, i.e., the source data storage path may be a storage path in the HDFS. When data is read, the source data can then be read from the external data source by the big data calculation engine and stored into the HDFS. For example, the data files may be read from the data source with Spark or the Sqoop tool and stored in the HDFS, for instance under the ds_hdfs_path of the HDFS; in the subsequent database import process, each data file to be imported can then be read from ds_hdfs_path. A sketch of this reading step follows.
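The following is a minimal sketch of this step using Spark's built-in JDBC reader, assuming an external MySQL source; all connection details, table and path names are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object ExternalSourceReader {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("derivative-external-source").getOrCreate()

    // Read the source table from the external relational database (placeholders).
    val sourceDf = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://source-host:3306/src_db")
      .option("dbtable", "src_table")
      .option("user", "reader")
      .option("password", "***")
      .load()

    // Store the data into the distributed file system at the source data storage path.
    sourceDf.write.mode("overwrite").parquet("hdfs:///tmp/derivative/ds_hdfs_path")

    spark.stop()
  }
}
```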
Before the import task is performed, the configuration parameters related to the data source reading module need to be set, wherein the configuration parameters may include one or more of the following:
(1) The data source type: from this the type of the data source can be judged; the attribute information of the data source differs according to the data source type.
(2) Indication information on whether to execute a specified SQL statement: from this it can be judged whether SQL has been specified.
(3) The source data storage path, used for storing the data files to be imported.
Reference is now made to fig. 3.
Step 302: performing slicing processing on the target data file, and updating the transmission task list based on the plurality of fragment files thus obtained; in the transmission task list, each transmission task corresponds to one fragment file.
In an embodiment of the present application, the process of step 302 may be performed by the data slicing module shown in fig. 2.
In order to improve the parallelism of data in the database import process, in the embodiment of the application a single data file is split into multiple fragment files, which can then be transmitted simultaneously, improving the efficiency of data import. After the target data file is read from the source data storage path, the data it contains can be distributed onto different fragments through the fragmenter, so that the target data file is divided into the specified number of fragment files for parallel transmission by the subsequent data parallel transmission module.
Specifically, the task operation parameters described above may include the slicing algorithm required for the slicing processing, and may further specify the reference field required for slicing, i.e., the field on which the fragment division is based when slicing is performed. Reference is made to fig. 5, which is a schematic flow chart of the slicing process provided in an embodiment of the present application.
Step 501: the target data file is read from the source data storage path.
Step 502: it is determined whether a reference field required for the fragmentation process is specified.
Before the import task is performed, the configuration parameters related to the data slicing module need to be set; these parameters may be stored in the task data module in a unified manner. The configuration parameters may include one or more of the following:
(1) The field on which slicing depends: from this it is judged whether a reference field is specified; if so, the specified reference field is read from it, and if not, a default value is used.
(2) The number of fragments, indicating into how many fragment data files one data file should be split.
(3) The fragment data file storage directory, indicating the storage directory for the fragment data files after slicing.
Step 503: if the result of step 502 is yes, that is, the reference field required for the slicing process is specified, the slicing reference value is determined based on the value of the reference field in the target data file.
In one possible implementation, the slicing reference value may be a hash value: a hash calculation is performed on the value of the reference field in the target data file, and the resulting hash value of the reference field is used as the slicing reference value.
Of course, other digest algorithms may be used to obtain the slice reference value, which is not limited in this embodiment of the present application.
Step 504: if the result of step 502 is no, that is, the reference field required for the slicing process is not specified, the slicing reference value is determined based on the value of the primary key in the target data file.
For example, if the first column is the primary key, the first column field is used as the fragment field, the hash calculation is performed on the value of the first column field, and the obtained hash value is used as the fragment reference value.
Step 505: and determining the slicing dividing point when the target data file is subjected to slicing processing based on the slicing reference value and the designated slicing quantity.
Specifically, if the length of the slicing reference value is fixed, its value range can be divided into multiple sections according to the designated number of fragments; the dividing point between two adjacent sections is a fragment dividing point, and each section corresponds to one part of the target data file, thereby achieving the purpose of slicing.
In one possible implementation, the fragment division may be performed by taking a modulus, i.e., computing the slicing reference value modulo the specified number of fragments to obtain the fragment to which each piece of data in the target data file belongs.
Step 506: and based on the slicing dividing points, slicing the target data file to correspondingly obtain a plurality of slicing files.
Specifically, based on the dividing points, it is possible to know which pieces of data belong to the same fragment; those pieces of data are then stored into the same fragment file, so that each fragment file corresponding to the target data file is obtained, corresponding transmission tasks are generated based on the fragment files, and the transmission tasks are updated into the transmission task list.
In practical application, a fragment file directory and the transmission task list can be configured: the fragment files obtained after slicing are stored under the fragment file directory, and correspondingly, at transmission time the fragment file directory is traversed so that a corresponding transmission task is generated for each fragment file and added to the transmission task list. A sketch of the slicing step is given below.
Reference is now made to fig. 3.
Step 303: cyclically traversing the transmission task list until the transmission task list is empty, reading one transmission task per traversal, and importing the target fragment file corresponding to the transmission task into the target database in batches, wherein the importing process of each batch comprises the following steps:
S3031: locating the data reading position in the target fragment file according to the corresponding task offset, and generating the database import instructions of the batch based on the data read from that position; the task offset is used for indicating the data that has already been imported in the target fragment file.
S3032: sending the generated database import instructions to the target database, and updating the task offset.
Specifically, in order to improve the efficiency of data import, the fragment files may be transmitted in parallel. The transmission tasks to be executed can therefore be stored in a task execution pool, so that whenever a transmission thread is idle it can take a transmission task out of the pool and execute it. Thus, for each transmission task in the transmission task list, one transmission task is read per traversal and added to the task execution pool; an idle target thread among the plurality of parallel transmission threads can then read a transmission task to be executed from the task execution pool and import the target fragment file corresponding to that task into the target database in batches, as sketched below.
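The following is a minimal sketch of this task-pool pattern, using a fixed-size thread pool whose size equals the configured parallelism; the ShardTask structure, the paths and the importShard placeholder are illustrative assumptions (a fuller batch-import sketch is given further below).

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, Executors, TimeUnit}

object ParallelTransmitter {
  // One transmission task per fragment file (illustrative structure).
  case class ShardTask(path: String, var offset: Long)

  // Placeholder for the per-fragment batch import of S3031/S3032.
  def importShard(task: ShardTask): Unit =
    println(s"importing ${task.path} from offset ${task.offset}")

  def main(args: Array[String]): Unit = {
    val taskQueue = new ConcurrentLinkedQueue[ShardTask]()
    Seq("shard_id=0", "shard_id=1", "shard_id=2", "shard_id=3")
      .foreach(p => taskQueue.add(ShardTask(s"hdfs:///tmp/derivative/shards/$p", 0L)))

    val parallelism = 3 // configured parallelism; should not exceed the shard count
    val pool = Executors.newFixedThreadPool(parallelism)

    (1 to parallelism).foreach { _ =>
      pool.submit(new Runnable {
        override def run(): Unit = {
          var task = taskQueue.poll()
          while (task != null) { // an idle thread keeps pulling tasks from the pool
            importShard(task)
            task = taskQueue.poll()
          }
        }
      })
    }

    pool.shutdown()
    pool.awaitTermination(1, TimeUnit.HOURS)
  }
}
```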
Before the importing task is performed, configuration parameter settings related to the data parallel transmission module are required to be performed, and the parameters can also be uniformly stored in the task data module, wherein the configuration parameters can comprise one or more of the following parameters:
(1) Parallelism, indicating the degree of parallelism in data transmission; for example, with the parallelism configured as 3, data can be imported through 3 transmission threads at the same time.
(2) Batch size (batch_size), indicating the number of instruction records transferred per batch; for example, setting batch_size to 1000 means that a database transaction is committed once the number of instruction records in the batch reaches 1000.
Each transmission task has corresponding state data, and these may also be stored in the task data module in a unified manner as the fragment-file transmission task list; for each fragment data file, the list may include the file path, the task offset (indicating how much of the fragment data file has been transmitted) and the transmission state.
State data is configured correspondingly for each transmission task added to the transmission task list, indicating the file path, task offset and transmission state of the fragment file targeted by the task. Before a transmission task has started, its task offset is zero; the offset is then updated as transmission actually proceeds. Thus, even if the import task encounters an exception and the data must be transmitted again, transmission does not restart from the beginning: the position of the already-transmitted data is located according to the task offset and transmission continues from there. When all data of a fragment file has been imported successfully, its transmission state is set to successful.
In the embodiment of the present application, a target data file to be transmitted is subjected to slicing processing, and each fragment file is treated as one transmission task. When a transmission task is executed, the data reading position in the target fragment file is located according to the task offset of the task, the database import instructions of the batch are generated based on the data read from that position, the generated instructions are sent to the target database to import the data, and the task offset is updated at the same time so that subsequent executions can know which data has already been transmitted and locate the next data reading position. By recording the task offset, when a single transmission task fails or the whole database import task fails, the data reading position for retransmission can be located based on the task offset, realizing breakpoint resume for the data, avoiding deleting the data and re-executing the transmission when a task fails, and improving the efficiency of data import.
The process of step 303 may be performed by the parallel transmission module shown in fig. 2.
The following describes the transmission process of the data parallel transmission module in detail. Fig. 6 is a schematic flow chart of a data transmission process according to an embodiment of the present application.
Step 601: loading the task data information.
Step 602: confirming whether the database import task is running for the first time.
Specifically, the distinction between a first run and a non-first run is whether the database import task is being executed for the first time; if the task previously failed and is now being executed again, it is not a first run.
Step 603: if the result of step 602 is yes, that is, the database import task is running for the first time, the fragment file directory is traversed, the transmission task list is generated, and the task state data is updated.
Specifically, when the database import task runs for the first time, the corresponding transmission task list needs to be created: a corresponding p_file_tks entry is generated from the fragment file directory for each fragment file obtained by the slicing processing, and the task state data is updated.
Step 604: if the result of step 602 is no, or after the transmission task list has been generated, the incomplete transmission tasks can be read from the transmission task list.
In the embodiment of the application, a task may fail during execution of the database import task; if a task failure is detected, a retransmission attempt may be made automatically. Alternatively, a task failure indication is output to the user, who can choose whether to retransmit according to the indication; if retransmission is required, the user triggers a retransmission operation, and correspondingly the device responds to the retransmission instruction initiated for the task failure indication. In this case the task is not being executed for the first time: transmission tasks are read from the transmission task list, the data reading position in the target fragment file is located based on the task offset corresponding to each transmission task, and the transmission task continues to execute from that position. A sketch of this retry path is given below.
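A minimal sketch of this retry path is given here, under the assumption that the state record of each transmission task (file path, task offset, transmission state) has been persisted by the task data module; the record structure and status strings are illustrative.

```scala
object RetryDriver {
  // Illustrative per-fragment state record, as persisted by the task data module.
  case class TaskRecord(path: String, offset: Long, status: String)

  // On retry, only the incomplete transmission tasks are re-read from the task
  // list and put back into the task execution pool; each resumes from its
  // recorded task offset instead of from the start of the fragment file.
  def resumeFailedRun(taskList: Seq[TaskRecord],
                      submitToPool: TaskRecord => Unit): Unit =
    taskList.filter(_.status != "SUCCESS").foreach(submitToPool)
}
```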
Step 605: putting the read transmission tasks into the task execution pool, so that each transmission thread can execute them and realize the import of the corresponding fragment files.
Step 606: when a transmission thread executes a transmission task, the data reading position in the corresponding target fragment file is positioned according to the task offset of the transmission task.
Step 607: it is determined whether the data reading position is the end of the target fragment file.
Step 608: if the result of step 607 is no, that is, the data reading position is not the end, data is read sequentially starting from the data reading position, and each time a piece of data is read, a database import instruction of the batch is generated based on the read data.
Specifically, the process of database import is essentially inserting data into the target database, so the database import instructions may be insert statements targeting the database table of the target database.
Step 609: it is judged whether the number of generated database import instructions has reached the upper limit value for the batch.
Step 610: if the result of step 609 is yes, the generated database import instructions are sent to the target database, and the task offset is updated.
Specifically, the generated database import instructions are packed to form one database transaction, and the database transaction is committed to the target database.
If the result of step 609 is no, execution jumps to step 608; that is, database import instructions continue to be generated until the number of generated database import instructions reaches the upper limit value for the batch.
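Packing a full batch into one transaction, as in step 610, could look as follows over JDBC; the connection URL and the use of statement batching are assumptions about the target database's interface:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

class BatchCommitter {

    // Pack one batch of generated INSERT statements into a single database
    // transaction and commit it to the target database in one round trip.
    static void commitBatch(String jdbcUrl, List<String> insertStatements)
            throws SQLException {
        try (Connection conn = DriverManager.getConnection(jdbcUrl)) {
            conn.setAutoCommit(false);            // open an explicit transaction
            try (Statement stmt = conn.createStatement()) {
                for (String sql : insertStatements) {
                    stmt.addBatch(sql);           // accumulate the batch
                }
                stmt.executeBatch();
                conn.commit();                    // the whole batch succeeds, or...
            } catch (SQLException e) {
                conn.rollback();                  // ...is rolled back as a unit
                throw e;
            }
        }
    }
}
```

Committing each batch as one transaction is what makes the subsequent task offset update meaningful: either the whole batch lands in the target database or none of it does.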
Step 611: if the result of step 607 is yes, that is, the data reading position is at the end of the file, it is determined whether the updated transmission task list still contains incomplete transmission tasks.
If the result of step 611 is yes, that is, an incomplete transmission task exists, execution jumps to step 604; that is, a new transmission task continues to be read, and the new transmission task is added to the task execution pool.
Step 612: if the result of step 611 is no, that is, all transmission tasks have been completed, the task state of the database import task is updated to a successful state, and the database import task ends.
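Putting steps 606 to 610 together, a single transmission thread's per-fragment loop might be sketched as below, reusing the hypothetical helpers above; treating the task offset as a byte position within the fragment file is an assumption. Note that the sketch persists the offset after each commit, which leaves a small replay window on a crash; the embodiment does not specify how the commit and the offset update are coordinated.

```java
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

class FragmentImportWorker {

    // Import one fragment file in batches, resuming from the recorded task offset.
    static void importFragment(String fragmentPath, long taskOffset,
                               int batchLimit, String jdbcUrl) throws Exception {
        try (RandomAccessFile fragment = new RandomAccessFile(fragmentPath, "r")) {
            fragment.seek(taskOffset);                     // step 606: locate the read position
            List<String> batch = new ArrayList<>();
            String line;
            while ((line = fragment.readLine()) != null) { // step 607: not yet at the end
                batch.add(ImportInstructionBuilder.toInsertStatement(line)); // step 608
                if (batch.size() >= batchLimit) {          // step 609: batch limit reached
                    BatchCommitter.commitBatch(jdbcUrl, batch);              // step 610
                    persistOffset(fragmentPath, fragment.getFilePointer());  // update task offset
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {                        // flush the final partial batch
                BatchCommitter.commitBatch(jdbcUrl, batch);
                persistOffset(fragmentPath, fragment.getFilePointer());
            }
        }
    }

    static void persistOffset(String taskId, long offset) {
        // would write the new offset back to the p_file_tks task table (assumption)
    }
}
```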
In summary, the embodiment of the application provides a scheme, developed on the Spark big data technology stack, for importing data in large batches into a relational database table. The scheme can import data into the target database in parallel and at speed according to user configuration, supports breakpoint resume and re-running after failure, and greatly improves retry efficiency when a large-volume database import fails. Through parameterized configuration, the simplified import flow hides the complexity of big data and distributed technology, so that business requirements can be responded to quickly.
Referring to fig. 7, based on the same inventive concept, an apparatus 70 for importing data into a database is further provided, and the apparatus includes:
a data reading unit 701, configured to read a target data file to be imported based on a preset source data storage path;
a slicing processing unit 702, configured to perform slicing processing on the target data file, and update the transmission task list based on the obtained multiple slicing files; in the transmission task list, each transmission task corresponds to a fragment file;
a parallel transmission unit 703, configured to cycle through the transmission task list until the transmission task list is empty; reading a transmission task every time, and importing target fragment files corresponding to the transmission task into a target database in batches, wherein the importing process of each batch comprises the following steps:
positioning to a data reading position in the target fragment file according to the corresponding task offset, and generating a database importing instruction of the batch based on the data read at the data reading position; the task offset is used for indicating the data which has already been imported in the target fragment file;
and sending the generated database import instruction to a target database, and updating the task offset.
In a possible implementation, the data reading unit 701 is further configured to:
responding to the database importing operation, and obtaining task operation parameters required by the database importing task, wherein the task operation parameters comprise data source information and source data operation information;
determining the data source type of the data source to be imported based on the data source information;
if the data source type is an internal data source and it is determined, based on the source data operation information, that no database operation is required for the source data, updating the storage path of the internal data source to be the source data storage path;
if the data source type is an external data source and it is determined, based on the source data operation information, that no database operation is required for the source data, reading the source data from the external data source and storing the source data into the source data storage path.
In a possible implementation, the data reading unit 701 is further configured to:
if the data source type is an internal data source and it is determined, based on the source data operation information, that a database operation needs to be performed on the source data, after the source data is read from the internal data source and the database operation is executed, storing the operated source data into the internal data source, and updating the storage path of the internal data source to be the source data storage path;
if the data source type is an external data source and it is determined, based on the source data operation information, that a database operation is required for the source data, reading the source data from the external data source, executing the database operation, and then storing the operated source data into the source data storage path.
In one possible implementation, the source data storage path indicates a storage path of the distributed file system; the data reading unit 701 is specifically configured to:
source data is read from an external data source by a big data calculation engine and stored in a distributed file system.
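As a sketch of this step with Spark as the big data calculation engine (the JDBC source options and the HDFS output path are placeholders, not values fixed by the embodiment):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

class SourceLoader {

    // Read source data from an external JDBC data source through Spark and
    // land it in the distributed file system under the source data storage path.
    static void loadToHdfs(String jdbcUrl, String table, String hdfsPath) {
        SparkSession spark = SparkSession.builder()
                .appName("db-import-source-load")
                .getOrCreate();
        Dataset<Row> source = spark.read()
                .format("jdbc")
                .option("url", jdbcUrl)       // external data source location
                .option("dbtable", table)     // table to extract (assumption)
                .load();
        // CSV output is an assumption; any HDFS-backed format would serve.
        source.write().mode("overwrite").csv(hdfsPath);
    }
}
```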
In a possible implementation manner, the slice processing unit 702 is specifically configured to:
if the reference field required by the slicing process is specified, determining a slicing reference value based on the value of the reference field in the target data file;
if the reference field required by the slicing process is not specified, determining a slicing reference value based on the value of the primary key in the target data file;
determining a slicing dividing point when the target data file is subjected to slicing processing based on the slicing reference value and the designated slicing quantity;
and based on the slicing dividing points, slicing the target data file to correspondingly obtain a plurality of slicing files.
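The split-point computation sketched below assumes a numeric slicing reference value divided into equal-width ranges; the embodiment does not fix the splitting rule, so this is one plausible reading:

```java
import java.util.ArrayList;
import java.util.List;

class FragmentSplitter {

    // Derive slicing dividing points from the min/max of the slicing reference
    // value (taken from the designated reference field, or the primary key) and
    // the specified shard count, using equal-width ranges as an assumption.
    static List<Long> splitPoints(long minRef, long maxRef, int shardCount) {
        List<Long> points = new ArrayList<>();
        long width = Math.max(1, (maxRef - minRef + 1) / shardCount);
        for (int i = 1; i < shardCount; i++) {
            points.add(minRef + i * width);   // boundary between fragment i-1 and fragment i
        }
        return points;
    }
}
```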
In a possible implementation, the parallel transmission unit 703 is specifically configured to:
reading one transmission task in each traversal, and adding the transmission task into a task execution pool;
and reading the transmission task to be executed from the task execution pool through an idle target thread among the plurality of parallel transmission threads, and importing the target fragment file corresponding to the transmission task into the target database in batches.
In a possible implementation, the parallel transmission unit 703 is specifically configured to:
positioning to a data reading position in the target fragment file according to the corresponding task offset;
determining whether the data reading position is at the end of the target fragment file;
if it is not the end, sequentially reading data from the data reading position, and generating a database importing instruction of the batch each time a piece of data is read;
and the sending of the generated database import instruction to the target database includes:
and if the number of the generated database import instructions reaches the upper limit value of the number of the batches, sending the generated database import instructions to the target database.
In a possible implementation, the parallel transmission unit 703 is specifically configured to:
if the data reading position is at the end of the target fragment file, determining whether the updated transmission task list has incomplete transmission tasks;
if no incomplete transmission task exists, updating the task state of the database import task to a successful state;
if an incomplete transmission task exists, continuing to read a new transmission task, and adding the new transmission task to the task execution pool.
In a possible implementation, the parallel transmission unit 703 is further configured to:
if the database importing task fails, outputting a task failure indication;
responding to a retransmission indication initiated for the task failure indication, and continuing to read the transmission task from the transmission task list;
and positioning to a data reading position in the target fragment file based on the task offset corresponding to the transmission task, and continuously executing the transmission task based on the data reading position.
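Resume after failure then reduces to restarting the same per-fragment worker from the persisted offset, reusing the hypothetical types sketched above:

```java
class Retransmitter {

    // On a retransmission indication, re-read the failed task's persisted offset
    // and resume the import from that position; committed batches are not redone.
    static void retransmit(TransferTask task, int batchLimit, String jdbcUrl) throws Exception {
        FragmentImportWorker.importFragment(task.fragmentPath(), task.taskOffset(),
                                            batchLimit, jdbcUrl);
    }
}
```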
With the above apparatus, by recording the task offset, when a single transmission task fails or the whole database import task fails, the data reading position for retransmission can be located based on the task offset. This implements breakpoint resume for the data, avoids having to delete data and re-execute the transmission when a task fails, and improves the efficiency of data import.
The apparatus may be used to perform the methods shown in the embodiments of the present application, so the descriptions of the foregoing embodiments may be referred to for the functions that can be implemented by each functional module of the apparatus, and are not repeated.
Referring to fig. 8, based on the same technical concept, the embodiment of the application further provides a computer device. In one embodiment, the computer device may be the server shown in fig. 1 or the cloud-side device shown in fig. 2; as shown in fig. 8, the computer device includes a memory 801, a communication module 803, and one or more processors 802.
A memory 801 for storing a computer program for execution by the processor 802. The memory 801 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, a program required for running an instant communication function, and the like; the storage data area can store various instant messaging information, operation instruction sets and the like.
The memory 801 may be a volatile memory, such as a random-access memory (RAM); the memory 801 may also be a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); the memory 801 may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, without being limited thereto. The memory 801 may also be a combination of the above memories.
The processor 802 may include one or more central processing units (central processing unit, CPU) or digital processing units, etc. A processor 802 for implementing the above-described method of importing data into a database when calling a computer program stored in the memory 801.
The communication module 803 is used for communicating with a terminal device and other servers.
The embodiments of the present application do not limit the specific connection medium between the memory 801, the communication module 803, and the processor 802. In the embodiment of the present application, the memory 801 and the processor 802 are connected through the bus 804 in fig. 8, where the bus 804 is depicted with a bold line; the connection manner between the other components is merely illustrative and not limiting. The bus 804 may be divided into an address bus, a data bus, a control bus, and the like. For ease of description, only one bold line is depicted in fig. 8, but this does not mean that there is only one bus or only one type of bus.
The memory 801, as a computer storage medium, stores computer-executable instructions for implementing the method for importing data into a database according to the embodiments of the present application, and the processor 802 is configured to perform the method for importing data into a database described in the foregoing embodiments.
Based on the same inventive concept, the present embodiments also provide a storage medium storing a computer program, which when run on a computer, causes the computer to perform the steps in the method of importing data into a database according to various exemplary embodiments of the present application described above in the present specification.
In some possible embodiments, aspects of the method for importing data into a database provided herein may also be implemented in the form of a computer program product, which includes a computer program that, when the program product is run on a computer device, causes the computer device to perform the steps of the method for importing data into a database according to the various exemplary embodiments of the present application described herein; for example, the computer device may perform the steps of the embodiments above.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of the embodiments of the present application may employ a portable compact disc read-only memory (CD-ROM) containing a computer program, and may run on a computer device. However, the program product of the present application is not limited thereto; in the present application, a readable storage medium may be any tangible medium that can contain or store a program, including a computer program, for use by or in connection with a command execution system, apparatus, or device.
The readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, in which a computer-readable program is embodied. Such a propagated data signal may take any of a variety of forms, including, but not limited to, an electromagnetic signal, an optical signal, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with a command execution system, apparatus, or device.
A computer program embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer programs for performing the operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages.
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, the features and functions of two or more of the elements described above may be embodied in one element in accordance with embodiments of the present application. Conversely, the features and functions of one unit described above may be further divided into a plurality of units to be embodied.
Furthermore, although the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (13)

1. A method of importing data into a database, the method comprising:
reading a target data file to be imported based on a preset source data storage path;
performing slicing processing on the target data file, and updating a transmission task list based on the acquired plurality of slicing files; in the transmission task list, each transmission task corresponds to a fragment file;
circularly traversing the transmission task list until the transmission task list is empty; reading a transmission task every time, and importing target fragment files corresponding to the transmission task into a target database in batches, wherein the importing process of each batch comprises the following steps:
positioning to a data reading position in the target fragment file according to the corresponding task offset, and generating a database importing instruction of the batch based on the data read at the data reading position; the task offset is used for indicating the data which has already been imported in the target fragment file;
and sending the generated database import instruction to the target database, and updating the task offset.
2. The method of claim 1, wherein prior to reading the target data file to be imported based on the preset source data storage path, the method further comprises:
responding to a database importing operation, and obtaining task operation parameters required by the database import task, wherein the task operation parameters comprise data source information and source data operation information;
determining the data source type of the data source to be imported based on the data source information;
if the data source type is an internal data source and it is determined, based on the source data operation information, that no database operation is required for the source data, updating the storage path of the internal data source to be the source data storage path;
and if the data source type is an external data source and it is determined, based on the source data operation information, that no database operation is required for the source data, reading the source data from the external data source and storing the source data into the source data storage path.
3. The method of claim 2, wherein after the determining the data source type of the data source to be imported based on the data source information, the method further comprises:
if the data source type is an internal data source and it is determined, based on the source data operation information, that a database operation needs to be performed on the source data, after the source data is read from the internal data source and the database operation is executed, storing the operated source data into the internal data source, and updating the storage path of the internal data source to be the source data storage path;
if the data source type is an external data source and it is determined, based on the source data operation information, that a database operation is required for the source data, reading the source data from the external data source, executing the database operation, and then storing the operated source data into the source data storage path.
4. The method of claim 2, wherein the source data storage path indicates a storage path of a distributed file system;
the reading of source data from an external data source and storing to the source data storage path comprises:
and reading source data from the external data source through a big data calculation engine and storing the source data into the distributed file system.
5. The method of claim 1, wherein the fragmenting the target data file comprises:
if the reference field required by the slicing process is specified, determining a slicing reference value based on the value of the reference field in the target data file;
if the reference field required by the slicing process is not specified, determining the slicing reference value based on the value of the primary key in the target data file;
determining a slicing dividing point when the target data file is subjected to slicing processing based on the slicing reference value and the designated slicing quantity;
And based on the slicing dividing points, slicing the target data file to correspondingly obtain a plurality of slicing files.
6. The method according to any one of claims 1 to 5, wherein reading one transmission task at each traversal, and importing the target fragment file corresponding to the transmission task into the target database in batches, includes:
reading one transmission task at each traversal, and adding the transmission task to a task execution pool;
and reading the transmission task to be executed from the task execution pool through an idle target thread among a plurality of parallel transmission threads, and importing the target fragment file corresponding to the transmission task into the target database in batches.
7. The method of any one of claims 1 to 5, wherein locating the data reading position in the target fragment file according to the corresponding task offset and generating the database import instruction of the batch based on the data read at the data reading position includes:
positioning to a data reading position in the target fragment file according to the corresponding task offset;
determining whether the data reading position is at the end of the target fragment file;
if not, sequentially reading data from the data reading position, and generating a database importing instruction of the batch each time a piece of data is read;
transmitting the generated database import instruction to the target database, including:
and if the number of the generated database import instructions reaches the upper limit value of the number of the batches, transmitting the generated database import instructions to the target database.
8. The method of claim 7, wherein after determining whether the data reading position is at the end of the target fragment file, the method further comprises:
if the data reading position is at the end of the target fragment file, determining whether the updated transmission task list has incomplete transmission tasks;
if no incomplete transmission task exists, updating the task state of the database import task to a successful state;
if an incomplete transmission task exists, continuing to read a new transmission task, and adding the new transmission task to the task execution pool.
9. The method of any one of claims 1-5, wherein after sending the generated database import instruction to the target database and updating the task offset, the method further comprises:
If the database importing task fails, outputting a task failure indication;
responding to a retransmission instruction initiated for the task failure instruction, and continuing to read a transmission task from the transmission task list;
and positioning to a data reading position in the target fragment file based on the task offset corresponding to the transmission task, and continuously executing the transmission task based on the data reading position.
10. An apparatus for importing data into a database, the apparatus comprising:
the data reading unit is used for reading a target data file to be imported based on a preset source data storage path;
the slicing processing unit is used for carrying out slicing processing on the target data file and updating a transmission task list based on the acquired plurality of slicing files; in the transmission task list, each transmission task corresponds to a fragment file;
the parallel transmission unit is used for circularly traversing the transmission task list until the transmission task list is empty; reading a transmission task every time, and importing target fragment files corresponding to the transmission task into a target database in batches, wherein the importing process of each batch comprises the following steps:
positioning to a data reading position in the target fragment file according to the corresponding task offset, and generating a database importing instruction of the batch based on the data read at the data reading position; the task offset is used for indicating the data which has already been imported in the target fragment file;
and sending the generated database import instruction to the target database, and updating the task offset.
11. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that,
the processor, when executing the computer program, implements the steps of the method of any one of claims 1 to 9.
12. A computer storage medium having a computer program stored thereon, characterized in that,
the computer program implementing the steps of the method of any one of claims 1 to 9 when executed by a processor.
13. A computer program product comprising a computer program, characterized in that,
the computer program implementing the steps of the method of any one of claims 1 to 9 when executed by a processor.