Data synchronization method and device capable of realizing rapid loading of data
[ technical field ]
The invention relates to the technical field of data synchronization, in particular to a data synchronization method and device capable of realizing rapid loading of data.
[ background of the invention ]
Currently, a typical data synchronization process generally includes two steps: first, reading data from a data source and putting it into a cache (data reading); second, reading data from the cache and loading it into a destination database (data loading). Assuming that the source end is a relational database, a typical data synchronization process is shown in fig. 1, and the data loading process generally calls a JDBC (Java Database Connectivity) or ODBC (Open Database Connectivity) interface to access the destination database. JDBC and ODBC are generic database programming interfaces: they provide a complete set of programming interfaces independent of any particular database, so that an application does not need to be bound to a different programming interface for each database. When a database vendor releases a database, it simultaneously releases a driver conforming to the JDBC and ODBC standards so that application programs can access the database through that driver. Since JDBC and ODBC are common standards, the operation flow when loading data through a JDBC or ODBC interface is generally fixed, as shown in fig. 2.
The data loading flow shown in fig. 2 is unproblematic for general database applications, but its performance is often insufficient when loading large data volumes, because JDBC and ODBC interfaces, to remain universal, generally do not support the batch data loading commands specific to individual database products, which makes fast data loading difficult. To achieve fast loading, a batch data loading command specific to the destination database can be used instead. Such database-specific batch loading commands are generally provided for database administrators to rapidly import and export data directly, and their internals are heavily optimized; compared with the generic interfaces, they are several times to dozens of times faster than loading through JDBC or ODBC, and they support fast batch loading of data. Their disadvantage is that the data must already exist on disk as files in a specified format, and the batch loading commands and required data formats differ between destination databases, so there is no universality. The following is an example of a batch data load command for a MySQL database:
load data infile "/data/mysql/test.txt" into table t1 fields terminated by ',';
This command loads the data in the /data/mysql/test.txt file into table t1; test.txt is a text file whose fields are separated by commas. Current common data integration tools support this type of bulk loading only if the data file to be loaded already exists on disk; they cannot bulk-load data directly from the data source into the destination table. If a user wants to use batch loading, the user must first export the source table data into a file and then load the data with a batch loading command, and the two steps must be executed serially, which is cumbersome, inefficient, and costly in performance.
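The serial two-step flow criticized here can be sketched as follows. This is a minimal Python stand-in: the file read in `bulk_load` substitutes for a real database-specific command such as MySQL's `load data infile`, and all names are illustrative, not part of any actual tool:

```python
import csv
import os
import tempfile

def export_rows(rows, path):
    # Step 1: export source-table data to a delimiter-separated file on disk.
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

def bulk_load(path):
    # Step 2: stand-in for a database-specific bulk command such as
    # MySQL's LOAD DATA INFILE; here we simply read the file back.
    with open(path) as f:
        return [tuple(r) for r in csv.reader(f)]

rows = [("1", "alice"), ("2", "bob")]
path = os.path.join(tempfile.mkdtemp(), "test.txt")
export_rows(rows, path)   # must finish completely...
loaded = bulk_load(path)  # ...before loading can even begin
print(loaded)
```

The two calls cannot overlap: the file must be fully written before loading starts, which is exactly the serial bottleneck the invention removes.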
In view of the above, it is an urgent problem in the art to overcome the above-mentioned drawbacks of the prior art.
[ summary of the invention ]
The technical problems to be solved by the invention are as follows:
when batch loading is used in data synchronization, the data must be stored on disk as files in a specified format, so the source table data must first be exported to a file and then loaded with a batch loading command; executing these two steps serially makes the process complex, inefficient, and costly in performance. In addition, the batch loading commands and required data formats of different destination databases differ, so there is no universality.
The invention achieves the above purpose by the following technical scheme:
in a first aspect, the present invention provides a data synchronization method capable of implementing fast loading of data, where a producer thread, a task queue, and a consumer thread are arranged in a data loading node, and the data synchronization method includes:
reading data from a data source and putting the read data into a cache;
the producer thread generates a plurality of data files in batches from the data in the cache, and puts the generated data files into a task queue; wherein each data file contains one or more pieces of data;
the consumer thread calls a batch data loading command specific to a target database, and loads data files to be loaded in the task queue to the target database one by one;
and the data file generation process corresponding to the producer thread and the data batch loading process corresponding to the consumer thread are carried out in parallel.
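The parallel structure described in the steps above can be sketched with Python's standard `threading` and `queue` modules. The lists standing in for data files and for the destination table are illustrative simplifications, not part of the method:

```python
import queue
import threading

task_queue = queue.Queue(maxsize=3)   # bounded task queue
SENTINEL = object()                   # signals the end of the data

def producer(batches):
    # Split cached data into batch files (modelled as lists) and enqueue them.
    for batch in batches:
        task_queue.put(batch)         # blocks while the queue is full
    task_queue.put(SENTINEL)

destination = []                      # stand-in for the destination table

def consumer():
    # Take each batch off the queue and "bulk load" it into the destination.
    while True:
        batch = task_queue.get()
        if batch is SENTINEL:
            break
        destination.extend(batch)

batches = [[1, 2, 3], [4, 5, 6], [7]]
p = threading.Thread(target=producer, args=(batches,))
c = threading.Thread(target=consumer)
p.start(); c.start(); p.join(); c.join()
print(destination)                    # [1, 2, 3, 4, 5, 6, 7]
```

Because the producer and consumer run concurrently, loading of the first batch can begin while later batches are still being generated.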
Preferably, during the data synchronization process, the method further comprises:
counting the reading speed of data from a data source and the loading speed of data loaded to a target database in the synchronization process, and comparing the reading speed with the loading speed;
if the reading speed and the loading speed are comparable, then at the next data synchronization, after the data is placed into the cache, generating a virtual pipeline file in the Linux system and reading the data in the cache into the pipeline file; then pointing the batch loading command specific to the destination database at the pipeline file and loading the data in the pipeline file into the destination database with the batch loading command;
and if the reading speed and the loading speed are not comparable, then at the next data synchronization, after the data is placed into the cache, the producer thread generates a plurality of data files in batches from the data in the cache and places the generated data files into the task queue.
Preferably, when generating the data file, the producer thread generates the data file in the corresponding format according to the format requirement of the destination database.
Preferably, the producer thread generates a plurality of data files in batches from the data in the cache, and puts the generated data files into the task queue, specifically including:
the producer thread reads data to be loaded from the cache and writes the read data into a temporarily generated data file;
and the producer thread judges whether the data volume in the current data file reaches a preset threshold value, and if the data volume in the current data file does not reach the preset threshold value, the producer thread continues to read the next data to be loaded from the cache and write the next data into the current data file.
Preferably, if the data volume in the current data file reaches a preset threshold value, continuously checking whether the task number in the task queue reaches the maximum task number limit;
if the number of the tasks in the task queue does not reach the maximum task number limit, the producer thread puts the currently written data file into the task queue;
and if the number of the tasks in the task queue reaches the maximum task number limit, the producer thread waits for the consumer thread to take out the tasks from the task queue, and then puts the currently written data file into the task queue.
Preferably, the consumer thread calls a batch data loading command specific to the destination database, and loads the data files to be loaded in the task queue to the destination database one by one, and the method specifically includes:
the consumer thread takes out the data files to be loaded from the task queue and generates a batch data loading command for loading the current data files according to the target database;
the consumer thread executes the batch data loading command, and then loads the currently taken data file to a target database;
and the consumer thread deletes the loaded data file, checks whether the data file to be loaded still exists in the task queue, and continues to load the next data file if the data file to be loaded exists in the task queue.
Preferably, at least two consumer threads are provided, and when data loading is performed, different data files in the task queue are loaded to the destination database by each consumer thread in a parallel mode.
Preferably, the data source is a relational database, a message queue, an XML file, a KV database, or a document database.
Preferably, the destination database is a relational database.
In a second aspect, the present invention provides a data synchronization apparatus capable of implementing data fast loading, including at least one processor and a memory, where the at least one processor and the memory are connected through a data bus, and the memory stores instructions executable by the at least one processor, where the instructions are used to complete the data synchronization method capable of implementing data fast loading according to the first aspect after being executed by the processor.
The invention has the beneficial effects that:
The data synchronization method provided by the invention improves the data loading process: a producer thread, a task queue, and a consumer thread are arranged in the data loading node, and a producer-consumer multi-thread model is adopted. During data loading, data files in the required format are continuously generated in batches, and a batch data loading command specific to the destination database is called to load the data. The generation of the data files and the execution of the data loading command proceed in parallel, and the two traditional steps are merged into one, which reduces the performance loss of serial execution, improves synchronization efficiency, and simplifies the flow configuration.
[ description of the drawings ]
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a diagram illustrating a conventional data synchronization process;
FIG. 2 is a flow chart of data loading when loading data using JDBC or ODBC interfaces in a conventional data synchronization process;
FIG. 3 is a diagram illustrating a data synchronization process and a multithreading mode according to an embodiment of the present invention;
fig. 4 is a flowchart of a data synchronization method for fast loading data according to an embodiment of the present invention;
FIG. 5 is a flowchart illustrating operations of a producer thread during data loading according to an embodiment of the present invention;
FIG. 6 is a flowchart illustrating the operation of a consumer thread during data loading according to an embodiment of the present invention;
fig. 7 is an architecture diagram of a data synchronization apparatus capable of implementing fast data loading according to an embodiment of the present invention.
[ detailed description ]
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In the embodiments of the present invention, the symbol "/" denotes "or", and "A and/or B" indicates that the combination of the objects it connects covers three cases: "A", "B", and "A and B".
In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other. The invention will be described in detail below with reference to the figures and examples.
Example 1:
the embodiment of the invention provides a data synchronization method capable of realizing rapid loading of data, wherein a target database (namely a target end) is a relational database, and a data source end is not limited and can be various types of data sources, such as a relational database, a message queue, an XML file, a KV database, a document database and the like.
Compared with the traditional data synchronization process, the method mainly improves the data loading step so that it can exploit the batch loading commands specific to various databases, thereby improving performance. As shown in fig. 3, in the present invention the data loading node does not simply fetch data from the data cache and then call the JDBC/ODBC interface to write the data into the destination database; instead it adopts a producer-consumer concurrency model in which multiple threads work in parallel, that is, a producer thread, a task queue, and a consumer thread are arranged in the data loading node. Based on the model in fig. 3, the data synchronization method provided in the embodiment of the present invention may refer to fig. 4 and mainly includes the following steps:
step 201, reading data from a data source, and placing the read data into a cache.
This step is data reading, and the data in the data source is read into the cache one by one, which is the same as the data reading step in the conventional data synchronization process, and is not described herein again.
Step 202, the producer thread generates a plurality of data files in batches from the data in the cache, and puts the generated data files into a task queue; wherein each data file contains one or more pieces of data.
Step 203, the consumer thread calls a batch data loading command specific to the target database, and loads the data files to be loaded in the task queue to the target database one by one.
Wherein, the step 202 and the step 203 together form a data loading process: step 202 is the generation of a data file, completed by the producer thread; step 203 is the execution of a bulk data load command, completed by the consumer thread. When generating the data file, the producer thread can generate the data file with a corresponding format according to the format requirement of the destination database.
When generating a data file, the producer thread may generate a plurality of data files in batch continuously, for example, one data file is generated for every 10000 pieces of data in the cache; after the producer thread generates a data file and writes the data file into the task queue, the producer thread continues to generate a next data file; and when the producer thread generates the data file, the consumer thread executes a batch data loading command on the existing data files to be loaded in the task queue, and loads the data files to the target database one by one. That is to say, the data file generation process corresponding to the producer thread and the data batch loading process corresponding to the consumer thread are performed in parallel, and the two threads do not interfere with each other, so that the performance loss in the conventional serial execution can be reduced, and the loading efficiency can be improved.
In the traditional data loading process, the source table data must be exported to a file and then loaded with a batch loading command, two steps executed serially, and only a single file is generated. That single file can grow too large, occupying too much disk space, and because loading a single file takes too long, the status of the interface statistics is easily left stale. In the embodiment of the invention, a plurality of data files can be continuously generated in batches and the loading command executed on each data file separately, which avoids an oversized single file occupying excessive disk space and taking too long to load, and keeps the interface statistics up to date. In addition, by dynamically generating the data files and executing the batch data loading command during loading, the invention in practice merges the traditional two-step operation into a single step, simplifying the flow configuration.
The data synchronization method provided by the embodiment of the invention mainly improves the data loading process: a producer thread, a task queue, and a consumer thread are arranged in the data loading node, a producer-consumer concurrency model is adopted, and multiple threads cooperate in parallel. During data loading, data files in the required format are continuously generated in batches, and a batch data loading command specific to the destination database is called to load the data. The generation of the data files and the execution of the loading command proceed in parallel, merging the two traditional steps into one, which reduces the performance loss of serial execution, improves synchronization efficiency, and simplifies the flow configuration.
In addition to the producer-consumer multi-threaded task processing model shown in fig. 3, pipeline files can be generated to achieve fast batch loading of data. In that case, during the data synchronization process, the method further includes:
Counting the reading speed of data from the data source and the loading speed of data into the destination database during synchronization, and comparing the two. The reading speed and the loading speed can be obtained from related test measurements, or by counting the amount of data read from the data source and the amount loaded into the destination database within a specified time and computing the corresponding rates. Both speeds depend mainly on the hardware configuration of the source and destination servers and on the resources consumed by the services they execute: the more services the source and destination servers run, the more the rates are affected.
If the comparison shows that the reading speed and the loading speed are comparable, then at the next data synchronization, after data is read from the data source and put into the cache, a virtual pipeline file can first be generated in the Linux system, and the data in the cache is read into the pipeline file; the batch loading command specific to the destination database is then pointed at the pipeline file and used to load the data in the pipeline file into the destination database. "Comparable" can be defined by a preset difference value: if the difference between the two speeds is smaller than the preset value, the speeds are considered comparable; if it is larger, they are not. The pipeline file is a virtual file that the Linux server simulates in its own memory and that does not actually exist on disk, so this loading mode needs no intermediate file; batch loading is achieved through a virtual transmission mode, which further simplifies the loading process and improves loading efficiency.
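The mode selection based on the preset difference value can be sketched as follows; the value of `preset_diff` is illustrative, since the method leaves it configurable:

```python
def choose_loading_mode(read_rate, load_rate, preset_diff=1_000):
    """Pick the loading mode for the next synchronization run.

    Rates are in rows/second; preset_diff is an illustrative value for
    the preset difference value described in the text.
    """
    if abs(read_rate - load_rate) < preset_diff:
        return "pipeline-file"       # speeds comparable: stream via a FIFO
    return "producer-consumer"       # speeds differ: batch files + task queue

print(choose_loading_mode(10_000, 9_500))   # pipeline-file
print(choose_loading_mode(10_000, 2_000))   # producer-consumer
```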
If the comparison shows that the reading speed and the loading speed are not comparable, then at the next data synchronization, after the data is read from the data source and put into the cache, the producer thread generates a plurality of data files in batches from the data in the cache and places them into the task queue; that is, step 202 and step 203 continue to execute as shown in fig. 3 and fig. 4, and the fast batch loading of the data is completed with the producer-consumer multi-thread task processing model.
Since the pipeline file is generated with the pipe facility of the Linux system server, the loading mode can also be selected directly according to the operating system type, in addition to comparing the speeds. On a Linux system, the pipeline-file loading mode can be adopted directly, skipping the generation of an intermediate file. On a non-Linux system such as Windows, data loading is implemented with the producer-consumer multi-thread task processing model, dynamically generating data files and executing the batch loading command. Either way, data can be loaded quickly with the batch data loading command specific to the destination database.
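On a POSIX system, the pipeline-file mode can be sketched with a named pipe (FIFO): the writer thread plays the cache being drained into the pipeline file, and the reader thread plays the bulk-load command pointed at that file. This is a minimal illustration, not the actual loader:

```python
import os
import tempfile
import threading

fifo = os.path.join(tempfile.mkdtemp(), "sync.pipe")
os.mkfifo(fifo)                 # virtual file: the data never hits the disk

def writer():
    # Plays the role of the cache being drained into the pipeline file.
    with open(fifo, "w") as f:
        f.write("1,alice\n2,bob\n")

loaded = []

def reader():
    # Plays the role of the bulk-load command pointed at the pipeline file,
    # e.g. LOAD DATA INFILE '/path/sync.pipe' on the destination database.
    with open(fifo) as f:
        loaded.extend(line.rstrip("\n") for line in f)

w = threading.Thread(target=writer)
r = threading.Thread(target=reader)
w.start(); r.start(); w.join(); r.join()
print(loaded)                   # ['1,alice', '2,bob']
```

Opening a FIFO blocks until both ends are attached, which is why writer and reader must run in separate threads; in the method itself they are separate processes (the synchronization tool and the database's load command).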
Further, as shown in step 202, the main function of the producer thread is to generate a temporary data file from the data in the buffer and place the generated data file into the task queue. The operation flow can refer to fig. 5, and mainly includes the following steps:
first, the producer thread reads a piece of data to be loaded from a cache.
Secondly, the producer thread writes the read data into a temporarily generated data file; for convenience of description herein, the currently generated data file may be referred to as data file a.
Thirdly, the producer thread determines whether the amount of data in the current data file a has reached a preset threshold, i.e. a maximum limit, for example 500,000 rows. The size of the data file is limited because the data file loaded when the consumer thread executes the fast-load command must be complete; if the file size were not limited and all the data were written directly into a single data file, the consumer thread would have to wait for the producer thread to finish writing everything before loading could begin. Although multiple threads would be running, execution would essentially be sequential, which defeats the performance gain; hence data is written to data files in batches. In short, the preset threshold exists so that data files are generated in batches, guaranteeing that data file generation and batch loading run in parallel.
And if the data volume in the current data file a does not reach the preset threshold value, the producer thread continues to read the next data to be loaded from the cache and write the next data into the current data file a, and then the judging process is continuously repeated until the data volume in the current data file a reaches the preset threshold value.
If the data amount in the current data file a reaches the preset threshold, the data file a is written completely, and the data cannot be written into the data file a any more, and then the fourth step is executed downwards.
Fourthly, the producer thread checks whether the number of tasks in the task queue has reached the maximum task-number limit, i.e. whether the task queue is full. The limit exists to prevent the tasks waiting in the queue from occupying too much disk space. Because writing a local file is much faster than reading a file into the database, i.e. tasks are generated much faster than they are executed, the task queue is non-empty most of the time; the limit therefore does not need to be large to ensure that a consumer thread rarely waits when fetching a task, which preserves the concurrency of the system. In an embodiment of the present invention, the maximum task-number limit of the task queue may be set to 3.
And if the number of the tasks in the task queue does not reach the maximum task number limit, the producer thread puts the currently written data file a into the task queue and continues to execute the fifth step downwards.
And if the number of the tasks in the task queue reaches the maximum task number limit, the producer thread needs to wait for the consumer thread to take out the tasks from the task queue, then put the currently written data file a into the task queue, and continue to execute the fifth step downwards.
Fifthly, judging whether the reading of the data in the cache is finished. If the cache has data, the producer thread continues to read the next piece of data to be loaded from the cache and writes the data into the newly generated data file b, namely, the next batch of data is written, and the process is repeatedly executed until the synchronization is completed; if the cache has no data, the whole synchronization process is finished, and the operation is finished.
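The producer flow above (write until the threshold is reached, then enqueue) can be sketched as follows, with lists standing in for temporary data files and illustrative values for the threshold and queue limit:

```python
import queue

THRESHOLD = 3                      # illustrative; the text suggests e.g. 500,000 rows
task_queue = queue.Queue(maxsize=3)

def producer(cache):
    """Drain the cache into batch files (modelled as lists of rows)."""
    current = []                   # the data file currently being written
    for row in cache:
        current.append(row)
        if len(current) >= THRESHOLD:   # file a has reached the threshold
            task_queue.put(current)     # blocks if the queue is full
            current = []                # start the next file b
    if current:                         # final, partially filled file
        task_queue.put(current)

producer(["r%d" % i for i in range(7)])
files = []
while not task_queue.empty():
    files.append(task_queue.get())
print(files)    # [['r0', 'r1', 'r2'], ['r3', 'r4', 'r5'], ['r6']]
```

The bounded `queue.Queue` gives the fourth step for free: `put` blocks when the queue is full, exactly the wait-for-the-consumer behaviour the method describes.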
Further, as shown in step 203, the main function of the consumer thread is to call a batch data loading command specific to the destination database, and load the data file to be loaded in the task queue into the destination database. Since the speed of generating the data file (i.e., "write file" in fig. 3) is generally faster than the speed of loading the file into the database (i.e., "read file" in fig. 3), the number of consumer threads can be increased to further improve the performance of data loading, that is, at least two consumer threads are provided, and when data loading is performed, each consumer thread loads different data files in the task queue into the destination database in a parallel manner. The extent to which this approach can continue to improve loading performance depends on the concurrency performance of the destination database, and the extent of improvement may vary from database to database.
No matter what the set number of the consumer threads is, the operation flow of each consumer thread can refer to fig. 6, which mainly includes the following steps:
firstly, the consumer thread takes out a data file to be loaded from a task queue; for convenience of description, the currently fetched data file may be referred to as data file a.
Second, the consumer thread generates a batch data load command for loading the current data file a according to the destination database. Wherein the batch data load command is specific to the current destination database.
Thirdly, the consumer thread executes the batch data loading command, and then loads the currently taken data file a to a target database.
Fourthly, the consumer thread deletes the loaded data file a.
Fifth, the consumer thread checks whether there are more tasks in the task queue, i.e., whether there are more data files to load. If so, continuing to load the next data file, for example, if the next data file to be loaded in the task queue is b, continuing to load the data file b to the destination database by the consumer thread according to the process; and if no new task exists in the task queue, the whole synchronization process is finished, and the operation is finished.
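The consumer flow above can be sketched as follows. The command template follows the MySQL example from the background section, and reading the file stands in for actually executing the command against a destination database:

```python
import os
import tempfile

def build_load_command(path, table="t1"):
    # MySQL-flavoured template from the background section; other destination
    # databases would need their own command template.
    return "load data infile '%s' into table %s fields terminated by ','" % (path, table)

def consume(path, destination):
    cmd = build_load_command(path)          # step 2: generate the load command
    with open(path) as f:                   # step 3: stand-in for executing it
        destination.extend(line.rstrip("\n") for line in f)
    os.remove(path)                         # step 4: delete the loaded file
    return cmd

dest = []
path = os.path.join(tempfile.mkdtemp(), "a.txt")
with open(path, "w") as f:
    f.write("1,alice\n2,bob\n")
cmd = consume(path, dest)
print(cmd)
print(dest)                                 # ['1,alice', '2,bob']
```

A real consumer thread would execute `cmd` through the destination database's client instead of reading the file itself, then loop back to the queue for the next file.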
In summary, the data synchronization method provided by the embodiment of the present invention has the following advantages:
when the data is loaded, a multi-thread model of a producer-consumer is adopted, a batch data loading command special for a target database can be called to load the data, the generation of the data file and the execution of the data loading command can be carried out in parallel, the performance loss during serial execution is reduced, and the synchronization efficiency is improved;
during data loading, a mode of dynamically generating data files and executing a quick loading command is adopted, and two steps of operation are actually combined into one step of operation, so that the flow configuration is simplified;
when the data files are generated, a plurality of data files are continuously generated in batches, and loading commands are respectively executed on each file, so that the problems that a single file is generated to be overlarge, and the space of a disk is excessively occupied and the data loading time of the single file is excessively long are solved;
when the reading speed and the loading speed are comparable, or when the method runs on a Linux system, batch loading of the data can be completed directly through a pipeline file without generating an intermediate file, further simplifying the loading process and improving loading efficiency;
at least two consumer threads can be set, when data loading is carried out, different data files in the task queue are respectively loaded to the target database by each consumer thread in a parallel mode, and the data loading performance is further improved by increasing the number of the consumer threads.
Example 2:
on the basis of the data synchronization method capable of implementing fast data loading provided in embodiment 1, the present invention further provides a data synchronization apparatus capable of implementing fast data loading, which is used to implement the method described above, and as shown in fig. 7, is a schematic diagram of an apparatus architecture in an embodiment of the present invention. The data synchronization device capable of realizing data fast loading of the embodiment comprises one or more processors 21 and a memory 22. In fig. 7, one processor 21 is taken as an example.
The processor 21 and the memory 22 may be connected by a bus or other means, and fig. 7 illustrates the connection by a bus as an example.
The memory 22, which is a non-volatile computer-readable storage medium for implementing a data synchronization method for fast loading data, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the data synchronization method for fast loading data in embodiment 1. The processor 21 executes various functional applications and data processing of the data synchronization apparatus capable of realizing fast loading of data, that is, implements the data synchronization method capable of realizing fast loading of data according to embodiment 1, by executing the nonvolatile software program, instructions, and modules stored in the memory 22.
The memory 22 may include high speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory 22 may optionally include memory located remotely from the processor 21, and these remote memories may be connected to the processor 21 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The program instructions/modules are stored in the memory 22, and when executed by the one or more processors 21, perform the data synchronization method capable of implementing data fast loading in the above embodiment 1, for example, perform the steps shown in fig. 4 to fig. 6 described above.
Those of ordinary skill in the art will appreciate that all or part of the steps of the various methods of the embodiments may be implemented by associated hardware as instructed by a program, which may be stored on a computer-readable storage medium, which may include: a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic or optical disk, or the like.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.