CN115658638A

CN115658638A - File importing method of distributed database

Info

Publication number: CN115658638A
Application number: CN202211275213.6A
Authority: CN
Inventors: 王嘉豪; 张黎敏
Original assignee: Beijing Oceanbase Technology Co Ltd
Current assignee: Beijing Oceanbase Technology Co Ltd
Priority date: 2022-10-18
Filing date: 2022-10-18
Publication date: 2023-01-31

Abstract

The present specification provides a file importing method for a distributed database, including: in response to a database import task for a database file, a first thread reads buffer segments of the database file from a buffer, each buffer segment containing partial data of the database file; for each buffer segment read, the first thread performs a delimiting operation to determine the complete data and the remaining incomplete data in the buffer segment, creates a corresponding database import subtask for the complete data, distributes the database import subtask to a subtask processing thread group, and records the remaining incomplete data so that it forms complete data within the next buffer segment read by the first thread; and a second thread, upon determining that it belongs to the subtask processing thread group, processes the distributed database import subtask so as to send the complete data indicated by the database import subtask to the corresponding node in the distributed database.

Description

File importing method of distributed database

Technical Field

The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for importing files into a distributed database.

Background

Data import is a common database function and is often used in scenarios such as data migration and regression testing. When a distributed database imports data from a database file, data in the same database file may be written to different nodes. Meanwhile, as the distributed database scales up and database files grow larger, the demands on the efficiency of importing each database file increase.

In the related art, a database file is generally pre-segmented into a plurality of small files, which are then imported by a plurality of clients to improve data import efficiency. However, segmenting the database file consumes additional resources and occupies additional storage space, which greatly limits the data import efficiency.

Disclosure of Invention

In view of this, the present specification provides a method and an apparatus for importing files into a distributed database to solve the deficiencies in the related art.

Specifically, the specification is realized through the following technical scheme:

according to a first aspect of embodiments of the present specification, there is provided a file import method for a distributed database, including:

in response to a database import task for a database file, a first thread reads buffer segments of the database file from a buffer, each buffer segment containing partial data of the database file;

for each buffer segment read, the first thread: performs a delimiting operation to determine the complete data and the remaining incomplete data in the buffer segment, creates a corresponding database import subtask for the complete data, distributes the database import subtask to a subtask processing thread group, and records the remaining incomplete data so that it forms complete data within the next buffer segment read by the first thread;

and a second thread, upon determining that it belongs to the subtask processing thread group, processes the distributed database import subtask so as to send the complete data indicated by the database import subtask to the corresponding node in the distributed database.

According to a second aspect of embodiments of the present specification, there is provided a file import method for a distributed database, including:

the first thread, for each buffered segment read: executing delimitation operation to determine complete data and residual incomplete data in the buffer segment, creating a corresponding database import subtask aiming at the complete data and distributing the database import subtask to a subtask processing thread group, so that a second thread processes the distributed database import subtask under the condition that the second thread determines that the second thread belongs to the subtask processing thread group, and sends the complete data indicated by the database import subtask to a corresponding node in the distributed database; and recording the remaining incomplete data for constituting complete data in a later buffered segment read by the first thread.

According to a third aspect of embodiments of the present specification, a file importing apparatus for a distributed database includes:

a buffer segment reading unit configured to cause a first thread to read, in response to a database import task for a database file, buffer segments of the database file from a buffer, each buffer segment containing partial data of the database file;

a buffer segment processing unit configured to cause the first thread to, for each buffer segment read: perform a delimiting operation to determine the complete data and the remaining incomplete data in the buffer segment, create a corresponding database import subtask for the complete data, distribute the database import subtask to a subtask processing thread group, and record the remaining incomplete data so that it forms complete data within the next buffer segment read by the first thread;

and the database import subtask processing unit is used for processing the distributed database import subtask by the second thread under the condition that the second thread is determined to belong to the subtask processing thread group, so that the complete data indicated by the database import subtask is sent to the corresponding node in the distributed database.

According to a fourth aspect of embodiments of the present specification, a file importing apparatus of a distributed database includes:

a buffer segment processing unit to cause the first thread to, for each buffer segment read: executing delimitation operation to determine complete data and residual incomplete data in the buffer segment, creating a corresponding database import subtask aiming at the complete data and distributing the database import subtask to a subtask processing thread group, so that a second thread processes the distributed database import subtask under the condition that the second thread determines that the second thread belongs to the subtask processing thread group, and sends the complete data indicated by the database import subtask to a corresponding node in the distributed database; and recording the remaining incomplete data for constituting complete data in a later buffer segment read by the first thread.

According to a fifth aspect of embodiments herein, there is provided a computer readable storage medium, on which a computer program is stored, which when executed by a processor, performs the steps of the method according to the first and second aspects.

According to a sixth aspect of embodiments herein, there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method according to the first and second aspects when executing the program.

In the technical solution provided by this specification, the delimiting operation is performed on the buffer segments in the buffer, so that the first thread segments the entire database file while reading it. This avoids the resource consumption of segmenting the file separately and the disk space needed to store the segmented files. Meanwhile, the database import subtasks corresponding to the complete data are processed by the subtask processing thread group, which improves the efficiency of importing the data of the database file into the corresponding nodes of the distributed database.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the specification.

Drawings

In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the present disclosure, and other drawings can be obtained by those skilled in the art according to the drawings.

FIG. 1 is an architectural diagram of a file import system for a distributed database, according to an exemplary embodiment of the present disclosure;

FIG. 2 is a flowchart illustrating a method for importing files into a distributed database according to an exemplary embodiment of the present disclosure;

FIG. 3 is a schematic diagram of a delimiting operation shown in an exemplary embodiment of the present description;

FIG. 4 is a schematic diagram of another delimiting operation shown in an exemplary embodiment of the present description;

FIG. 5 is a diagram illustrating an example embodiment of the present specification showing the import of a database file into a distributed database;

FIG. 6 is a flowchart illustrating another method for importing files into a distributed database according to an exemplary embodiment of the present disclosure;

FIG. 7 is a schematic block diagram of an electronic device shown in an exemplary embodiment of the present description;

FIG. 8 is a schematic structural diagram of a file importing apparatus of a distributed database according to an exemplary embodiment of the present specification;

FIG. 9 is a schematic structural diagram of another file importing apparatus for a distributed database according to an exemplary embodiment of the present specification.

Detailed Description

Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present specification. Rather, they are merely examples of apparatus and methods consistent with aspects of the present description.

The terminology used in the description herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the description. As used in this specification, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, the first information may also be referred to as second information, and similarly, the second information may also be referred to as first information, without departing from the scope of the present specification. The word "if," as used herein, may be interpreted as "at the time of ……" or "when ……" or "in response to a determination," depending on the context.

In the related art, the data import function means that a database imports data according to the content of an external file. Depending on the architecture of the database, the function applies either to a stand-alone database or to a distributed database, and the data import speed in the two cases is affected by different factors. In the first case, the import speed mainly depends on the performance of the storage medium the database writes to. In the second case, assuming that the database adopts a partition table design, the data of the partition table may be distributed over different nodes of the distributed database, so the data of the external file may be fragmented and sent to the nodes partition by partition using the threads described below, and the import speed therefore mainly depends on the fragmentation efficiency of those threads. The definition and implementation of partition tables are well documented in the related art and are not described in detail in this application.

However, in the second case described above, as the distributed database scales up, the number of nodes and the number of partitions of the partition table increase, so the thread that fragments the data of the external file easily becomes the bottleneck of the data import speed, and the file segments produced by fragmentation occupy a large amount of additional storage space, further raising the barrier to efficient data import in the distributed database. Based on this, the present specification proposes the following technical solutions to solve the above problems.

Fig. 1 is a schematic architecture diagram of a file import system of a distributed database according to an exemplary embodiment of the present specification. As shown in fig. 1, the system includes a distributed database 11 and a client 12.

The distributed database 11 includes M nodes to store corresponding database data, each node may be configured as a server connected to the client 12, and each node may be a physical server including an independent host or a virtual server carried by a host cluster. Wherein M is a positive integer greater than 1. In the operation process of the system, any node in the distributed database 11 may act as a server to respond to an instruction initiated by the client 12 to perform a corresponding database data operation on any node.

The client 12 is an electronic device that can access the nodes of the distributed database 11. The client 12 runs a first thread 121 and a subtask thread group 122 containing N second threads. Wherein N is a positive integer greater than or equal to 1. During the operation of the system, the first thread 121 may read a database file (i.e., the external file), and distribute data contained in the database file to the subtask thread group 122 in the form of a plurality of database import subtasks, so that a thread in the subtask thread group 122 may send complete data indicated by the database import subtask to a corresponding node in the distributed database 11.
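The division of labor between the first thread 121 and the subtask thread group 122 resembles a producer feeding a pool of consumers. The following is a minimal Python sketch under stated assumptions, not the implementation of this specification: the shared queue, the stop sentinel, the names `first_thread`, `second_thread` and `run_import`, and the `send` callback that stands in for transmission to a database node are all hypothetical.

```python
import queue
import threading

_STOP = object()  # hypothetical sentinel telling a second thread to exit


def first_thread(complete_pieces, tasks, n_workers):
    """Turn each piece of complete data into one import subtask."""
    for complete_data in complete_pieces:
        tasks.put(complete_data)
    # One sentinel per second thread so that every worker terminates.
    for _ in range(n_workers):
        tasks.put(_STOP)


def second_thread(tasks, send):
    """Process subtasks until the stop sentinel arrives."""
    while True:
        task = tasks.get()
        if task is _STOP:
            break
        send(task)  # stand-in for sending the data to a database node


def run_import(complete_pieces, n_workers, send):
    """Run one first thread (the caller) and n_workers second threads."""
    tasks = queue.Queue()
    workers = [
        threading.Thread(target=second_thread, args=(tasks, send))
        for _ in range(n_workers)
    ]
    for worker in workers:
        worker.start()
    first_thread(complete_pieces, tasks, n_workers)
    for worker in workers:
        worker.join()
```

In this sketch the order in which subtasks reach `send` is not guaranteed, which matches the idea that any second thread in the group may pick up any subtask.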

Although the distributed database 11 and the client 12 are depicted separately in fig. 1, the client 12 may be deployed at any node in the distributed database 11. Of course, the client 12 may also be deployed in an electronic device independent of all nodes, and the electronic device may be, for example, a mobile phone, a tablet device, a notebook computer, a personal digital assistant (PDA), or a wearable device (such as smart glasses or a smart watch); one or more embodiments of the present disclosure are not limited thereto.

The technical solution of the present specification is explained below with reference to the embodiment shown in fig. 2. Fig. 2 is a flowchart illustrating a file importing method for a distributed database according to an exemplary embodiment of the present specification, where as shown in fig. 2, the method may include the following steps:

S201, in response to a database import task for a database file, a first thread reads buffer segments of the database file from a buffer, wherein each buffer segment contains partial data of the database file.

When a user actively triggers an import, or a preset condition is automatically triggered while the client executes other tasks, so that the data contained in a database file is to be imported into the distributed database, the client on which the first thread runs can initiate a database import task to the first thread. In response to the database import task, the first thread determines the database file and reads the buffer corresponding to the database file in memory. The database file may have been exported from another database, or it may come from any other source, which is not limited in this specification.

S202, for each buffer segment read, the first thread: performs a delimiting operation to determine the complete data and the remaining incomplete data in the buffer segment, creates a corresponding database import subtask for the complete data, distributes the database import subtask to a subtask processing thread group, and records the remaining incomplete data so that it forms complete data within the next buffer segment read by the first thread.

Because the size of the database file is uncertain while the storage space of the buffer is fixed, the data of each buffer segment read from the buffer by the first thread may be incomplete, which could lead to missing data. The first thread therefore performs the delimiting operation on the buffer segment to determine the complete data and the remaining incomplete data in it, and distributes the complete data to the subtask processing thread group as a single database import subtask. Specifically, the delimiting operation may have either of the following results: 1. the buffer segment consists entirely of complete data; 2. the buffer segment contains both complete data and incomplete data. "Incomplete data" is the part of the buffer segment that does not amount to the minimum unit of data the database operates on (e.g., one row record of a data table). The 2nd result can be further divided according to where the data sits in the buffer segment: for example, if the buffer stores data sequentially from head to tail (assuming low address to high address), the buffer segment may take the form "complete data-incomplete data", "incomplete data-complete data", or "incomplete data-complete data-incomplete data".

It will be appreciated by those skilled in the art that, in any of these cases, the presence of incomplete data in one buffer segment necessarily means the presence of incomplete data in another buffer segment. Specifically, if incomplete data A exists at the tail of the buffer corresponding to the current buffer segment, then incomplete data B, which complements incomplete data A, exists at the head of the buffer corresponding to the next buffer segment read by the first thread. Therefore, the first thread can record the incomplete data obtained by performing the delimiting operation on the buffer segment, and the recorded incomplete data together with the incomplete data in the next buffer segment read by the first thread forms complete data.
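The delimiting operation of S202 can be illustrated with a short Python sketch. It assumes, purely for illustration, that a complete record ends with a newline separator and that any incomplete data carried over from the previous segment has already been placed at the head of the segment; the function name `delimit` and its parameters are hypothetical.

```python
def delimit(segment: bytes, separator: bytes = b"\n"):
    """Split a buffer segment into (complete_data, remaining_incomplete_data).

    Everything up to and including the last separator is complete data;
    whatever follows it is the remaining incomplete data.
    """
    last = segment.rfind(separator)
    if last == -1:
        # No separator at all: the whole segment is still incomplete.
        return b"", segment
    cut = last + len(separator)
    return segment[:cut], segment[cut:]
```

For a segment such as `b"1,a\n2,b\n3,"`, the first two rows form the complete data and `b"3,"` is recorded as the remaining incomplete data.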

In an embodiment, in the case that the first thread determines that there is remaining incomplete data in the read buffer segment, the first thread may add the incomplete data to a head of a buffer corresponding to a subsequent buffer segment read by the first thread. In this embodiment, the incomplete data of the latter buffer segment that should be at the head of the corresponding buffer area will constitute new complete data with the added incomplete data. Taking fig. 3 as an example, there are two buffer areas, namely a first buffer area and a second buffer area, where the head of the first buffer area stores complete data, and the tail of the first buffer area stores incomplete data, so that the first thread can add the incomplete data at the tail of the first buffer area to the head of the second buffer area, thereby ensuring that the original incomplete data and the added incomplete data in the second buffer area are continuous, and further facilitating the two to form complete data in the second buffer area.
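Adding the recorded incomplete data to the head of the following buffer segment can be sketched as a loop that carries the tail forward. This is a minimal Python sketch, assuming newline-terminated records; the generator name `complete_chunks` is hypothetical, and emitting a final record that lacks a trailing separator is an assumption of the sketch rather than something this specification states.

```python
def complete_chunks(segments, separator=b"\n"):
    """Yield complete data, carrying incomplete tails into the next segment."""
    carry = b""
    for segment in segments:
        # The recorded tail is placed at the head of the next segment,
        # so the two incomplete halves become contiguous.
        data = carry + segment
        last = data.rfind(separator)
        if last == -1:
            carry = data  # still no complete record; keep accumulating
            continue
        cut = last + len(separator)
        yield data[:cut]
        carry = data[cut:]
    if carry:
        # Assumption: a last record without a trailing separator is kept.
        yield carry
```

With the segments `b"1,a\n2,"` and `b"b\n3,c\n"`, the tail `b"2,"` of the first segment joins `b"b\n"` at the head of the second, so the second piece of complete data is `b"2,b\n3,c\n"`.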

In an actual scenario, due to cost and other constraints, the number of buffers cannot grow without limit; instead, a fixed number of buffers is used. Once the database import subtask corresponding to a buffer is completed, that buffer can be reused, i.e., a new buffer segment can be written into it for the first thread to read. Therefore, the buffer corresponding to the current buffer segment read by the first thread and the buffer corresponding to the subsequent buffer segment may be the same buffer or different buffers.

In an embodiment, the buffer corresponding to the current buffer segment read by the first thread is a first buffer, and the buffer corresponding to the subsequent buffer segment is a second buffer. When the first buffer and the second buffer are the same, the remaining incomplete data is moved to the head of the current buffer; when they are different, the remaining incomplete data is written to the head of the second buffer and cleared from the first buffer. Taking fig. 4 as an example, if the first buffer and the second buffer are the same, the remaining incomplete data only needs to be moved from the tail to the head of that buffer. The case where the two buffers are different was discussed in the previous embodiment with reference to fig. 3 and is not repeated here.
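The two cases can be sketched in Python with fixed-size `bytearray` buffers. This is only an illustration under assumptions: `filled` is the number of valid bytes in the buffer, `cut` is the offset where the remaining incomplete data begins, clearing is modeled as overwriting with zero bytes, and both function names are hypothetical.

```python
def move_tail_to_head(buf: bytearray, filled: int, cut: int) -> int:
    """Same buffer: move buf[cut:filled] to the head; return its length."""
    tail_len = filled - cut
    # The right-hand slice is copied before assignment, so overlap is safe.
    buf[:tail_len] = buf[cut:filled]
    return tail_len


def move_tail_to_other(src: bytearray, filled: int, cut: int,
                       dst: bytearray) -> int:
    """Different buffers: write the tail to dst's head and clear it in src."""
    tail_len = filled - cut
    dst[:tail_len] = src[cut:filled]
    src[cut:filled] = bytes(tail_len)  # clear by overwriting with zeros
    return tail_len
```

In both cases the returned length tells the reader where the next segment's data should be appended so that it directly follows the moved incomplete data.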

As mentioned above, the database file may be written manually according to a specific rule or exported by a database. The specific rule may be determined by the data type and specific format of the database file. Classified by structure, data falls into three types: structured data, unstructured data, and semi-structured data. The data of a database is stored and arranged with inherent regularity and can therefore be regarded as structured data. Unstructured data has an irregular or incomplete structure and no predefined data model (e.g., audio or video), so it is not suitable for the database file. Semi-structured data lies between fully structured and fully unstructured data: it does not conform to the structure of the data models associated with relational databases or other forms of data tables, but it contains tags that separate semantic elements and impose a hierarchy on records and fields. It is therefore also called a self-describing structure, in which the format and the content of the data are mixed together without a clear distinction, and it is suitable for use as the database file whose content is imported into the database.

In an embodiment, the data in the database file may be semi-structured data, and the first thread may determine whether the current buffer segment is complete according to a preset data integrity rule and perform the delimiting operation according to the result. The data integrity rule includes: if the tail of the buffer corresponding to the current buffer segment read by the first thread is a line separator, the data of the buffer segment is determined to be complete; otherwise, it is determined to be incomplete. The integrity rule corresponds to the specific rule mentioned above: for example, if the specific rule for a database file specifies that each row of the database file is one complete piece of data, then the corresponding integrity rule can be based on whether the buffer corresponding to the current buffer segment ends with a line separator. It will be understood by those skilled in the art that the database file may be in Comma-Separated Values (CSV), JavaScript Object Notation (JSON), Extensible Markup Language (XML), or other formats. The data integrity rules for database files of different formats are substantially the same; the main difference lies in how the line separator is determined. For example, the line separator of a CSV file may be "\n", that of a JSON file may be the corresponding closing brace "}", and that of an XML file may be the corresponding end tag, which is not limited in this specification.

In summary, because the first thread performs the delimiting operation on the buffer segments in the buffer according to the data integrity rule, the semi-structured data is divided into pieces of complete data during reading, and these are processed directly by the second thread described below and sent to the corresponding database. This solves the problems in the related art that a database file containing semi-structured data cannot be divided directly and that the divided files occupy additional disk space.

S203, upon determining that it belongs to the subtask processing thread group, the second thread processes the distributed database import subtask so as to send the complete data indicated by the database import subtask to the corresponding node in the distributed database.

After the first thread completes the delimiting operation for a buffer segment, creates a database import subtask for the corresponding complete data, and distributes it to the subtask processing thread group, the database import subtask can be processed by a second thread in the group. Specifically, the second thread may send the complete data indicated by the database import subtask to the corresponding node in the distributed database. The distributed database may adopt a partition table design, so that the data of the database file can be partitioned in different ways.

In an embodiment, the database file involves a plurality of partitions of the distributed database located on different nodes. The second thread may parse the complete data and convert its data types, perform partition calculation on the processed complete data to obtain the partition data segment corresponding to each partition, serialize each partition data segment, and send the serialized partition data segments to the nodes where the corresponding partitions are located, so that each node imports the partition data segment it receives into the distributed database. Because a second thread only needs to process the data indicated by the database import subtask it obtains (i.e., part of the data of the database file), the number of threads in the subtask processing thread group can be increased as appropriate so that more threads process the data of the same database file simultaneously, allowing large volumes of database file data to be imported into the database in parallel. Meanwhile, combined with the way the first thread delimits semi-structured files in formats such as CSV, the scheme of this embodiment solves the problem in the related art that large volumes of semi-structured files cannot be imported in parallel, and greatly improves file import efficiency. In addition, the principles and implementations of data parsing, type conversion, partition calculation, and serialization are well documented in the related art and are not described herein again.
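The per-subtask pipeline of parsing, data type conversion, partition calculation, and serialization can be sketched as follows. This is a minimal Python sketch under assumptions that this specification does not state: each record is a two-field CSV row of the form `id,name`, the partition is computed as the id modulo the partition count, and JSON is used as a stand-in serialization format. The function name `process_subtask` is hypothetical.

```python
import json
from collections import defaultdict


def process_subtask(complete_data: str, partition_count: int):
    """Return {partition_key: serialized partition data segment}."""
    partitions = defaultdict(list)
    for line in complete_data.splitlines():
        if not line:
            continue
        user_id, name = line.split(",", 1)            # parsing
        record = {"id": int(user_id), "name": name}   # data type conversion
        key = record["id"] % partition_count          # partition calculation
        partitions[key].append(record)
    # Serialization: one payload per partition, ready to send to its node.
    return {
        key: json.dumps(rows).encode("utf-8")
        for key, rows in partitions.items()
    }
```

Each returned payload would then be sent to the node holding the corresponding partition; the sending step is omitted here because the transport is not specified.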

In the process that the second thread sends the serialized partition data segments to the nodes where the corresponding partitions are located, the mode of sending the partition data segments by the second thread can be changed according to actual needs.

In an embodiment, the second thread may directly send the partition data segment to a node where a corresponding partition is located, so as to ensure real-time performance of data in a corresponding database.

In another embodiment, if the user does not require the database to have higher real-time performance, the second thread may store the partition data segments into a preset space corresponding to the corresponding partition, and send the partition data segments in the preset space to the node where the corresponding partition is located in batch under the condition that the preset space meets a preset sending condition. The preset sending condition may be that data stored in a preset space reaches a preset threshold, or a storage duration of data stored earliest in the preset space exceeds a preset duration, and the like, which is not limited in this specification.
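The batched sending mode can be sketched with a small per-partition accumulator. This is a hypothetical Python sketch: the class name `PartitionBatcher`, the `send` callback, and the two thresholds (`max_items` for the amount of stored data, `max_age_seconds` for the age of the earliest stored data) are illustrative assumptions. For simplicity the age condition is only checked when new data arrives, whereas a fuller implementation would presumably also check it on a timer.

```python
import time


class PartitionBatcher:
    """Accumulate partition data segments and send them in batches."""

    def __init__(self, send, max_items=100, max_age_seconds=1.0):
        self._send = send                # callback: send(partition, segments)
        self._max_items = max_items
        self._max_age = max_age_seconds
        self._pending = {}               # partition -> (first_seen, segments)

    def add(self, partition, segment):
        first_seen, items = self._pending.setdefault(
            partition, (time.monotonic(), [])
        )
        items.append(segment)
        too_many = len(items) >= self._max_items
        too_old = time.monotonic() - first_seen >= self._max_age
        if too_many or too_old:
            self.flush(partition)

    def flush(self, partition):
        """Send whatever is pending for the partition, if anything."""
        entry = self._pending.pop(partition, None)
        if entry:
            self._send(partition, entry[1])
```

Calling `flush` for every partition at the end of the import ensures that data below both thresholds is still delivered.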

The scheme of this specification may set the way the buffer segments are obtained according to the actual situation, thereby changing how the local device stores the database file. For example, a buffer segment may be data written into the buffer after the first thread performs a streaming read on the database file. A streaming read needs to read the file on a disk or network only once; compared with the related art, in which parsing semi-structured data requires scanning the database file multiple times, it incurs lower IO overhead, which prevents IO from becoming the bottleneck of the data import speed when the database file resides on a network file system (NFS) or in an object storage system.
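A streaming read into a fixed-size buffer can be sketched in Python as a single forward pass over a binary stream. The generator name `read_segments` and the default buffer size are assumptions for illustration; the sketch copies each segment out of the reused buffer so that callers can keep it.

```python
def read_segments(stream, buffer_size=4096):
    """Yield buffer segments from one forward pass over a binary stream."""
    buf = bytearray(buffer_size)  # fixed-size buffer, reused for every read
    while True:
        n = stream.readinto(buf)  # fills buf with up to buffer_size bytes
        if not n:
            break                 # end of stream
        yield bytes(buf[:n])      # copy out the valid part of the buffer
```

Because the stream is never rewound, each byte of the file is read once, which is the property the paragraph above contrasts with multi-pass scanning.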

The file import method is discussed below with reference to fig. 5, taking an example of importing data of a database file in the CSV format into a distributed database.

In one embodiment, assume that a client initiates a database import task for a database file in CSV format (filename "database file.csv"), the contents of which are as follows: "

user ID1, user name 1, household registration 1, contact information 1;

user ID2, user name 2, household registration 2, contact information 2;

user ID3, user name 3, household registration 3, contact information 3;

……

user ID99, user name 99, household registration 99, contact information 99;

user ID100, user name 100, household registration 100, contact information 100; "

Correspondingly, it is assumed that the distributed database stores a "user information table," and the distributed database splits the user information table into 3 partitions, namely "partition 1", "partition 2", and "partition 3", based on a Hash (Hash) partition, and each partition corresponds to a different node. Meanwhile, the rule corresponding to the Hash partition stipulates that: dividing the user ID by the data table record of 3 to 1 and storing the data table record in the partition 1; dividing the user ID by the data table records of 3 to 2 and storing the data table records in the partition 2; the user ID divided by the data table record of 3 to 0 is stored in partition 3. In an embodiment, as shown in fig. 5, in response to a database import task for the database file, the first thread reads the first buffer segment of the database file from the first buffer, and obtains: "user ID1, user name 1, household registration 1, contact means 1;

……

user ID50, user name 50, household registration 50, contact information 50;

user ID51, user name 51, user "

The first thread can determine that the data in the first buffer segment is incomplete by means of the data integrity rule corresponding to the delimiting operation (assume the rule is: if the tail of the buffer corresponding to the current buffer segment read by the first thread is a semicolon, the data of the buffer segment is determined to be complete; otherwise, it is determined to be incomplete). Based on this rule, the first thread takes the data from the head of the first buffer segment to the last semicolon as the complete data (i.e., "user ID1, user name 1, household registration 1, contact information 1; …… user ID50, user name 50, household registration 50, contact information 50;") and takes the data after the complete data as the remaining incomplete data (i.e., "user ID51, user name 51, user").
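The delimiting operation in this example amounts to cutting the buffer segment at its last semicolon. The sketch below illustrates that idea only; the function name `delimit` and the shortened sample records are hypothetical and are not taken from this specification.

```python
from typing import Tuple


def delimit(segment: str, separator: str = ";") -> Tuple[str, str]:
    """Split a buffer segment into (complete data, remaining incomplete data)."""
    cut = segment.rfind(separator)       # position of the last record separator
    if cut == -1:
        return "", segment               # no full record: everything is incomplete
    end = cut + len(separator)
    return segment[:end], segment[end:]  # the complete data keeps its separator


# Shortened stand-in for a buffer segment that ends in the middle of a record.
segment = "1,name1,reg1,contact1;2,name2,reg2,contact2;3,name3,re"
complete, remainder = delimit(segment)
```

When the segment already ends with the separator, `remainder` is an empty string, which corresponds to the case in which the whole buffer segment is complete data.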

For the complete data of the first buffer segment, the first thread can create a corresponding database import subtask and allocate it to the subtask processing thread group. Assume that the subtask processing thread group contains two idle threads, namely second thread 1 and second thread 2, and that the group allocates the database import subtask to second thread 1 for execution. Second thread 1 can then perform a series of operations, such as parsing, data type conversion, and partition calculation, on the complete data corresponding to the database import subtask (i.e., "user ID1, user name 1, household registration 1, contact information 1; …… user ID50, user name 50, household registration 50, contact information 50;") to obtain the partition data segments corresponding to the three partitions: the partition data segment corresponding to partition 1 (i.e., "user ID1, user name 1, household registration 1, contact information 1; user ID4, user name 4, household registration 4, contact information 4; …… user ID49, user name 49, household registration 49, contact information 49;"), the partition data segment corresponding to partition 2 (i.e., "user ID2, user name 2, household registration 2, contact information 2; user ID5, user name 5, household registration 5, contact information 5; …… user ID50, user name 50, household registration 50, contact information 50;"), and the partition data segment corresponding to partition 3 (i.e., "user ID3, user name 3, household registration 3, contact information 3; user ID6, user name 6, household registration 6, contact information 6; …… user ID48, user name 48, household registration 48, contact information 48;"). After each partition data segment is serialized, it can be sent to the node where the corresponding partition is located.
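The parsing, type conversion, and partition calculation performed by second thread 1 can be illustrated with the remainder rule of this example (remainder 1 to partition 1, remainder 2 to partition 2, remainder 0 to partition 3). This is a sketch under assumptions: the names `partition_of` and `split_by_partition` and the compact record layout `id,name,reg,contact` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def partition_of(user_id: int) -> int:
    """Remainder 1 -> partition 1, remainder 2 -> partition 2, remainder 0 -> partition 3."""
    remainder = user_id % 3
    return 3 if remainder == 0 else remainder


def split_by_partition(complete_data: str) -> Dict[int, List[str]]:
    """Group the records of one subtask into per-partition data segments."""
    fragments: Dict[int, List[str]] = defaultdict(list)
    for record in complete_data.split(";"):
        record = record.strip()
        if not record:                        # skip the empty piece after the last ';'
            continue
        user_id = int(record.split(",")[0])   # parsing plus type conversion
        fragments[partition_of(user_id)].append(record)
    return dict(fragments)


# Records 1 to 50, mirroring the complete data of the first buffer segment.
complete_data = ";".join(f"{i},name{i},reg{i},contact{i}" for i in range(1, 51)) + ";"
fragments = split_by_partition(complete_data)
```

With these inputs, partition 1 receives the records for IDs 1, 4, ..., 49, partition 2 those for IDs 2, 5, ..., 50, and partition 3 those for IDs 3, 6, ..., 48, matching the segments listed above.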

While second thread 1 processes the complete data, the first thread may write the remaining incomplete data of the first buffer segment into the head of the second buffer and clear it from the first buffer. The data subsequently read by the first thread is then spliced with the incomplete data (i.e., "user ID51, user name 51, user") to obtain the complete data corresponding to the second buffer segment (i.e., "user ID51, user name 51, household registration 51, contact information 51; …… user ID100, user name 100, household registration 100, contact information 100;").
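The splicing of carried-over incomplete data with newly read data can be sketched as a reader loop. The sketch is illustrative only: the name `complete_chunks`, the in-memory `io.StringIO` source, and the choice to emit a trailing record that lacks a final separator are assumptions, not details given in this specification.

```python
import io
from typing import Iterator, TextIO


def complete_chunks(source: TextIO, buffer_size: int, separator: str = ";") -> Iterator[str]:
    """Yield complete data per read, carrying incomplete data into the next read."""
    remainder = ""
    while True:
        data = source.read(buffer_size)
        if not data:
            break
        segment = remainder + data            # carried-over data goes to the head
        cut = segment.rfind(separator)
        if cut == -1:                         # still no full record: keep carrying
            remainder = segment
            continue
        end = cut + len(separator)
        yield segment[:end]
        remainder = segment[end:]
    if remainder:                             # assumption: emit an unterminated last record
        yield remainder


source = io.StringIO("1,a;2,b;3,c;4,d;")
chunks = list(complete_chunks(source, buffer_size=6))
```

Each yielded chunk ends on a record boundary, so it can be handed to a subtask without reference to any other chunk.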

For the complete data of the second buffer segment, the first thread may create a corresponding database import subtask and allocate it to the subtask processing thread group. Assuming that second thread 1 is still processing, the subtask processing thread group may allocate the database import subtask to second thread 2 for execution. Second thread 2 may then perform a series of operations, such as parsing, data type conversion, and partition calculation, on the complete data corresponding to the database import subtask (i.e., "user ID51, user name 51, household registration 51, contact information 51; …… user ID100, user name 100, household registration 100, contact information 100;") to obtain the partition data segments corresponding to the three partitions: the partition data segment corresponding to partition 1 (i.e., "user ID52, user name 52, household registration 52, contact information 52; user ID55, user name 55, household registration 55, contact information 55; …… user ID100, user name 100, household registration 100, contact information 100;"), the partition data segment corresponding to partition 2 (i.e., "user ID53, user name 53, household registration 53, contact information 53; user ID56, user name 56, household registration 56, contact information 56; …… user ID98, user name 98, household registration 98, contact information 98;"), and the partition data segment corresponding to partition 3 (i.e., "user ID51, user name 51, household registration 51, contact information 51; user ID54, user name 54, household registration 54, contact information 54; …… user ID99, user name 99, household registration 99, contact information 99;"). After each partition data segment is serialized, it can be sent to the node where the corresponding partition is located.

Second thread 1 and second thread 2 can process the complete data from the first buffer segment and the second buffer segment in parallel, which greatly improves the import efficiency of a single database file (namely, "database file.csv"). Finally, each partition is correctly imported with its corresponding data (see the bottom of fig. 5 for the specific data).
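The division of labor between the first thread and the subtask processing thread group can be sketched with a standard thread pool. This is an illustration under assumptions rather than the mechanism of this specification: `ThreadPoolExecutor` stands in for the thread group, `import_subtask` is a hypothetical subtask that only groups user IDs by partition, and no data is actually sent to any node.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def import_subtask(complete_data: str) -> Dict[int, List[int]]:
    """Hypothetical subtask: parse the records and group user IDs by partition."""
    result: Dict[int, List[int]] = {1: [], 2: [], 3: []}
    for record in filter(None, complete_data.split(";")):
        user_id = int(record.split(",")[0])
        result[user_id % 3 or 3].append(user_id)   # remainder 0 maps to partition 3
    return result


# Complete data of the first and second buffer segments (IDs 1-50 and 51-100).
chunks = [
    ";".join(f"{i},name{i}" for i in range(1, 51)) + ";",
    ";".join(f"{i},name{i}" for i in range(51, 101)) + ";",
]

# The submitting thread plays the role of the first thread; the pool of two
# workers plays the role of second thread 1 and second thread 2.
with ThreadPoolExecutor(max_workers=2) as group:
    futures = [group.submit(import_subtask, chunk) for chunk in chunks]
    results = [future.result() for future in futures]

merged = {p: sorted(results[0][p] + results[1][p]) for p in (1, 2, 3)}
```

Because every chunk contains only whole records, the two subtasks share no state and can run in either order.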

Fig. 6 is a flowchart illustrating another file importing method for a distributed database according to an exemplary embodiment of the present disclosure. As shown in fig. 6, the method is applied to a first thread, and the method includes the following steps:

S601, in response to a database import task for a database file, the first thread reads buffer segments of the database file from a buffer, wherein each buffer segment contains partial data of the database file.

As described above, when a user actively triggers the import, or a preset condition is triggered automatically while the client executes another task, so that the data contained in a database file is to be imported into the distributed database, the client where the first thread is located may initiate a database import task to the first thread. In response to the database import task, the first thread may determine the database file and read the buffer corresponding to the database file in memory. The database file may be exported from another database or may come from any other source, which is not limited in this specification.

S602, the first thread, for each buffered segment read: executing delimitation operation to determine complete data and residual incomplete data in the buffer segment, creating a corresponding database import subtask aiming at the complete data and distributing the database import subtask to a subtask processing thread group, so that a second thread processes the distributed database import subtask under the condition that the second thread determines that the second thread belongs to the subtask processing thread group, and sends the complete data indicated by the database import subtask to a corresponding node in the distributed database; and recording the remaining incomplete data for constituting complete data in a later buffered segment read by the first thread.

As described above, because the size of the database file is uncertain while the storage space of the buffer is fixed, the data in each buffer segment that the first thread reads from the buffer may be incomplete, which would lead to missing data. Therefore, the first thread needs to perform the delimiting operation on the buffer segment to determine the complete data and the remaining incomplete data in it, and the complete data is allocated to the subtask processing thread group as a single database import subtask. Specifically, the delimiting operation may have either of the following results: 1. the buffer segment consists entirely of complete data; 2. the buffer segment contains both complete data and incomplete data. The "incomplete data" is the part of the buffer segment that does not amount to the minimum unit of data that the database operates on (e.g., one row record of a data table). The 2nd result can be further divided according to where the data is stored in the buffer segment: for example, if the buffer stores data sequentially from head to tail (assuming from low address to high address), the buffer segment may take the form "complete data, incomplete data", "incomplete data, complete data", or "incomplete data, complete data, incomplete data". Those skilled in the art will appreciate that, in any of these cases, the presence of incomplete data in one buffer segment necessarily means that incomplete data is also present in another buffer segment. Specifically, if incomplete data A exists at the tail of the buffer corresponding to the current buffer segment, then incomplete data B, which corresponds to incomplete data A, will exist at the head of the buffer corresponding to the next buffer segment read by the first thread.

Therefore, the first thread can record the incomplete data obtained by performing the delimiting operation on the buffer segment, so that the recorded incomplete data and the incomplete data in the next buffer segment read by the first thread together form complete data.

As described above, in an embodiment, when the first thread determines that remaining incomplete data exists in the read buffer segment, it may add the incomplete data to the head of the buffer corresponding to the next buffer segment read by the first thread. In this embodiment, the added incomplete data and the incomplete data that would otherwise sit at the head of the buffer of the next buffer segment together constitute new complete data. Taking fig. 3 as an example, there are two buffers, namely a first buffer and a second buffer. The head of the first buffer stores complete data and its tail stores incomplete data, so the first thread can add the incomplete data at the tail of the first buffer to the head of the second buffer. This ensures that the added incomplete data and the original incomplete data in the second buffer are contiguous, so that the two form complete data in the second buffer.

As described above, in an actual scenario, cost and other factors prevent the number of buffers from growing without limit; instead, a fixed number of buffers is used. When the database import subtask corresponding to any buffer is completed, that buffer can be reused, i.e., a new buffer segment is written into it for the first thread to read. Therefore, the buffer corresponding to the current buffer segment read by the first thread and the buffer corresponding to the next buffer segment read may be the same or different.

As described above, in an embodiment, the buffer corresponding to the current buffer segment read by the first thread is the first buffer, and the buffer corresponding to the next buffer segment read by the first thread is the second buffer. When the first buffer is the same as the second buffer, the remaining incomplete data is moved to the head of the buffer in which it currently resides; when the first buffer is different from the second buffer, the remaining incomplete data is written into the head of the second buffer and cleared from the first buffer. Taking fig. 4 as an example, if the first buffer is the same as the second buffer, the remaining incomplete data only needs to be moved from the tail of that buffer to its head. The case in which the two buffers differ is discussed in the previous embodiment with reference to fig. 3 and is not described again here.
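The two cases above (same buffer and different buffers) can be sketched with fixed-size byte buffers. This is a minimal illustration under assumptions: the function name `carry_over`, its index parameters, and the use of zero bytes to represent "cleared" data are hypothetical choices.

```python
def carry_over(first: bytearray, start: int, end: int, second: bytearray) -> int:
    """Place first[start:end] at the head of `second`; return the number of bytes carried."""
    length = end - start
    if first is second:
        # Same buffer: move the incomplete data from the tail to the head.
        first[0:length] = first[start:end]
    else:
        # Different buffers: write to the head of the second buffer,
        # then clear the incomplete data from the first buffer.
        second[0:length] = first[start:end]
        first[start:end] = bytes(length)
    return length


# Different buffers: the complete data in buf_a ends at index 8.
buf_a = bytearray(b"1,a;2,b;3,")
buf_b = bytearray(10)
n = carry_over(buf_a, 8, 10, buf_b)

# Same buffer: the tail is moved to the head of the same bytearray.
buf_c = bytearray(b"1,a;2,b;3,")
m = carry_over(buf_c, 8, 10, buf_c)
```

In both cases the next read would continue writing at offset `n` (or `m`), directly after the carried-over bytes, so that the two halves of the record become contiguous.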

As described above, in an embodiment, the data in the database file may be semi-structured data, and the first thread may judge whether the current buffer segment it has read is complete according to a preset data integrity rule and perform the delimiting operation according to the judgment result. The data integrity rule includes: if the tail of the buffer corresponding to the current buffer segment read by the first thread is a line separator, the data of the buffer segment is determined to be complete; otherwise, it is determined to be incomplete. The integrity rule follows from the format rules of the database file: for example, if the format specifies that each row of the database file is one complete piece of data, the integrity rule can check whether the tail of the buffer corresponding to the current buffer segment is a line separator. Those skilled in the art will understand that the database file may be in Comma-Separated Values (CSV), JavaScript Object Notation (JSON), Extensible Markup Language (XML), or a similar format. The data integrity rules for database files of different formats are substantially the same and differ mainly in how the line separator is determined: for example, the line separator of a CSV file may be "\n", that of a JSON file may be the corresponding closing brace "}", and that of an XML file may be the corresponding end tag, which is not limited in this specification.
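The format-dependent integrity rule can be sketched as a lookup of the record terminator followed by a check on the tail of the segment. The sketch is deliberately naive and rests on assumptions: the mapping `RECORD_TERMINATORS`, the function name `is_complete`, and the XML end tag "</row>" are hypothetical, and the check ignores cases such as a newline inside a quoted CSV field or a "}" that closes a nested JSON object.

```python
# Hypothetical mapping from file format to the token that ends one record.
# "</row>" is an assumed XML end tag; a real file defines its own element name.
RECORD_TERMINATORS = {"csv": "\n", "json": "}", "xml": "</row>"}


def is_complete(segment: str, file_format: str) -> bool:
    """Integrity rule: the segment is complete only if it ends with a record terminator."""
    return segment.endswith(RECORD_TERMINATORS[file_format])


checks = [
    is_complete("1,a\n2,b\n", "csv"),                        # ends on a line separator
    is_complete("1,a\n2,", "csv"),                           # cut in the middle of a row
    is_complete('{"id": 1}', "json"),                        # ends on a closing brace
    is_complete("<row><id>1</id></row><row><id>", "xml"),    # cut inside an element
]
```

When the check fails, the delimiting operation falls back to searching for the last terminator inside the segment and treating everything after it as remaining incomplete data.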

In summary, because the first thread performs the delimiting operation on the buffer segments of the buffer according to the data integrity rule, the semi-structured data is divided into multiple pieces of complete data during reading, and each piece is processed directly by a second thread and sent to the distributed database. This solves the problems in the related art that some database files containing semi-structured data cannot be split directly and that split files occupy additional disk space. As can be seen from the foregoing embodiments, performing the delimiting operation on buffer segments in the buffer avoids the resource consumption of splitting the file separately and the disk space needed to store the split files. At the same time, a single database file in a format such as CSV is successfully divided according to the data integrity rule, and each piece of complete data is processed in parallel by the second threads. In addition, the streaming read reduces the number of times the database file is read and improves the overall processing efficiency of the database import task.

FIG. 7 is a schematic block diagram of an electronic device in an exemplary embodiment. Referring to fig. 7, at the hardware level, the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile memory, but may also include other required hardware. The processor reads the corresponding computer program from the nonvolatile memory into the memory and then runs the computer program to form the file importing device of the distributed database on the logic level. Of course, besides the software implementation, the present specification does not exclude other implementations, such as logic devices or a combination of software and hardware, and the like, that is, the execution subject of the following processing flow is not limited to each logic unit, and may be hardware or logic devices.

Corresponding to the foregoing embodiment of the method for importing a file of a distributed database, this specification further provides an embodiment of a device for importing a file of a distributed database.

Referring to fig. 8, fig. 8 is a schematic structural diagram of a file importing apparatus of a distributed database according to an exemplary embodiment. As shown in fig. 8, in a software implementation, the apparatus may include:

a buffer segment reading unit 801, configured to, in response to a database import task for a database file, read, by a first thread, a buffer segment of the database file from a buffer, where each buffer segment includes partial data of the database file;

a buffer segment processing unit 802, configured to cause the first thread to, for each buffer segment read: perform a delimiting operation to determine the complete data and the remaining incomplete data in the buffer segment, create a corresponding database import subtask for the complete data, allocate the database import subtask to a subtask processing thread group, and record the remaining incomplete data so that it forms complete data in the next buffer segment read by the first thread;

a database import subtask processing unit 803, configured to cause a second thread, upon determining that it belongs to the subtask processing thread group, to process the allocated database import subtask so as to send the complete data indicated by the database import subtask to the corresponding node in the distributed database.

Optionally, the buffer segment processing unit 802 is specifically configured to: cause the first thread, in the case that the read buffer segment has remaining incomplete data, to add the incomplete data to the head of the buffer corresponding to the next buffer segment read by the first thread.

Optionally, the buffer corresponding to the current buffer segment read by the first thread is a first buffer, and the buffer corresponding to the next buffer segment read is a second buffer; the buffer segment processing unit 802 is specifically configured to:

under the condition that the first buffer area and the second buffer area are the same, moving the remaining incomplete data to the head of the buffer area where the incomplete data is located currently;

and in the case that the first buffer area is different from the second buffer area, writing the remaining incomplete data into the head of the second buffer area and clearing the incomplete data from the first buffer area.

Optionally, the data in the database file is semi-structured data, and the database file is in any one of the following formats: Comma-Separated Values (CSV) format, JavaScript Object Notation (JSON) format, and Extensible Markup Language (XML) format.

Optionally, the database file relates to a plurality of partitions of the distributed database on different nodes; the database import subtask processing unit 803 is specifically configured to:

parsing the complete data and performing data type conversion;

performing partition calculation on the processed complete data to obtain partition data fragments corresponding to each partition in the complete data;

serializing each partition data segment, and respectively sending the serialized partition data segments to the nodes where the corresponding partitions are located, so that each node can import the received partition data segments into the distributed database.

Optionally, the database import subtask processing unit 803 is specifically configured to: directly sending the partition data segments to the nodes where the corresponding partitions are located;

or,

storing the partition data fragments into preset spaces corresponding to the corresponding partitions;

and under the condition that the preset space meets preset sending conditions, sending the data fragments of the partitions in the preset space to the nodes where the corresponding partitions are located in batches.
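The second option above (accumulating partition data segments in a preset space and sending them in batches) can be sketched as follows. This is an illustration under assumptions: the class name `PartitionBatcher`, the `send` callback, and the use of a record-count threshold as the "preset sending condition" are hypothetical; this specification does not prescribe a particular condition.

```python
from typing import Callable, Dict, List


class PartitionBatcher:
    """Collects partition data segments and sends them once a threshold is reached."""

    def __init__(self, send: Callable[[int, List[str]], None], max_records: int):
        self._send = send                 # hypothetical callback that ships a batch to a node
        self._max = max_records           # assumed sending condition: record count
        self._pending: Dict[int, List[str]] = {}

    def add(self, partition: int, fragment: List[str]) -> None:
        bucket = self._pending.setdefault(partition, [])
        bucket.extend(fragment)
        if len(bucket) >= self._max:      # the preset space meets the sending condition
            self.flush(partition)

    def flush(self, partition: int) -> None:
        bucket = self._pending.pop(partition, [])
        if bucket:
            self._send(partition, bucket)


sent = []
batcher = PartitionBatcher(send=lambda p, recs: sent.append((p, list(recs))), max_records=3)
batcher.add(1, ["r1", "r4"])   # below the threshold: held back
batcher.add(1, ["r7"])         # threshold reached: one batch for partition 1
batcher.add(2, ["r2"])         # held back
batcher.flush(2)               # explicit flush, e.g. at the end of the import task
```

Batching trades a small delay for fewer network round trips to each node, whereas direct sending keeps latency low at the cost of more messages.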

Optionally, the buffer segment is data written into the buffer after the first thread performs streaming reading on the database file.

Referring to fig. 9, fig. 9 is a schematic structural diagram of a file importing apparatus of a distributed database according to an exemplary embodiment. As shown in fig. 9, in a software implementation, the apparatus may include:

a buffer segment reading unit 901, configured to enable a first thread to read, in response to a database import task for a database file, buffer segments of the database file from a buffer, where each buffer segment contains partial data of the database file;

a buffered segment read unit 902 for causing the first thread to, for each buffered segment read: executing delimitation operation to determine complete data and residual incomplete data in the buffer segment, creating a corresponding database import subtask aiming at the complete data and distributing the database import subtask to a subtask processing thread group, so that a second thread processes the distributed database import subtask under the condition that the second thread determines that the second thread belongs to the subtask processing thread group, and sends the complete data indicated by the database import subtask to a corresponding node in the distributed database; and recording the remaining incomplete data for constituting complete data in a later buffer segment read by the first thread.

Optionally, the buffered segment reading unit 902 is specifically configured to:

causing the first thread, in the case that the read buffer segment has remaining incomplete data, to add the incomplete data to the head of the buffer corresponding to the next buffer segment read by the first thread.

Optionally, the buffer area corresponding to the current buffer segment read by the first thread is a first buffer area, and the buffer area corresponding to the read subsequent buffer segment is a second buffer area; the buffer segment reading unit 902 is specifically configured to:

Optionally, the data in the database file is semi-structured data; the buffer segment reading unit 902 is specifically configured to:

judging whether the read current buffer segment is complete according to a preset data integrity rule, wherein the data integrity rule comprises the following steps: under the condition that the tail of a buffer area corresponding to the current buffering segment read by the first thread is a line separator, determining that the data of the buffering segment is complete, otherwise, determining that the data of the buffering segment is incomplete;

and executing delimitation operation according to the judgment result.

Optionally, the database file is in any one of the following formats: Comma-Separated Values (CSV) format, JavaScript Object Notation (JSON) format, and Extensible Markup Language (XML) format.

The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.

For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution in the specification. One of ordinary skill in the art can understand and implement it without inventive effort.

Embodiments of the subject matter and the functional operations described in this specification can be implemented in: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and their structural equivalents, or a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by the data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.

The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).

Computers suitable for the execution of a computer program include, for example, general and/or special purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory and/or a random access memory. The basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer does not necessarily have such a device. Moreover, a computer may be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.

Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., an internal hard disk or a removable disk), magneto-optical disks, and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. In other instances, features described in connection with one embodiment may be implemented as discrete components or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Thus, particular embodiments of the subject matter have been described. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.

The above description is only a preferred embodiment of the present disclosure, and should not be taken as limiting the present disclosure, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims

1. A file import method of a distributed database comprises the following steps:

the first thread, for each buffered segment read: executing delimitation operation to determine complete data and residual incomplete data in the buffer segment, creating a corresponding database import subtask aiming at the complete data, distributing the database import subtask to a subtask processing thread group, and recording the residual incomplete data to form the complete data in the next buffer segment read by the first thread;

and a second thread, in the case that it determines that it belongs to the subtask processing thread group, processes the allocated database import subtask to send the complete data indicated by the database import subtask to the corresponding node in the distributed database.

2. The method of claim 1, the recording the remaining incomplete data comprising:

3. The method of claim 2, wherein the buffer corresponding to the current buffer segment read by the first thread is a first buffer, and the buffer corresponding to the next buffer segment read by the first thread is a second buffer; the adding the incomplete data to the head of the buffer corresponding to the next buffer segment read by the first thread includes:

under the condition that the first buffer area is the same as the second buffer area, moving the remaining incomplete data to the head of the buffer area where the incomplete data is located;

4. The method of claim 1, the data in the database file being semi-structured data; the first thread performs a delimitation operation comprising:

and executing delimitation operation according to the judgment result.

5. The method of claim 1, the database file being in any one of the following formats: Comma-Separated Values (CSV) format, JavaScript Object Notation (JSON) format, and Extensible Markup Language (XML) format.

6. The method of claim 1, the database file relating to a plurality of partitions of the distributed database on different nodes; the second thread processing the allocated database import subtask, including:

parsing the complete data and performing data type conversion;

7. The method according to claim 6, wherein the sending the serialized partition data segments to the nodes where the corresponding partitions are located respectively comprises:

directly sending the partition data segments to the nodes where the corresponding partitions are located;

or,

8. The method of claim 1, the buffered segment being data written to the buffer by the first thread after streaming a read of the database file.

9. A file import method of a distributed database comprises the following steps:

the first thread, for each buffered segment read: executing delimitation operation to determine complete data and residual incomplete data in the buffer segment, creating a corresponding database import subtask aiming at the complete data and distributing the database import subtask to a subtask processing thread group, so that a second thread processes the distributed database import subtask under the condition that the second thread determines that the second thread belongs to the subtask processing thread group, and sends the complete data indicated by the database import subtask to a corresponding node in the distributed database; and recording the remaining incomplete data for constituting complete data in a later buffer segment read by the first thread.

10. A file importing device of a distributed database comprises:

a buffer segment processing unit to cause the first thread to, for each buffer segment read: executing delimitation operation to determine complete data and residual incomplete data in the buffer segment, creating a corresponding database import subtask aiming at the complete data, distributing the database import subtask to a subtask processing thread group, and recording the residual incomplete data to form the complete data in the next buffer segment read by the first thread;

11. A file importing device of a distributed database comprises:

12. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 9.

13. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method according to any one of claims 1 to 9 when executing the program.