CN110442645B - Data indexing method and device - Google Patents

Info

Publication number
CN110442645B
Authority
CN
China
Prior art keywords
data
fragment
target
index file
main fragment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910627141.9A
Other languages
Chinese (zh)
Other versions
CN110442645A (en)
Inventor
谷凯凯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New H3C Big Data Technologies Co Ltd
Original Assignee
New H3C Big Data Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by New H3C Big Data Technologies Co Ltd
Priority to CN201910627141.9A
Publication of CN110442645A
Application granted
Publication of CN110442645B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2228Indexing structures
    • G06F16/2272Management thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F16/278Data partitioning, e.g. horizontal or vertical partitioning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the application provide a data indexing method and device. The target main fragment to which each data record to be inserted belongs is determined according to the fragmentation rule of a specified index file, and the target main fragment is backed up locally from the target node where it is located. A data insertion process that is independent of the ElasticSearch service process inserts the data records belonging to the target main fragment into the backed-up target main fragment to obtain a target new fragment, and the target main fragment on the target node is replaced by the target new fragment. The data insertion process and the ElasticSearch service processes in the cluster thus remain independent of each other, and data insertion efficiency can be improved.

Description

Data indexing method and device
Technical Field
The application relates to the technical field of big data, in particular to a data indexing method and device.
Background
ElasticSearch is currently the most popular full-text search engine and stores data in indexes. An index may comprise a plurality of fragments (shards) distributed over a plurality of ElasticSearch nodes of an ElasticSearch cluster. To keep data reliable, at least one copy fragment (replica shard) is usually set for each fragment of the index, and the original fragment is called the main fragment (primary shard). A main fragment and its corresponding copy fragments are typically deployed on different ElasticSearch nodes.
In some application scenarios, data records in a third-party database need to be inserted into an index file of an ElasticSearch cluster for storage. In the prior art, a data insertion request is sent to an ElasticSearch node in the cluster through a client provided by ElasticSearch. Only after every data record in the request has been inserted into the main fragment to which it belongs and into the corresponding copy fragment does the client receive response information indicating that the request is complete, at which point query service for the newly inserted data records can be provided externally.
However, the main fragment and the copy fragment to which a data record in the request belongs are usually located on different ElasticSearch nodes. Each data record is usually inserted into the main fragment by the ElasticSearch node where the main fragment is located, and that node then notifies the ElasticSearch node where the copy fragment is located to insert the data record into the copy fragment. Inserting data into the ElasticSearch cluster therefore requires a large number of network I/O operations, is time-consuming and inefficient, and interferes with the query service provided by the ElasticSearch nodes.
Disclosure of Invention
In view of the above, an objective of the present application is to provide a data indexing method and device that address the above problems.
In order to achieve the above purpose, the embodiments of the present application adopt the following technical solutions:
In a first aspect, an embodiment of the present application provides a data indexing method, where the method includes:
acquiring a plurality of data records to be inserted into a specified index file;
acquiring metadata information of the specified index file from an ElasticSearch cluster, determining a fragmentation rule of the specified index file according to the metadata information, and determining, according to the fragmentation rule, a target main fragment to which each data record to be inserted belongs from a plurality of main fragments included in the specified index file;
for each determined target main fragment, backing up the target main fragment locally from the target node in the ElasticSearch cluster on which the target main fragment is stored, through a data insertion process that is independent of the ElasticSearch service processes in the ElasticSearch cluster, inserting each data record to be inserted that belongs to the target main fragment into the backed-up target main fragment to obtain a target new fragment, and replacing the target main fragment on the target node with the target new fragment.
In a second aspect, an embodiment of the present application provides a data indexing apparatus, including:
a data splitting module, configured to acquire a plurality of data records to be inserted into a specified index file; acquire metadata information of the specified index file from an ElasticSearch cluster, determine a fragmentation rule of the specified index file according to the metadata information, and determine, according to the fragmentation rule, a target main fragment to which each data record to be inserted belongs from a plurality of main fragments included in the specified index file;
and a data insertion module, configured to back up the target main fragment locally from the target node in the ElasticSearch cluster on which the target main fragment is stored, through a data insertion process that is independent of the ElasticSearch service processes in the ElasticSearch cluster, insert each data record to be inserted that belongs to the target main fragment into the backed-up target main fragment to obtain a target new fragment, and replace the target main fragment on the target node with the target new fragment.
Compared with the prior art, in the data indexing method and device provided by the embodiments of the application, for the data records to be inserted that belong to the same main fragment, the main fragment is backed up locally, and the data records are inserted into the specified index file through a data insertion process that is independent of the ElasticSearch service processes in the ElasticSearch cluster. On one hand, the data insertion process and the ElasticSearch service processes in the ElasticSearch cluster are mutually independent, so that data insertion neither affects nor is constrained by the data query service provided by the ElasticSearch service processes. On the other hand, data transmission among the ElasticSearch service processes during data insertion is avoided, which improves data insertion efficiency.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
Fig. 1 is a schematic architecture diagram of an ElasticSearch cluster provided in an embodiment of the present application;
Fig. 2 is a schematic flowchart of a data indexing method according to an embodiment of the present application;
Fig. 3 is a schematic flowchart of data insertion in an example provided by an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a target server according to an embodiment of the present application;
Fig. 5 is a schematic functional block diagram of a data indexing device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
Referring to Fig. 1, Fig. 1 is a schematic architecture diagram of an ElasticSearch (ES) cluster 10 according to an embodiment of the present application. The ES cluster 10 is deployed on a plurality of servers (or hosts) communicatively connected to each other. Each server runs one or more ES service processes, and each ES service process is one ES node of the ES cluster. For example, as shown in Fig. 1, the ES cluster 10 includes ES nodes 0, 1, and 2.
The ES provides a Client, e.g., an ES Client API interface, an ES Client REST interface, etc., that enables communication with the ES service process, through which the user-side device can communicate with the ES cluster 10.
Referring to fig. 2, fig. 2 is a schematic flow chart of a data indexing method provided in an embodiment of the present application, where the method may be applied to a server deployed with any ES node, and may also be applied to a server independent from an ES cluster. Some steps of the data indexing method may be performed by a server independent from the ES cluster, and another part of steps may be performed by a server deployed with the ES node, which is not limited in this embodiment. For convenience of description, a server that performs the data indexing method provided by the present embodiment will be hereinafter defined as a "target server".
The respective steps of the data indexing method will be described below.
Step S21, acquiring a plurality of data records to be inserted into a specified index file.
Step S22, determining the fragmentation rule of the specified index file according to the metadata information of the specified index file acquired from the ES cluster, and determining, according to the fragmentation rule, the target main fragment to which each data record to be inserted belongs from the plurality of main fragments included in the specified index file.
In the present embodiment, the ES cluster 10 stores at least one index file, such as the index file A in the above example. In implementation, the plurality of data records to be inserted may be predetermined according to actual conditions and obtained from a corresponding external database or a third-party storage system (e.g., HDFS, a Hive table, etc.). In addition, the index file into which the plurality of data records need to be inserted can also be predetermined, and the determined index file is the specified index file.
In this embodiment, each ES node stores metadata information of the ES cluster, and the metadata information stored by each ES node includes metadata information of each index file in the ES cluster.
In the implementation process, the target server may obtain the metadata information of the specified index file from any ES node in advance, where the metadata information includes the identifier of each fragment in the specified index file, the number of fragments, the distribution information of the fragments, the fragmentation rule, and the like. In one case, if the target server is a server on which an ES node is deployed, the metadata information of the specified index file may be acquired from the ES node on the target server itself.
The fragmentation rule is also called a routing rule or a routing policy, and a routing field is specified in the routing rule and can be set by a user in a customized manner. Further, in a case where the user does not define the routing field, an id field of the data record may be adopted as the routing field by default.
When the target server acquires the plurality of data records to be inserted, it can determine, according to the metadata information acquired in advance, whether the specified index file exists in the ES cluster 10. If so, it determines the routing field used by the routing policy of the specified index file according to the metadata information of the specified index file, and then determines the main fragment to which each data record to be inserted belongs according to the value of that record's routing field; the determined main fragment is the target main fragment. For example, the value of the routing field of each data record to be inserted may be hashed, and the remainder of the hash value divided by the number of main fragments included in the specified index file may be taken; the fragment indicated by the remainder is the target main fragment to which the data record belongs.
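The hash-and-remainder routing described above can be sketched as follows. This is only an illustration under stated assumptions: `zlib.crc32` is a stand-in hash chosen because it is deterministic and is not the hash ElasticSearch uses internally, and the names `target_fragment` and `route_records` are hypothetical.

```python
import zlib


def target_fragment(routing_value, num_main_fragments):
    """Return the index of the main fragment that a routing value maps to."""
    if num_main_fragments <= 0:
        raise ValueError("num_main_fragments must be positive")
    # Assumption: crc32 stands in for whatever hash the cluster really uses.
    hash_value = zlib.crc32(str(routing_value).encode("utf-8"))
    # The remainder identifies the target main fragment.
    return hash_value % num_main_fragments


def route_records(records, routing_field="id", num_main_fragments=6):
    """Pair every record with the index of its target main fragment."""
    return [
        (target_fragment(record[routing_field], num_main_fragments), record)
        for record in records
    ]
```

Because the result depends only on the routing value and the number of main fragments, the same record is always assigned to the same target main fragment, which is what allows records to be grouped per fragment before insertion.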
Step S23, for each determined target main fragment, through a data insertion process that is independent of the ElasticSearch service processes in the ElasticSearch cluster, backing up the target main fragment locally from the target node in the ElasticSearch cluster on which the target main fragment is stored, inserting each data record to be inserted that belongs to the target main fragment into the backed-up target main fragment to obtain a target new fragment, and replacing the target main fragment on the target node with the target new fragment.
The data insertion process and each ES service process in the ES cluster 10 are independent from each other and do not interfere with each other. The data insertion process may be run on the same server as the ES service process, or may be run on a different server, which is not limited in this embodiment.
In this embodiment, the pre-obtained metadata information of the specified index file includes a fragment array corresponding to each ES node, where the fragment array records the identifiers of the main fragments managed by that ES node. For example, in the foregoing example, the fragment array corresponding to ES node 0 includes the two identifiers 0 and 3, the fragment array corresponding to ES node 1 includes the two identifiers 1 and 4, and the fragment array corresponding to ES node 2 includes the two identifiers 2 and 5. Thus, the ES node whose fragment array contains the obtained remainder can be determined as the ES node managing the target main fragment, and the server where that ES node is located is the target node where the target main fragment is located.
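The lookup from remainder to target node can be illustrated with a small sketch. The dictionary below mirrors the fragment arrays from the example (ES node 0 manages 0 and 3, and so on); the function name `find_target_node` and the representation of the metadata as a plain dictionary are assumptions made for illustration.

```python
from typing import Dict, List

# Fragment arrays taken from the example: node name -> managed main fragment ids.
FRAGMENT_ARRAYS: Dict[str, List[int]] = {
    "ES node 0": [0, 3],
    "ES node 1": [1, 4],
    "ES node 2": [2, 5],
}


def find_target_node(fragment_arrays: Dict[str, List[int]], fragment_id: int) -> str:
    """Return the node whose fragment array contains the given fragment id."""
    for node, fragment_ids in fragment_arrays.items():
        if fragment_id in fragment_ids:
            return node
    raise KeyError(f"no node manages main fragment {fragment_id}")
```

For instance, `find_target_node(FRAGMENT_ARRAYS, 4)` returns `"ES node 1"`, matching the assignment given in the example.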
Each time a target main fragment and the target node where it is located are determined, a copy of the target main fragment may be copied from the target node to local storage; this copy is the backed-up target main fragment of step S23. The target main fragment to be backed up includes multiple parts, for example the fragment data, the index state data (state), and the fragment log file (translog) of the target main fragment. The fragment log file translog records operations that have not yet been persisted to disk.
To ensure the completeness of the target main fragment, before each data record to be inserted that belongs to the target main fragment is inserted into the backed-up target main fragment, the data insertion process may call a Recovery API (Application Programming Interface) provided inside the ES to apply the data recorded in the fragment log file to the backed-up target main fragment.
In this embodiment, after the data in the fragment log file translog has been applied to the backed-up target main fragment, the data insertion process may build index data for each data record to be inserted that belongs to the target main fragment, call a main fragment write API inside the ES, and write the built index data into the backed-up target main fragment. The main fragment write API may be the shardOperationOnPrimary interface in the TransportShardBulkAction class provided by the ES.
Through the steps shown in Fig. 2, the target main fragment on the target node is backed up locally, the data records to be inserted that belong to the target main fragment are inserted through a data insertion process independent of the ES service processes, and the target main fragment on the target node is replaced by the resulting target new fragment, so that the data insertion process and the services provided by the ES service processes are independent and do not interfere with each other.
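The back up, insert, and replace sequence of step S23 can be sketched on plain directories. This is a simplified illustration, not the ES-internal procedure: a JSON-lines file named `data.jsonl` stands in for real fragment data, no translog recovery or ES write API is involved, and all function names and paths are hypothetical.

```python
import json
import os
import shutil
import tempfile

DATA_FILE = "data.jsonl"  # hypothetical stand-in for the real fragment files


def backup_fragment(node_fragment_dir, local_root):
    """Copy the fragment directory from the target node's path to local storage."""
    name = os.path.basename(os.path.normpath(node_fragment_dir))
    local_copy = os.path.join(local_root, name + ".backup")
    shutil.copytree(node_fragment_dir, local_copy)
    return local_copy


def insert_records(local_copy, records):
    """Append records to the backed-up copy, which becomes the target new fragment."""
    with open(os.path.join(local_copy, DATA_FILE), "a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return local_copy


def replace_fragment(node_fragment_dir, new_fragment_dir):
    """Swap the new fragment into place and discard the old one afterwards."""
    old_dir = node_fragment_dir + ".old"
    os.rename(node_fragment_dir, old_dir)
    shutil.move(new_fragment_dir, node_fragment_dir)
    shutil.rmtree(old_dir)


def demo():
    """Run the three steps in a temporary directory and return the final records."""
    root = tempfile.mkdtemp()
    try:
        node_dir = os.path.join(root, "node0", "fragment0")
        os.makedirs(node_dir)
        with open(os.path.join(node_dir, DATA_FILE), "w", encoding="utf-8") as f:
            f.write(json.dumps({"id": "existing"}) + "\n")
        local_copy = backup_fragment(node_dir, os.path.join(root, "local"))
        new_fragment = insert_records(local_copy, [{"id": "0"}])
        replace_fragment(node_dir, new_fragment)
        with open(os.path.join(node_dir, DATA_FILE), encoding="utf-8") as f:
            return [json.loads(line) for line in f]
    finally:
        shutil.rmtree(root)
```

Writing into the local copy rather than the live directory is the point of the design: the fragment on the target node is touched only once, at the final swap.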
Moreover, because the data insertion is carried out locally, the low insertion efficiency caused by interaction among the ES service processes during data insertion (that is, multiple network I/O operations) is avoided. In addition, with this design, the newly inserted data records can be retrieved as soon as the target main fragment on the target node has been replaced by the target new fragment, without waiting for the copy fragments to be updated, which reduces the waiting time of the user and improves the user experience.
Optionally, in an implementation manner in this embodiment, one data insertion process may be used to implement the insertion of multiple data records to be inserted.
In still another embodiment, in order to further improve efficiency, a plurality of data insertion processes corresponding respectively to the determined plurality of target main fragments may be created, and the plurality of data insertion processes execute the above step S23 in parallel.
Further, in order to reduce the transmission of data over the network, each of the plurality of data insertion processes may be created on the target node where its corresponding target main fragment is located. For example, assuming that a target main fragment 0 is determined from the ES cluster 10 and the target main fragment 0 is managed by the ES node 0, a data insertion process may be created on the server where the ES node 0 is located, and the data records to be inserted that belong to the target main fragment 0 may be inserted into the backed-up target main fragment 0 through that data insertion process. In this way, when the target main fragment 0 is backed up, it can be copied and stored directly on the local machine, and after the target new fragment is obtained it can directly replace the local target main fragment, so that neither the copied target main fragment 0 nor the target new fragment obtained after data insertion needs to be transmitted over the network.
In this embodiment, a computing framework may be adopted to implement the obtaining and splitting of the data records to be inserted, that is, step S21 and step S22 may be implemented by adopting a computing framework. The computing framework may be, but is not limited to, MapReduce, Flink, Kafka streaming computing framework, and the like.
In implementation, the insertion of the data records to be inserted that belong to the same main fragment may be assigned to one logical task of the computing framework. Taking MapReduce as an example, the plurality of data records to be inserted may be split at the Map stage according to the value of the routing field of each data record, which may be implemented by a Mapper; the data records to be inserted that belong to the same main fragment are then gathered into one Reducer for processing, where the number of Reducers is the same as the number of main fragments included in the specified index file. Further, the data insertion process of step S23 may also be performed by the Reducer.
Taking Flink as an example, the data records to be inserted that belong to the same main fragment can be processed in the same process. Taking Kafka streaming computation as an example, the data records to be inserted that belong to the same main fragment may be assigned to the same partition for processing.
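The MapReduce case described above can be imitated in a few lines of plain Python to show how records are split and then gathered per main fragment. This is a sketch of the data flow only and does not use any MapReduce framework API; `zlib.crc32` is again a stand-in hash, and `map_record`, `shuffle`, and `reduce_fragment` are hypothetical names.

```python
import zlib
from collections import defaultdict


def map_record(record, routing_field, num_main_fragments):
    """Map stage: emit (target main fragment id, record) for one record."""
    routing_value = str(record[routing_field]).encode("utf-8")
    return zlib.crc32(routing_value) % num_main_fragments, record


def shuffle(mapped_pairs):
    """Gather all records with the same fragment id into one group."""
    groups = defaultdict(list)
    for fragment_id, record in mapped_pairs:
        groups[fragment_id].append(record)
    return groups


def reduce_fragment(fragment_id, records):
    """Reduce stage: placeholder for the per-fragment insertion of step S23.

    Here it only reports how many records would be inserted.
    """
    return fragment_id, len(records)
```

A run over a batch would call `map_record` on every record, pass the pairs to `shuffle`, and then invoke `reduce_fragment` once per group, so that each reducer handles exactly one main fragment.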
Optionally, the memory of the ES cluster usually holds data loaded from each main fragment and its corresponding copy fragment, so after the target main fragment on the target node is replaced with the target new fragment, the data in the memory of the ES cluster no longer matches the actual data of the specified index file. Therefore, the data indexing method provided by this embodiment may further include the following step:
synchronizing the data in the memory of the ES cluster with the specified index file when at least one main fragment in the specified index file has been replaced.
Specifically, the data in the memory of the ES cluster may be synchronized with the specified index file when any one target main fragment has been replaced by its corresponding target new fragment; or when all target main fragments have been replaced by their corresponding target new fragments; or when some of the target main fragments have been replaced by their corresponding target new fragments, which is not limited in this embodiment.
In detail, when an index file in the ES cluster 10 is opened, part of the data in each main fragment and each copy fragment of the index file is loaded into the memory of the ES cluster 10. Correspondingly, when the index file is closed, the data of the index file in the memory of the ES cluster 10 is released. Based on this, the synchronization of the data in the memory of the ES cluster 10 with the specified index file can be achieved by the following steps:
First, when at least one main fragment in the specified index file has been replaced, the specified index file is closed, and its data in the memory of the ES cluster 10 is released.
Then, the specified index file is opened, and the data of each main fragment and each copy fragment of the specified index file is reloaded into the memory, so that the data in the memory of the ES cluster 10 is synchronized with the specified index file.
The data insertion process may call an offline API inside the ES to close the specified index file, and call an online API inside the ES to open the closed specified index file. Specifically, the offline API may be, for example, http://{cluster IP address}:{port number}/indexname/_close, and the online API may be, for example, http://{cluster IP address}:{port number}/indexname/_open, where indexname is the name of the specified index file.
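The close-then-open sequence can be sketched with the Python standard library: the index is taken offline with `_close` and brought back online with `_open`. The helper names, the injectable `post` parameter, and the sample host, port, and index name in the usage note are assumptions made for illustration; authentication and error handling are omitted.

```python
import urllib.request


def index_action_url(host, port, index_name, action):
    """Build the URL for closing (_close) or opening (_open) an index."""
    if action not in ("_close", "_open"):
        raise ValueError("action must be '_close' or '_open'")
    return f"http://{host}:{port}/{index_name}/{action}"


def http_post(url):
    """Send an empty POST request and return the HTTP status code."""
    request = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(request) as response:
        return response.status


def reload_index(host, port, index_name, post=http_post):
    """Close the index to release its in-memory data, then reopen it."""
    closed_status = post(index_action_url(host, port, index_name, "_close"))
    opened_status = post(index_action_url(host, port, index_name, "_open"))
    return closed_status, opened_status
```

A call such as `reload_index("127.0.0.1", 9200, "index_a")` would issue the two requests in order against a cluster listening on that address. The `post` parameter exists so the sequence can be exercised without a running cluster.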
In this embodiment, when the specified index file is opened, each ES service process in the ES cluster 10 reloads each main fragment of the specified index file and part of data in the corresponding copy fragment into the memory, each ES service process detects the loaded main fragment and the corresponding copy fragment in the loading process, and if it is detected that data of any main fragment is inconsistent with data of the corresponding copy fragment, an internal copy mechanism of the ES cluster 10 itself is triggered, thereby implementing data synchronization of the main fragment and the copy fragment.
The data indexing method provided in this embodiment is further described below by taking the above-mentioned 6 data records to be inserted, [0,1,2,3,4,5], as an example, in combination with the ES cluster 10 shown in Fig. 1 and the flowchart shown in Fig. 3.
Firstly, the target server determines, according to the acquired metadata information of the ES cluster 10, whether the index file A (i.e., the specified index file) has been created in the ES cluster 10. If so, it further obtains the metadata information of the index file A, which contains the fragmentation rule of the index file A.
Secondly, the target server acquires the data records to be inserted [0,1,2,3,4,5] one by one through a computing framework, determines according to the pre-acquired metadata information of the index file A that the routing field used by the routing rule of the index file A is the id field, hashes the value of the id field of each acquired data record, and takes the remainder of the obtained hash value divided by 6, the number of main fragments included in the index file A.
For example, the remainder obtained for the data record 0 to be inserted is 0, which indicates that the data record 0 belongs to the main fragment 0; similarly, it may be determined that the data record 1 belongs to the main fragment 1, the data record 2 belongs to the main fragment 2, the data record 3 belongs to the main fragment 3, the data record 4 belongs to the main fragment 4, and the data record 5 belongs to the main fragment 5.
Thirdly, according to the metadata information of the index file A, it is determined that the main fragments 0 and 3 are managed by the ES node 0, the main fragments 1 and 4 are managed by the ES node 1, and the main fragments 2 and 5 are managed by the ES node 2.
Fourthly, for the main fragment 0 (i.e., a target main fragment), a data insertion process P0 independent of the ES service processes is created. Through the data insertion process P0, the main fragment 0 is backed up from the server where the ES node 0 is located and stored locally (for example, as the local backup file shown in Fig. 3); a recovery API inside the ES is called to restore the data in the log file translog of the backed-up main fragment 0 to the backed-up main fragment 0; and a main fragment write API inside the ES is called to build the index information of the data record 0 to be inserted and insert it into the backed-up main fragment 0 to obtain a new fragment 0', after which the main fragment 0 managed by the ES node 0 is replaced with the new fragment 0'.
Fifthly, for the main fragment 1, a data insertion process P1 independent of the ES service processes is created. Through the data insertion process P1, the main fragment 1 is backed up from the server where the ES node 1 is located and stored locally; the recovery API inside the ES is called to restore the data in the log file translog of the backed-up main fragment 1 to the backed-up main fragment 1; and the main fragment write API inside the ES is called to build the index information of the data record 1 to be inserted and insert it into the backed-up main fragment 1 to obtain a new fragment 1', after which the main fragment 1 managed by the ES node 1 is replaced with the new fragment 1'.
Sixthly, for the main fragment 2, a data insertion process P2 independent of the ES service processes is created. Through the data insertion process P2, the main fragment 2 is backed up from the server where the ES node 2 is located and stored locally; the recovery API inside the ES is called to restore the data in the log file translog of the backed-up main fragment 2 to the backed-up main fragment 2; and the main fragment write API inside the ES is called to build the index information of the data record 2 to be inserted and insert it into the backed-up main fragment 2 to obtain a new fragment 2', after which the main fragment 2 managed by the ES node 2 is replaced with the new fragment 2'.
Similarly, the main fragments 3 to 5 can be processed in the same manner, until all of the data records 0 to 5 to be inserted have been inserted into the index file A.
Seventhly, when any main fragment in the index file A has been replaced by a new fragment, the index file A is closed to release its data from the memory of the ES cluster 10, and is then re-opened to reload the data of each main fragment and the corresponding copy fragment of the index file A into the memory of the ES cluster 10.
Eighthly, while the data of each main fragment and the corresponding copy fragment of the index file A is being reloaded into the memory of the ES cluster 10, if the ES service process detects that the data of any main fragment is inconsistent with the data of its corresponding copy fragment, the internal copy mechanism of the ES cluster 10 is triggered.
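The back-up, restore, insert, and replace sequence of the fifth and sixth steps can be illustrated with a minimal sketch. It is only an analogy under stated assumptions: a fragment is modelled as a plain directory, and the file names `segment.jsonl` and `translog.jsonl` and the function `insert_offline` are hypothetical stand-ins, not the on-disk format or the internal APIs of the ES.

```python
import json
import shutil
import tempfile
from pathlib import Path


def insert_offline(node_fragment: Path, work_dir: Path, records: list) -> None:
    """Copy a fragment directory locally, update the copy, then swap it in."""
    # Back up the main fragment from the node to local storage.
    backup = work_dir / (node_fragment.name + ".new")
    shutil.copytree(node_fragment, backup)
    segment = backup / "segment.jsonl"    # hypothetical data file
    log = backup / "translog.jsonl"       # hypothetical log file
    with segment.open("a", encoding="utf-8") as out:
        # Restore the logged data into the backed-up fragment first.
        if log.exists():
            out.write(log.read_text(encoding="utf-8"))
            log.write_text("", encoding="utf-8")
        # Then insert the new records into the backed-up fragment.
        for record in records:
            out.write(json.dumps(record) + "\n")
    # Replace the fragment on the node with the new fragment.
    retired = node_fragment.with_name(node_fragment.name + ".old")
    node_fragment.rename(retired)
    shutil.move(str(backup), str(node_fragment))
    shutil.rmtree(retired)


# Usage with throwaway directories standing in for a node and local storage.
root = Path(tempfile.mkdtemp())
fragment = root / "node1" / "fragment1"
fragment.mkdir(parents=True)
(fragment / "segment.jsonl").write_text('{"id": 0}\n', encoding="utf-8")
(fragment / "translog.jsonl").write_text('{"id": 1}\n', encoding="utf-8")
work = root / "local"
work.mkdir()
insert_offline(fragment, work, [{"id": 2}])
text = (fragment / "segment.jsonl").read_text(encoding="utf-8")
ids = [json.loads(line)["id"] for line in text.splitlines()]
```

In this sketch all writes happen on the local copy, so the directory that the serving process reads is only touched at the final swap.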
Referring to fig. 4, fig. 4 is a block diagram illustrating a structure of a target server 40 according to the present embodiment. Target server 40 includes a processor 41 and a machine-readable storage medium 42. The processor 41 and the machine-readable storage medium 42 may communicate via a system bus 43. Also, the machine-readable storage medium 42 stores machine-executable instructions, and the processor 41 may perform the data indexing method described above by reading and executing the machine-executable instructions in the machine-readable storage medium 42.
The machine-readable storage medium 42 referred to herein may be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions and data. For example, the machine-readable storage medium may be: a RAM (Random Access Memory), a volatile memory, a non-volatile memory, a flash memory, a storage drive (e.g., a hard drive), a solid state drive, any type of storage disk (e.g., an optical disk, a DVD, etc.), or a similar storage medium, or a combination thereof.
Referring to fig. 5, fig. 5 is a block diagram illustrating a data indexing device 50 according to an embodiment of the present disclosure. The data indexing device 50 includes at least one functional module that may be stored in the form of software in the machine-readable storage medium 42. Functionally, the data indexing device 50 may include a data splitting module 51 and a data inserting module 52.
The data splitting module 51 is configured to: obtain multiple data records to be inserted into a specified index file; acquire metadata information of the specified index file from an ElasticSearch cluster; determine a fragmentation rule of the specified index file according to the metadata information; and determine, according to the fragmentation rule, the target main fragment to which each data record to be inserted belongs from among the plurality of main fragments included in the specified index file.
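As an illustration of reading the number of main fragments out of index metadata, the following sketch parses a settings document. The nested shape of `sample` is an assumption modelled on the response of the ElasticSearch get-index-settings REST endpoint, where values arrive as strings; the description does not state which interface supplies the metadata, and the names `primary_fragment_count` and `index_a` are hypothetical.

```python
def primary_fragment_count(settings_response: dict, index_name: str) -> int:
    """Extract the number of main fragments from a settings document."""
    index_settings = settings_response[index_name]["settings"]["index"]
    # The count is assumed to be delivered as a string, so convert it.
    return int(index_settings["number_of_shards"])


# Assumed sample document; a real one would be fetched from the cluster.
sample = {
    "index_a": {
        "settings": {
            "index": {"number_of_shards": "5", "number_of_replicas": "1"}
        }
    }
}
fragment_count = primary_fragment_count(sample, "index_a")
```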
The data insertion module 52 is configured to, for each determined target main fragment, back up the target main fragment from the target node in the ElasticSearch cluster that stores it to local storage through a data insertion process that is independent of the ElasticSearch service process in the ElasticSearch cluster, insert each to-be-inserted data record belonging to the target main fragment into the backed-up target main fragment to obtain a target new fragment, and replace the target main fragment on the target node with the target new fragment.
Optionally, the data insertion module 52 may establish index data of each to-be-inserted data record belonging to the target main fragment, and call a main fragment write API provided by the ElasticSearch cluster to write the index data into the backed-up target main fragment.
Optionally, the fragmentation rule of the specified index file may include the routing field used by the routing policy of the specified index file. In this case, the data splitting module 51 may determine the target main fragment to which each to-be-inserted data record belongs according to the value of the routing field of that record.
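Grouping records by the value of a routing field can be sketched as a hash-then-modulo mapping. This is an illustration only: `zlib.crc32` is a stand-in hash chosen for determinism, not the hash function ElasticSearch uses, so the fragment numbers it produces will not match those of a real cluster. The names `target_fragment` and `group_by_fragment` and the `user` field are hypothetical.

```python
import zlib
from collections import defaultdict


def target_fragment(routing_value: str, num_main_fragments: int) -> int:
    """Map a routing value to a fragment number (stand-in hash, see above)."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_main_fragments


def group_by_fragment(records: list, routing_field: str, num_main_fragments: int) -> dict:
    """Collect the records that belong to the same target main fragment."""
    groups = defaultdict(list)
    for record in records:
        number = target_fragment(str(record[routing_field]), num_main_fragments)
        groups[number].append(record)
    return dict(groups)


records = [
    {"user": "alice", "n": 1},
    {"user": "bob", "n": 2},
    {"user": "alice", "n": 3},
]
groups = group_by_fragment(records, "user", 5)
```

Because the mapping depends only on the routing value, all records that share a value land in the same group, and each group can then be handed to its own data insertion process.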
Optionally, the data indexing device 50 may further include a synchronization module 53.
The synchronization module 53 is configured to: when at least one main fragment in the specified index file has been replaced, close the specified index file and release its data from the ElasticSearch cluster memory; and then open the specified index file and reload the data of each main fragment and each copy fragment of the specified index file into the memory, so that the data in the ElasticSearch cluster memory is synchronized with the specified index file.
Optionally, in the process of reloading the data of each main fragment and each copy fragment into the memory, if it is detected that the data of any main fragment is inconsistent with the data of the copy fragment corresponding to that main fragment, the synchronization module 53 may trigger an internal copy mechanism of the ElasticSearch cluster 10 to synchronize the data of each main fragment and each copy fragment of the specified index file.
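One way to express the close-then-open sequence is as a pair of HTTP requests. The sketch below assumes the ElasticSearch REST endpoints `POST /<index>/_close` and `POST /<index>/_open`; the description does not say which interface the synchronization module uses, and the host, the index name, and the function `reload_requests` are hypothetical. The requests are only constructed here, not sent.

```python
from urllib.request import Request


def reload_requests(base_url: str, index_name: str) -> list:
    """Build the close request followed by the open request for one index."""
    return [
        # Closing the index releases its data from cluster memory.
        Request(f"{base_url}/{index_name}/_close", method="POST"),
        # Re-opening it reloads the fragments from disk.
        Request(f"{base_url}/{index_name}/_open", method="POST"),
    ]


requests = reload_requests("http://localhost:9200", "index_a")
summary = [(r.get_method(), r.full_url) for r in requests]
```

Sending each request in order, for example with `urllib.request.urlopen`, and waiting for the close to be acknowledged before issuing the open would preserve the ordering the text relies on.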
Optionally, the backed-up target main fragment may include a fragment log file, in which case the data indexing device 50 may also include an update module 54.
The update module 54 is configured to update the data recorded in the fragment log file into the backed-up target main fragment before each to-be-inserted data record belonging to the target main fragment is inserted into the backed-up target main fragment.
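Why the log is applied before the new records are inserted can be seen in a small in-memory sketch. The operation format (`type`, `id`, `source`) and the function names are hypothetical and chosen only for illustration; they do not describe the actual fragment log format.

```python
def apply_log(docs: dict, log_ops: list) -> None:
    """Apply logged operations, in order, to the fragment contents."""
    for op in log_ops:
        if op["type"] == "index":
            docs[op["id"]] = op["source"]
        elif op["type"] == "delete":
            docs.pop(op["id"], None)


def update_then_insert(committed: dict, log_ops: list, new_records: list) -> dict:
    """Bring the backed-up copy up to date, then add the new records."""
    docs = dict(committed)          # work on a copy, like the local backup
    apply_log(docs, log_ops)        # first: data recorded in the log
    for record in new_records:      # second: the records to be inserted
        docs[record["id"]] = record["source"]
    return docs


committed = {"1": {"v": 1}, "2": {"v": 1}}
log_ops = [
    {"type": "index", "id": "1", "source": {"v": 2}},
    {"type": "delete", "id": "2"},
]
new_records = [{"id": "3", "source": {"v": 1}}]
result = update_then_insert(committed, log_ops, new_records)
```

If the log were skipped, the new fragment would still hold the stale version of document 1 and the deleted document 2.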
The detailed description of each functional module above may specifically refer to the description of the relevant steps in the foregoing.
To sum up, the embodiments of the present application provide a data indexing method and apparatus, in which, for the to-be-inserted data records belonging to the same main fragment, the main fragment is backed up to local storage, and the to-be-inserted data records are inserted into the specified index file through a data insertion process that is independent of the ES service process in the ES cluster. On the one hand, because the data insertion process and the ES service process in the ES cluster are independent of each other, data insertion neither affects the data query service provided by the ES service process nor is limited by it. On the other hand, data transmission among the ElasticSearch service processes during data insertion is avoided, which improves data insertion efficiency.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (12)

1. A method for indexing data, the method comprising:
acquiring a plurality of data records to be inserted that need to be inserted into a specified index file;
determining a fragmentation rule of the specified index file according to metadata information of the specified index file acquired from an ElasticSearch cluster, and determining a target main fragment to which each data record to be inserted belongs from a plurality of main fragments included in the specified index file according to the fragmentation rule;
for each determined target main fragment, backing up the target main fragment from a target node in the ElasticSearch cluster, in which the target main fragment is stored, to the local through a data insertion process which is independent from an ElasticSearch service process in the ElasticSearch cluster, inserting each data record to be inserted belonging to the target main fragment into the backed-up target main fragment to obtain a target new fragment, and replacing the target main fragment on the target node with the target new fragment.
2. The method of claim 1, further comprising:
when at least one main fragment in the specified index file is replaced, closing the specified index file, and releasing data in the ElasticSearch cluster memory;
and opening the specified index file, reloading the data of each main fragment and each copy fragment of the specified index file into the ElasticSearch cluster memory, and synchronizing the data in the ElasticSearch cluster memory with the specified index file.
3. The method according to claim 2, wherein the step of reloading the data of each main fragment and each copy fragment of the specified index file into the ElasticSearch cluster memory comprises:
and when detecting that the data of any main fragment is inconsistent with the data of the copy fragment corresponding to the main fragment, triggering an internal copy mechanism of the ElasticSearch cluster to synchronize the data of the copy fragment and the data of the main fragment.
4. The method of any of claims 1-3, wherein the backed-up target main fragment comprises a fragment log file; the method further comprises the following step:
before each data record to be inserted belonging to the target main fragment is inserted into the backed-up target main fragment, updating the data recorded in the fragment log file into the backed-up target main fragment.
5. The method according to any of claims 1-3, wherein the step of inserting each data record to be inserted belonging to the target main fragment into the backed-up target main fragment comprises:
establishing index data of each data record to be inserted belonging to the target main fragment, calling a main fragment write API provided by the ElasticSearch cluster, and writing the index data into the backed-up target main fragment.
6. The method according to any one of claims 1-3, wherein the fragmentation rule for the specified index file comprises a routing field used by a routing policy for the specified index file;
the step of determining a target main fragment to which each data record to be inserted belongs from a plurality of main fragments included in the specified index file according to the fragment rule includes:
and determining the target main fragment to which each data record to be inserted belongs according to the value of the routing field of each data record to be inserted.
7. A data indexing apparatus, comprising:
the data distribution module is used for acquiring a plurality of data records to be inserted that need to be inserted into a specified index file; acquiring metadata information of the specified index file from an ElasticSearch cluster, determining a fragmentation rule of the specified index file according to the metadata information, and determining a target main fragment to which each data record to be inserted belongs from a plurality of main fragments included in the specified index file according to the fragmentation rule;
and the data insertion module is used for backing up the target main fragment from a target node in the ElasticSearch cluster, in which the target main fragment is stored, to the local through a data insertion process which is independent from an ElasticSearch service process in the ElasticSearch cluster, inserting each data record to be inserted, belonging to the target main fragment, into the backed-up target main fragment to obtain a target new fragment, and replacing the target main fragment on the target node with the target new fragment.
8. The apparatus of claim 7, further comprising:
the synchronization module is used for closing the specified index file and releasing the data in the ElasticSearch cluster memory when at least one main fragment in the specified index file is replaced; and opening the specified index file, reloading the data of each main fragment and each copy fragment of the specified index file into the ElasticSearch cluster memory, and synchronizing the data in the ElasticSearch cluster memory with the specified index file.
9. The apparatus according to claim 8, wherein, in the process of reloading each main fragment and data of each copy fragment of the specified index file into the memory of the ElasticSearch cluster, if it is detected that data of any main fragment is inconsistent with data of a copy fragment corresponding to the main fragment, the synchronization module triggers an internal copy mechanism of the ElasticSearch cluster to synchronize the data of the copy fragment and the data of the main fragment.
10. The apparatus of any of claims 7-9, wherein the backed-up target main fragment comprises a fragment log file; the apparatus further comprises:
an updating module, used for updating the data recorded in the fragment log file into the backed-up target main fragment before each data record to be inserted belonging to the target main fragment is inserted into the backed-up target main fragment.
11. The apparatus according to any of claims 7 to 9, wherein the data insertion module establishes index data of each data record to be inserted belonging to the target main fragment, and calls a main fragment write API provided by the ElasticSearch cluster to write the index data into the backed-up target main fragment.
12. The apparatus according to any one of claims 7-9, wherein the fragmentation rule for the specified index file comprises a routing field used by a routing policy for the specified index file;
and the data distribution module determines a target main fragment to which each data record to be inserted belongs according to the value of the routing field of the data record to be inserted.
CN201910627141.9A 2019-07-11 2019-07-11 Data indexing method and device Active CN110442645B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910627141.9A CN110442645B (en) 2019-07-11 2019-07-11 Data indexing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910627141.9A CN110442645B (en) 2019-07-11 2019-07-11 Data indexing method and device

Publications (2)

Publication Number Publication Date
CN110442645A CN110442645A (en) 2019-11-12
CN110442645B true CN110442645B (en) 2020-09-15

Family

ID=68430314

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910627141.9A Active CN110442645B (en) 2019-07-11 2019-07-11 Data indexing method and device

Country Status (1)

Country Link
CN (1) CN110442645B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112019605B (en) * 2020-08-13 2023-05-09 上海哔哩哔哩科技有限公司 Data distribution method and system for data stream
CN113485962B (en) * 2021-06-30 2023-08-01 中国民航信息网络股份有限公司 Log file storage method, device, equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9805108B2 (en) * 2010-12-23 2017-10-31 Mongodb, Inc. Large distributed database clustering systems and methods
JP2018505501A (en) * 2015-01-25 2018-02-22 イグアジオ システムズ エルティーディー. Application-centric object storage
CN106407376B (en) * 2016-09-12 2019-12-20 杭州数梦工场科技有限公司 Index reconstruction method and device
CN108897865A (en) * 2018-06-29 2018-11-27 北京奇虎科技有限公司 The index copy amount appraisal procedure and device of distributed type assemblies

Also Published As

Publication number Publication date
CN110442645A (en) 2019-11-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant