CN114925138A - Hadoop-based ES data synchronization method, device, equipment and medium - Google Patents

Hadoop-based ES data synchronization method, device, equipment and medium

Info

Publication number
CN114925138A
Authority
CN
China
Prior art keywords
data
preset
index
synchronized
hadoop
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210583140.0A
Other languages
Chinese (zh)
Inventor
詹芮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Property and Casualty Insurance Company of China Ltd
Original Assignee
Ping An Property and Casualty Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Property and Casualty Insurance Company of China Ltd
Priority to CN202210583140.0A
Publication of CN114925138A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1415 Saving, restoring, recovering or retrying at system level
    • G06F11/1435 Saving, restoring, recovering or retrying at system level using file system or storage system metadata
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G06F11/1448 Management of the data involved in backup or backup restore
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G06F11/1458 Management of the backup or restore process
    • G06F11/1469 Backup restoration techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/172 Caching, prefetching or hoarding of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Library & Information Science (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a data processing technology, and discloses an ES data synchronization method based on Hadoop, which comprises the following steps: deploying a pre-constructed virtual ES instance on a preset Hadoop system; acquiring data to be synchronized, and generating an index file of the data to be synchronized by using the virtual ES instance; and sending the index file to a preset ES cluster to load the data to be synchronized. The invention also provides an ES data synchronization device, equipment and a medium based on Hadoop. The invention can improve the ES data synchronization efficiency.

Description

Hadoop-based ES data synchronization method, device, equipment and medium
Technical Field
The invention relates to the technical field of data processing, in particular to a Hadoop-based ES data synchronization method and device, electronic equipment and a computer readable storage medium.
Background
With the arrival of the big data era, Internet applications commonly build their data storage and management systems on the Hadoop big data framework, while Elasticsearch (ES) is widely adopted for Internet data retrieval to serve a variety of search scenarios. In practice, data on the Hadoop side usually has to be synchronized to the ES side periodically. In the current mainstream synchronization scheme, the ES cluster pulls the data from the Hadoop side in batches, creates indexes for the fetched data, and generates shards for storage.
When the amount of data to be synchronized is very large, such a data synchronization scheme may have the following disadvantages:
1. Because the data volume is large, the ES cluster takes too long to synchronize the data.
2. Synchronization consumes a large share of the CPU resources in the ES cluster, which increases the cluster's overhead and degrades the efficiency of normal data retrieval.
Therefore, there is a need for an improved method for ES data synchronization to improve data synchronization efficiency, reduce overhead of ES clusters, and guarantee data retrieval efficiency of ES clusters.
Disclosure of Invention
The invention provides a Hadoop-based ES data synchronization method and device and a computer-readable storage medium, and mainly aims to improve the efficiency of Hadoop-based ES data synchronization.
In order to achieve the above object, the ES data synchronization method based on Hadoop provided by the present invention includes:
deploying a pre-constructed virtual ES instance on a preset Hadoop system;
acquiring data to be synchronized, and generating an index file of the data to be synchronized by using the virtual ES instance;
and sending the index file to a preset ES cluster to load the data to be synchronized.
Optionally, the deploying of the pre-constructed virtual ES instance on the preset Hadoop system includes:
acquiring networking information of a preset ES cluster;
loading the pre-constructed virtual ES instance on the preset Hadoop system according to the networking information of the ES cluster;
activating the virtual ES instance.
Optionally, the generating, by using the virtual ES instance, the index file of the data to be synchronized includes:
re-partitioning the data to be synchronized to obtain partition data with a preset number;
creating an index for each of the partition data one by one using the virtual ES instance;
and associating each partition data with the index corresponding to the partition data to generate an index file corresponding to the partition data.
Optionally, the re-partitioning the data to be synchronized to obtain a preset number of partitioned data includes:
acquiring the size of the data to be synchronized;
calculating to obtain the size of each partition according to the size of the data to be synchronized and the preset quantity;
and partitioning the data to be synchronized according to the size of each partition by using a preset repartitioning operator to obtain the partition data of the preset number.
Optionally, the creating an index for each partition data one by one using the virtual ES instance includes:
acquiring preset index global configuration information;
acquiring preset configuration content corresponding to each partition data, and configuring an index field according to the preset configuration content;
and constructing an index corresponding to each partition data according to the preset index global configuration information and the index field.
Optionally, the associating each partition data with the index corresponding to the partition data, and generating the index file corresponding to the partition data includes:
writing each partition data into a corresponding index one by one;
and performing segment refresh, segment commit and segment merge on the index into which the data has been written by using the virtual ES instance, to obtain an index file corresponding to the partition data.
Optionally, before sending the index file to a preset ES cluster for loading the data to be synchronized, the method further includes:
and synchronizing the index file to the preset file system of the Hadoop system for data backup.
In order to solve the above problem, the present invention further provides a Hadoop-based ES data synchronization apparatus, including:
the virtual ES instance deployment module is used for deploying a pre-constructed virtual ES instance on a preset Hadoop system;
the index file generation module is used for acquiring data to be synchronized and generating an index file of the data to be synchronized by using the virtual ES instance;
and the index file sending module is used for sending the index file to a preset ES cluster to load the data to be synchronized.
In order to solve the above problem, the present invention also provides an electronic device, including:
a memory storing at least one computer program; and
a processor that executes the program stored in the memory to implement the Hadoop-based ES data synchronization method described above.
In order to solve the above problem, the present invention further provides a computer-readable storage medium, in which at least one computer program is stored, and the at least one computer program is executed by a processor in an electronic device to implement the Hadoop-based ES data synchronization method described above.
According to the invention, a pre-constructed virtual ES instance is deployed on the preset Hadoop system, the index file of the data to be synchronized is generated by the virtual ES instance, and the index file is sent to the preset ES cluster. The preset ES cluster therefore does not need to create indexes for or shard the data to be synchronized; it only needs to store the data according to the index file, which reduces the load on the ES cluster during data synchronization and improves ES data synchronization efficiency.
Drawings
Fig. 1 is a schematic flow chart of an ES data synchronization method based on Hadoop according to an embodiment of the present invention;
Fig. 2 is a functional block diagram of an ES data synchronization apparatus based on Hadoop according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of an electronic device for implementing the Hadoop-based ES data synchronization method according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the application provides an ES data synchronization method based on Hadoop. The method may be executed by, without limitation, at least one electronic device, such as a server or a terminal, that can be configured to execute the method provided by the embodiment of the application. In other words, the Hadoop-based ES data synchronization method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server may be an independent server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms.
Fig. 1 is a schematic flow chart of an ES data synchronization method based on Hadoop according to an embodiment of the present invention. In this embodiment, the method for ES data synchronization based on Hadoop includes:
S1, deploying the pre-constructed virtual ES instance on a preset Hadoop system;
In the embodiment of the invention, the Hadoop-based ES data synchronization method is applied to a big data storage management system built on the Hadoop framework. Such systems are commonly used in practical big data retrieval applications.
In the embodiment of the invention, the big data storage management system built on the Hadoop framework generally comprises subsystems such as a data source, a computing framework, a cluster resource management system and a distributed file system. The data source may be a Hive data warehouse or an HBase real-time distributed database. The computing framework is generally the MapReduce distributed computing framework or the Spark in-memory computing framework. The cluster resource management system is usually YARN, and the distributed file system is typically HDFS (Hadoop Distributed File System).
In the embodiment of the present invention, the virtual ES (Elasticsearch) instance refers to a program unit that provides the same functions as an ES cluster and can create data indexes and shard data.
In detail, the deploying of the pre-constructed virtual ES instance on the preset Hadoop system includes:
acquiring networking information of a preset ES cluster; loading the pre-constructed virtual ES instance on the preset Hadoop system according to the networking information of the ES cluster; activating the virtual ES instance.
In the embodiment of the invention, the preset ES cluster refers to a group of servers that provides real-time data search and analysis in a big data retrieval application. Unlike a single ES server node, the preset ES cluster comprises a plurality of server nodes, which can be divided into master nodes and slave nodes. The master and slave nodes have different responsibilities and cooperate to create indexes for the data to be synchronized and to generate data shards; the data of different shards is stored on different slave nodes according to the index, which improves the availability and fault tolerance of the big data.
In the embodiment of the invention, the pre-constructed virtual ES instance is loaded according to the networking information of the preset ES cluster, so that the distribution and the number of master nodes and slave nodes of the virtual ES instance are consistent with the networking information of the ES cluster.
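The mirroring of the cluster's networking information can be pictured with a minimal sketch. The source does not say how the networking information is represented, so the `ClusterTopology` type, its fields and the `virtual_instance_config` helper below are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterTopology:
    """Hypothetical summary of the preset ES cluster's networking information."""
    cluster_name: str
    master_nodes: int
    slave_nodes: int


def virtual_instance_config(target: ClusterTopology) -> dict:
    """Build a config for the virtual ES instance that mirrors the target cluster.

    Assumption: matching the node counts is enough to stand in for
    "consistent with the networking information" in this sketch.
    """
    return {
        "cluster_name": f"{target.cluster_name}-virtual",
        "master_nodes": target.master_nodes,
        "slave_nodes": target.slave_nodes,
    }


target = ClusterTopology(cluster_name="search", master_nodes=3, slave_nodes=6)
config = virtual_instance_config(target)
```

The point of the sketch is only that the virtual instance is derived from the target cluster's layout rather than configured independently, so that the index files it produces fit the cluster that will load them.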
S2, acquiring data to be synchronized, and generating an index file of the data to be synchronized by using the virtual ES instance;
In the embodiment of the present invention, the data to be synchronized refers to data that needs to be synchronized from the big data storage management system built on the Hadoop framework to the preset ES cluster, so that the preset ES cluster can perform the relevant data retrieval operations on it.
In the embodiment of the present invention, the index file defines, for the data type and storage logic of the data to be synchronized, the mapping between each index of the data to be synchronized and the storage space that holds the data corresponding to that index.
In the embodiment of the present invention, by comparison with a relational database, ES may be understood as a document-oriented database in which one piece of data is one document. ES defines the logical storage and the type of documents through indexes. Each type can contain a plurality of documents and each document can contain a plurality of fields; creating an index for each field makes every document retrievable and improves retrieval efficiency.
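As a rough illustration of per-field indexing, the sketch below builds a toy inverted index over the fields of a few documents. It is a simplified model written for this explanation, not Elasticsearch code; the sample documents and the `build_field_index` and `search` helpers are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample documents: document id -> fields.
documents = {
    1: {"title": "hadoop data sync", "owner": "alice"},
    2: {"title": "es index file", "owner": "bob"},
    3: {"title": "hadoop index backup", "owner": "alice"},
}


def build_field_index(docs):
    """Map every (field, term) pair to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, fields in docs.items():
        for field, value in fields.items():
            for term in value.lower().split():
                index[(field, term)].add(doc_id)
    return index


field_index = build_field_index(documents)


def search(field, term):
    """Return the ids of documents whose given field contains the term."""
    return sorted(field_index.get((field, term.lower()), set()))
```

With this structure a lookup such as `search("title", "hadoop")` reads one index entry instead of scanning every document, which is the efficiency gain the paragraph above refers to.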
In detail, the generating the index file of the data to be synchronized by using the virtual ES instance includes: re-partitioning the data to be synchronized to obtain partition data with a preset number; creating an index for each of the partition data one by one using the virtual ES instance; and associating each partition data with the index corresponding to the partition data to generate an index file corresponding to the partition data.
It will be appreciated that the data to be synchronized already has a certain organization; for example, it may be distributed across different rows and columns in different storage areas according to the logical relationships between the data.
In the embodiment of the invention, in the big data storage management system constructed based on the Hadoop framework, the virtual ES instance can simultaneously create indexes of a plurality of partitioned data by re-partitioning the data to be synchronized, so that the efficiency of the virtual ES instance in creating the indexes of the data to be synchronized is improved.
In detail, the re-partitioning the data to be synchronized to obtain a preset number of partitioned data includes: acquiring the size of the data to be synchronized; calculating the size of each partition according to the size of the data to be synchronized and the preset number; and partitioning the data to be synchronized according to the size of each partition by using a preset repartitioning operator to obtain the partition data of the preset number.
In the embodiment of the invention, the size of each partition data is determined by the size of the data to be synchronized and the repartitioning operator, and the larger each partition data is, the smaller the corresponding partition number is. Similarly, the smaller each partition data is, the larger the number of corresponding partitions is.
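The relationship between total size, partition count and partition size is simple arithmetic. The sketch below assumes sizes are measured in bytes and rounds up so that the preset number of partitions always covers the data; the function name is hypothetical.

```python
def partition_size(total_bytes: int, partition_count: int) -> int:
    """Return the per-partition size, rounded up to a whole byte."""
    if partition_count <= 0:
        raise ValueError("partition_count must be positive")
    # Integer ceiling division avoids floating-point error on very large sizes.
    return (total_bytes + partition_count - 1) // partition_count


size_8 = partition_size(10 * 1024**3, 8)    # 10 GiB split into 8 partitions
size_16 = partition_size(10 * 1024**3, 16)  # same data, twice as many partitions
```

Doubling the partition count halves the partition size, which matches the inverse relationship described above.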
In the embodiment of the present invention, the repartitioning operator may be a "repartition" operator, which traverses each record in the data to be synchronized and randomly assigns it to one of the new partitions.
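The random assignment described above can be sketched in a few lines of plain Python. This is a stand-in for illustration only: the source does not name the framework that supplies the operator, and the `repartition` function below is a hypothetical local helper, not a call into any library.

```python
import random


def repartition(records, partition_count, seed=None):
    """Traverse the records and assign each one to a randomly chosen partition."""
    rng = random.Random(seed)
    partitions = [[] for _ in range(partition_count)]
    for record in records:
        partitions[rng.randrange(partition_count)].append(record)
    return partitions


records = list(range(1000))
parts = repartition(records, 4, seed=0)
```

Every record lands in exactly one partition, so the partitions together hold the same data as the input; random assignment tends to even out partition sizes when the number of records is large.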
In detail, the creating an index for each of the partition data one by one using the virtual ES instance includes: acquiring preset index global configuration information; acquiring preset configuration content corresponding to each partition data, and configuring an index field according to the preset configuration content; and constructing an index corresponding to each partition data according to the preset index global configuration information and the index field.
In the embodiment of the present invention, the index may be understood as the place where ES stores the data to be synchronized, analogous to a database in a relational database system. The data to be synchronized is stored and indexed in shards, and an index is a logical space that groups one or more shards together.
In the embodiment of the present invention, the preset index global configuration information includes, but is not limited to, the number of shards and replicas, the data synchronization frequency, and the like.
In the embodiment of the invention, one index can store entities that serve different purposes. The main entity stored in an index is called a document; different entities within a single index can be distinguished by index type, which can be understood as a table in a relational database. ES is an unstructured database: each document is made up of multiple fields and has a unique identifier.
In the embodiment of the present invention, the preset configuration content includes, but is not limited to, a keyword, a word frequency, a preset field attribute, and the like of each document.
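To make the two configuration inputs concrete, the sketch below combines a global configuration and a per-field configuration into an index definition shaped like an Elasticsearch create-index request body (`settings` plus `mappings.properties`). The field names and values are hypothetical, and mapping the "data synchronization frequency" to `refresh_interval` is an assumption made here, not something the source states.

```python
import json


def build_index_definition(global_config: dict, field_config: dict) -> dict:
    """Combine global settings and per-field mappings into one index definition."""
    return {
        "settings": {
            "number_of_shards": global_config["shards"],
            "number_of_replicas": global_config["replicas"],
            # Assumption: the synchronization frequency is expressed as a refresh interval.
            "refresh_interval": global_config["refresh_interval"],
        },
        "mappings": {"properties": dict(field_config)},
    }


# Hypothetical configuration values for one partition's index.
global_config = {"shards": 3, "replicas": 1, "refresh_interval": "1s"}
field_config = {
    "policy_id": {"type": "keyword"},
    "description": {"type": "text"},
    "created_at": {"type": "date"},
}

index_definition = build_index_definition(global_config, field_config)
print(json.dumps(index_definition, indent=2))
```

Keeping the global settings separate from the field mappings lets the same shard and replica settings be reused for every partition while the field configuration varies per partition.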
In detail, the associating each partition data with the index corresponding to the partition data to generate the index file corresponding to the partition data includes: writing each partition data into the corresponding index one by one; and performing segment refresh, segment commit and segment merge on the index into which the data has been written by using the virtual ES instance, to obtain an index file corresponding to the partition data.
In the embodiment of the invention, documents are first written to an in-memory buffer. The segment refresh operation writes all documents in the in-memory buffer into a new segment in the file system cache and then empties the buffer. By default a refresh is performed every 1 s, so a new segment is generated every 1 s; the segment is opened after the refresh, which makes its documents searchable in near real time.
In the embodiment of the present invention, the segment commit operation means that the virtual ES instance flushes all segments to disk in one full commit and writes a commit point that lists all committed segments. When starting up or reopening an index, the preset ES cluster uses the commit point to determine which segments belong to the current shard, which ensures data safety.
In the embodiment of the present invention, each segment refresh operation on the data in the index generates a new segment, so the number of segments grows over time. Because the preset ES cluster has to scan all segments in turn for every data retrieval operation, a large number of segments slows retrieval. The segment merge operation combines the segments into fewer, larger segments to keep retrieval fast.
In this embodiment of the present invention, the segment refresh operation may be implemented by the refresh interface provided by the virtual ES instance, the segment commit operation by the flush interface, and the segment merge operation by the merge interface.
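The interplay of the three operations can be followed in a toy model of a single shard. The class below is an illustrative simplification written for this explanation, not Elasticsearch code: segments are plain lists, "disk" is not modeled, and the commit point is reduced to a list of segment sizes.

```python
class ToyShard:
    """Toy model of one shard's segment lifecycle (illustrative only)."""

    def __init__(self):
        self.buffer = []        # written documents, not yet searchable
        self.segments = []      # searchable segments, each a list of documents
        self.commit_point = []  # documents per segment at the last commit

    def write(self, doc):
        self.buffer.append(doc)

    def refresh(self):
        # Move buffered documents into a new segment and empty the buffer.
        if self.buffer:
            self.segments.append(self.buffer)
            self.buffer = []

    def flush(self):
        # Refresh first, then record which segments make up the shard.
        self.refresh()
        self.commit_point = [len(seg) for seg in self.segments]

    def merge(self):
        # Combine all segments into one so that a search scans fewer segments.
        merged = [doc for seg in self.segments for doc in seg]
        self.segments = [merged] if merged else []
        self.commit_point = [len(seg) for seg in self.segments]

    def search(self, term):
        return [doc for seg in self.segments for doc in seg if term in doc]


shard = ToyShard()
shard.write("hadoop sync")
before_refresh = shard.search("hadoop")  # the buffer is not searchable yet
shard.refresh()
shard.write("hadoop backup")
shard.flush()                            # second segment created and committed
segments_before_merge = len(shard.segments)
shard.merge()
after_merge = shard.search("hadoop")
```

In this model a document becomes searchable only after a refresh, the commit point always describes the segments that currently make up the shard, and a merge leaves the search results unchanged while reducing the number of segments that have to be scanned.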
S3, sending the index file to a preset ES cluster to load the data to be synchronized.
In the embodiment of the present invention, the preset ES cluster only needs to read the index file and store the data to be synchronized according to the data shards and index relationships defined by the index file; it does not need to analyze the data to be synchronized or create an index.
Preferably, before sending the index file to the preset ES cluster for loading the data to be synchronized, the method further includes: synchronizing the index file to a preset file system of the Hadoop system for data backup.
In the embodiment of the present invention, if synchronization of the data to be synchronized from the preset Hadoop system to the preset ES cluster fails because of an abnormal condition, the related data can be recovered from the index file stored in the file system. The file system may be HDFS (Hadoop Distributed File System).
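The backup-then-send order and the recovery path can be sketched as follows. The source does not describe the transfer mechanism or the failure signal, so this sketch makes several assumptions: a plain dictionary stands in for the HDFS backup, a failed transfer is signalled by `ConnectionError`, and `sync_index_files` and `flaky_send` are hypothetical helpers.

```python
def sync_index_files(index_files, backup_store, send, max_attempts=3):
    """Back up every index file, then send; on failure, resend from the backup."""
    backup_store.update(index_files)  # backup first (stand-in for the file system)
    pending = dict(index_files)
    for attempt in range(1, max_attempts + 1):
        failed = {}
        for name, payload in pending.items():
            try:
                send(name, payload)
            except ConnectionError:
                failed[name] = backup_store[name]  # recover from the backup copy
        if not failed:
            return attempt
        pending = failed
    raise RuntimeError(f"could not deliver: {sorted(pending)}")


# Hypothetical receiver that fails once for one file, then succeeds.
delivered = {}
failures_left = {"shard-1.idx": 1}


def flaky_send(name, payload):
    if failures_left.get(name, 0) > 0:
        failures_left[name] -= 1
        raise ConnectionError(name)
    delivered[name] = payload


backup = {}
attempts = sync_index_files(
    {"shard-0.idx": b"aaa", "shard-1.idx": b"bbb"}, backup, flaky_send
)
```

Because the backup is written before any transfer starts, a failed transfer never requires regenerating the index file; only the files that failed are resent.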
According to the embodiment of the invention, a pre-constructed virtual ES instance is deployed on the preset Hadoop system, the index file of the data to be synchronized is generated by the virtual ES instance, and the index file is sent to the preset ES cluster. The preset ES cluster therefore does not need to create indexes for or shard the data to be synchronized; it only needs to store the data according to the index file, which reduces the load on the ES cluster during data synchronization and improves ES data synchronization efficiency.
Fig. 2 is a functional block diagram of a Hadoop-based ES data synchronization apparatus according to an embodiment of the present invention.
The Hadoop-based ES data synchronization device 100 can be installed in electronic equipment. According to the implemented functions, the Hadoop-based ES data synchronization apparatus 100 may include a virtual ES instance deployment module 101, an index file generation module 102, and an index file transmission module 103. The module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and that can perform a fixed function, and that are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the virtual ES instance deployment module 101 is configured to deploy a pre-constructed virtual ES instance on a preset Hadoop system;
the index file generation module 102 is configured to obtain data to be synchronized, and generate an index file of the data to be synchronized by using the virtual ES instance;
the index file sending module 103 is configured to send the index file to a preset ES cluster to load the data to be synchronized.
In detail, the specific implementation of each module of the Hadoop-based ES data synchronization apparatus 100 is as follows:
Step one, deploying a pre-constructed virtual ES instance on a preset Hadoop system;
In the embodiment of the invention, the Hadoop-based ES data synchronization method is applied to a big data storage management system built on the Hadoop framework. Such systems are commonly used in practical big data retrieval applications.
In the embodiment of the invention, the big data storage management system built on the Hadoop framework generally comprises subsystems such as a data source, a computing framework, a cluster resource management system and a distributed file system. The data source may be a Hive data warehouse or an HBase real-time distributed database. The computing framework is generally the MapReduce distributed computing framework or the Spark in-memory computing framework. The cluster resource management system is usually YARN, and the distributed file system is typically HDFS (Hadoop Distributed File System).
In the embodiment of the present invention, the virtual ES (Elasticsearch) instance refers to a program unit that provides the same functions as an ES cluster and can create data indexes and shard data.
In detail, the deploying of the pre-constructed virtual ES instance on the preset Hadoop system includes:
acquiring networking information of a preset ES cluster; loading the pre-constructed virtual ES instance on the preset Hadoop system according to the networking information of the ES cluster; activating the virtual ES instance.
In the embodiment of the invention, the preset ES cluster refers to a group of servers that provides real-time data search and analysis in a big data retrieval application. Unlike a single ES server node, the preset ES cluster comprises a plurality of server nodes, which can be divided into master nodes and slave nodes. The master and slave nodes have different responsibilities and cooperate to create indexes for the data to be synchronized and to generate data shards; the data of different shards is stored on different slave nodes according to the index, which improves the availability and fault tolerance of the big data.
In the embodiment of the invention, the pre-constructed virtual ES instance is loaded according to the networking information of the preset ES cluster, so that the distribution and the number of master nodes and slave nodes of the virtual ES instance are consistent with the networking information of the ES cluster.
Step two, acquiring data to be synchronized, and generating an index file of the data to be synchronized by using the virtual ES instance;
In the embodiment of the present invention, the data to be synchronized refers to data that needs to be synchronized from the big data storage management system built on the Hadoop framework to the preset ES cluster, so that the preset ES cluster can perform the relevant data retrieval operations on it.
In the embodiment of the present invention, the index file defines, for the data type and storage logic of the data to be synchronized, the mapping between each index of the data to be synchronized and the storage space that holds the data corresponding to that index.
In the embodiment of the present invention, by comparison with a relational database, ES may be understood as a document-oriented database in which one piece of data is one document. ES defines the logical storage and the type of documents through indexes. Each type can contain a plurality of documents and each document can contain a plurality of fields; creating an index for each field makes every document retrievable and improves retrieval efficiency.
In detail, the generating the index file of the data to be synchronized by using the virtual ES instance includes: re-partitioning the data to be synchronized to obtain partition data with a preset number; creating an index for each of the partition data one by one using the virtual ES instance; and associating each partition data with the index corresponding to the partition data to generate an index file corresponding to the partition data.
It will be appreciated that the data to be synchronized already has a certain data organization; for example, it may be distributed across different rows and columns within different storage areas according to the logical relationships between the data.
In the embodiment of the invention, in the big data storage management system constructed based on the Hadoop framework, the virtual ES instance can simultaneously create indexes of a plurality of partitioned data by re-partitioning the data to be synchronized, so that the efficiency of the virtual ES instance in creating the indexes of the data to be synchronized is improved.
In detail, the re-partitioning the data to be synchronized to obtain a preset number of partitioned data includes: acquiring the size of the data to be synchronized; calculating the size of each partition according to the size of the data to be synchronized and the preset number; and partitioning the data to be synchronized according to the size of each partition by using a preset repartitioning operator to obtain the partition data of the preset number.
In the embodiment of the invention, the size of each partition data is determined by the size of the data to be synchronized and the repartitioning operator. For a given amount of data, the larger each partition data is, the smaller the number of partitions; conversely, the smaller each partition data is, the larger the number of partitions.
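The inverse relationship between partition size and partition count can be illustrated with a short calculation. This is a minimal sketch, not the patented implementation: the function names `partition_size` and `partition_count` are hypothetical, and rounding up so that no data is left over is an assumption.

```python
def partition_size(total_bytes: int, preset_count: int) -> int:
    """Size of each partition when the data is split into preset_count parts."""
    if preset_count <= 0:
        raise ValueError("preset_count must be positive")
    # Integer ceiling division, so the last partition absorbs any remainder.
    return (total_bytes + preset_count - 1) // preset_count


def partition_count(total_bytes: int, size_per_partition: int) -> int:
    """Number of partitions needed when each holds at most size_per_partition bytes."""
    if size_per_partition <= 0:
        raise ValueError("size_per_partition must be positive")
    return (total_bytes + size_per_partition - 1) // size_per_partition


if __name__ == "__main__":
    total = 10 * 1024 ** 3  # assumed example: 10 GiB of data to be synchronized
    for count in (8, 32, 128):
        print(count, "partitions ->", partition_size(total, count), "bytes each")
```

Doubling the preset number roughly halves the size of each partition, which matches the relationship described above.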
In the embodiment of the invention, the repartitioning operator can adopt a "repartition" algorithm, that is, traversing each record in the data to be synchronized and randomly assigning each traversed record to a new partition.
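The random redistribution described above can be sketched in plain Python. The source does not name the processing engine, so this is a stand-in rather than any framework's operator; the function name `random_repartition` and the optional `seed` parameter are assumptions added for illustration and reproducibility.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def random_repartition(
    records: Iterable[T], num_partitions: int, seed: Optional[int] = None
) -> List[List[T]]:
    """Traverse the records once and assign each to a randomly chosen partition."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    rng = random.Random(seed)
    partitions: List[List[T]] = [[] for _ in range(num_partitions)]
    for record in records:
        # Each record lands in exactly one new partition, chosen uniformly.
        partitions[rng.randrange(num_partitions)].append(record)
    return partitions


if __name__ == "__main__":
    parts = random_repartition(range(1000), num_partitions=4, seed=0)
    print([len(p) for p in parts])
```

Because the assignment is uniform, partition sizes are only approximately equal; an operator that must hit an exact size per partition would need a different assignment rule.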
In detail, the creating an index for each of the partition data one by one using the virtual ES instance includes: acquiring preset index global configuration information; acquiring preset configuration content corresponding to each partition data, and configuring an index field according to the preset configuration content; and constructing an index corresponding to each partition data according to the preset index global configuration information and the index field.
In the embodiment of the present invention, the index may be understood as the place where the ES stores the data to be synchronized, comparable to a database in a relational database system. The data to be synchronized is stored and indexed in shards, and an index is a logical space that groups one or more shards together.
In the embodiment of the present invention, the preset index global configuration information includes, but is not limited to, the number of shards and replicas, a data synchronization frequency, and the like.
In the embodiment of the invention, one index object can store a plurality of entities with different purposes. The main entity stored in an index object is called a document, and one index object can store a plurality of documents with different purposes. Different entities in a single index can be distinguished by index type, and these entities can be understood as tables in a relational database. ES is an unstructured database; each document is composed of multiple fields and has a unique identifier.
In the embodiment of the present invention, the preset configuration content includes, but is not limited to, a keyword, a word frequency, a preset field attribute, and the like of each document.
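As an illustration of how global configuration and per-field configuration might combine into one index definition, the sketch below builds a dictionary in the shape of an Elasticsearch create-index request body (`settings` plus `mappings.properties`). Several things here are assumptions: the helper name `build_index_body`, the example field names, and in particular the choice to express the "data synchronization frequency" as `refresh_interval`, which the source does not specify.

```python
import json


def build_index_body(global_config: dict, field_config: dict) -> dict:
    """Combine global index settings with per-field mappings into one body."""
    settings = {
        "number_of_shards": global_config["shards"],
        "number_of_replicas": global_config["replicas"],
        # Assumption: the synchronization frequency is modelled as the refresh interval.
        "refresh_interval": global_config["refresh_interval"],
    }
    # Copy each field's attributes so the caller's configuration is not mutated.
    mappings = {
        "properties": {name: dict(attrs) for name, attrs in field_config.items()}
    }
    return {"settings": settings, "mappings": mappings}


# Hypothetical configuration values, for illustration only.
GLOBAL_CONFIG = {"shards": 3, "replicas": 1, "refresh_interval": "1s"}
FIELD_CONFIG = {
    "policy_id": {"type": "keyword"},
    "description": {"type": "text"},
    "claim_count": {"type": "integer"},
}

if __name__ == "__main__":
    print(json.dumps(build_index_body(GLOBAL_CONFIG, FIELD_CONFIG), indent=2))
```

Keeping the global settings separate from the field configuration mirrors the two inputs named in the text, so the same global settings can be reused for the index of every partition.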
In detail, the associating each partition data with the index corresponding to the partition data to generate the index file corresponding to the partition data includes: writing each partition data into its corresponding index one by one; and performing segment refresh, segment commit, and segment merge operations on the index after the data is written by using the virtual ES instance to obtain an index file corresponding to the partition data.
In the embodiment of the invention, documents are first written into a memory buffer. The segment refresh operation writes all documents in the memory buffer into a new segment in the file system cache and then empties the memory buffer. By default the segment refresh operation runs every 1 s, so a new segment is generated every 1 s. After the refresh the segment is opened for searching, which achieves near-real-time search.
In the embodiment of the present invention, the segment commit operation means that the virtual ES instance fully commits and flushes every segment to disk at one time and writes a commit point listing all committed segments. When starting up or reopening an index, the preset ES cluster uses the commit point to determine which segments belong to the current shard, which ensures the safety of the data.
In the embodiment of the present invention, the segment merge operation addresses the following problem: each segment refresh operation on the data in the index generates a new segment, so the number of segments grows over time, and the preset ES cluster may have to scan all segments in turn for every data retrieval operation, which slows retrieval. Merging the small segments into larger ones reduces the number of segments that must be scanned.
In this embodiment of the present invention, the segment refresh operation may be implemented by a refresh interface provided by the virtual ES instance, the segment commit operation may be implemented by a flush interface provided by the virtual ES instance, and the segment merge operation may be implemented by a merge interface provided by the virtual ES instance.
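The interplay of the three operations can be illustrated with a toy in-memory model. This is a deliberately simplified sketch and not how the virtual ES instance is implemented: the class `ToyShard` and its methods are hypothetical, segments are plain tuples, and the "commit point" is just a snapshot of the current segment list.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ToyShard:
    buffer: List[str] = field(default_factory=list)                  # unsearchable writes
    segments: List[Tuple[str, ...]] = field(default_factory=list)    # searchable segments
    commit_point: List[Tuple[str, ...]] = field(default_factory=list)  # durable segments

    def write(self, doc: str) -> None:
        self.buffer.append(doc)

    def refresh(self) -> None:
        """Turn the buffered documents into a new searchable segment."""
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def flush(self) -> None:
        """Refresh, then record every current segment in the commit point."""
        self.refresh()
        self.commit_point = list(self.segments)

    def merge(self) -> None:
        """Combine all segments into one and record the result as committed."""
        if len(self.segments) > 1:
            merged = tuple(doc for segment in self.segments for doc in segment)
            self.segments = [merged]
            self.commit_point = list(self.segments)

    def search(self, predicate: Callable[[str], bool]) -> List[str]:
        # Only refreshed segments are visible; the buffer is not searched.
        return [d for segment in self.segments for d in segment if predicate(d)]


if __name__ == "__main__":
    shard = ToyShard()
    shard.write("doc-1")
    shard.refresh()
    shard.write("doc-2")
    shard.flush()
    shard.merge()
    print(len(shard.segments), shard.search(lambda d: True))
```

In the model, a document becomes searchable only after `refresh`, becomes part of the commit point only after `flush`, and `merge` reduces the number of segments a search has to visit.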
Step three, sending the index file to a preset ES cluster to load the data to be synchronized.
In the embodiment of the present invention, the preset ES cluster only needs to read the index file and store the data to be synchronized according to the sharding and index relationships defined by the index file; it does not need to analyze the data to be synchronized or create an index for it.
Preferably, before sending the index file to the preset ES cluster to load the data to be synchronized, the method further includes: synchronizing the index file to a preset file system of the Hadoop system for data backup.
In the embodiment of the present invention, if synchronization fails because of an abnormal condition while the preset ES cluster is synchronizing the data to be synchronized from the preset Hadoop system, the related data can be recovered from the index file stored in the file system. The file system may be HDFS (Hadoop Distributed File System).
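The backup-before-send order can be sketched as follows. This is a minimal illustration under stated assumptions: a local directory stands in for HDFS, the `send` callable stands in for the transfer to the ES cluster, and the recovery policy shown (one retry from the backup copy) is hypothetical, since the source does not describe the recovery procedure in detail.

```python
import shutil
from pathlib import Path
from typing import Callable


def backup_then_send(
    index_file: Path, backup_dir: Path, send: Callable[[Path], None]
) -> Path:
    """Copy the index file to the backup location, then send it to the cluster."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    backup_copy = backup_dir / index_file.name
    # The backup is written first, so it exists even if sending fails.
    shutil.copy2(index_file, backup_copy)
    try:
        send(index_file)
    except Exception:
        # Assumed recovery path: retry once using the backed-up copy.
        send(backup_copy)
    return backup_copy
```

A caller would pass a function that performs the actual transfer; if that function raises on the first attempt, the second attempt reads from the backup copy rather than the original file.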
According to the embodiment of the invention, the pre-constructed virtual ES instance is deployed on the preset Hadoop system, the index file of the data to be synchronized is generated by the virtual ES instance, and the index file is sent to the preset ES cluster. The preset ES cluster does not need to create indexes for or shard the data to be synchronized; it only needs to store the data according to the index file, which reduces the load on the ES cluster during data synchronization and improves ES data synchronization efficiency.
Therefore, the ES data synchronization device based on Hadoop provided by the invention can improve ES data synchronization efficiency.
Fig. 3 is a schematic structural diagram of an electronic device for implementing the Hadoop-based ES data synchronization method according to an embodiment of the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as a Hadoop based ES data synchronization program, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, e.g. a removable hard disk of the electronic device 1. The memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as codes of a Hadoop-based ES data synchronization program, but also to temporarily store data that has been output or is to be output.
The processor 10 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The processor 10 is the control unit of the electronic device: it connects the components of the electronic device through various interfaces and lines, and it executes the functions of the electronic device 1 and processes its data by running or executing programs or modules (e.g., the Hadoop-based ES data synchronization program) stored in the memory 11 and calling data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
Fig. 3 shows only an electronic device with some components. Those skilled in the art will understand that the structure shown in Fig. 3 does not limit the electronic device 1, which may include fewer or more components than shown, combine some components, or arrange the components differently.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component. Preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that charge management, discharge management, power consumption management, and similar functions are implemented through the power management device. The power supply may also include one or more DC or AC power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a Bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 1 may further include a network interface. Optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface, a Bluetooth interface, etc.), which is generally used to establish a communication connection between the electronic device 1 and another electronic device.
Optionally, the electronic device 1 may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
It is to be understood that the embodiments described are illustrative only and are not to be construed as limiting the scope of the claims.
The Hadoop-based ES data synchronization program stored in the memory 11 of the electronic device 1 is a combination of a plurality of instructions, and when running in the processor 10, can realize:
deploying a pre-constructed virtual ES instance on a preset Hadoop system;
acquiring data to be synchronized, and generating an index file of the data to be synchronized by using the virtual ES instance;
and sending the index file to a preset ES cluster to load the data to be synchronized.
Specifically, the specific implementation method of the processor 10 for the instruction may refer to the description of the relevant steps in the embodiment corresponding to fig. 1, which is not described herein again.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. The computer-readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include: any entity or device capable of carrying said computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a Read-Only Memory (ROM).
The present invention also provides a computer-readable storage medium, storing a computer program which, when executed by a processor of an electronic device, may implement:
deploying a pre-constructed virtual ES instance on a preset Hadoop system;
acquiring data to be synchronized, and generating an index file of the data to be synchronized by using the virtual ES instance;
and sending the index file to a preset ES cluster to load the data to be synchronized.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The embodiment of the application can acquire and process related data based on an artificial intelligence technology. Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.
Furthermore, it will be obvious that the term "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A Hadoop-based ES data synchronization method is characterized by comprising the following steps:
deploying a pre-constructed virtual ES instance on a preset Hadoop system;
acquiring data to be synchronized, and generating an index file of the data to be synchronized by using the virtual ES instance;
and sending the index file to a preset ES cluster to load the data to be synchronized.
2. The Hadoop-based ES data synchronization method as claimed in claim 1, wherein the deploying a pre-constructed virtual ES instance on a preset Hadoop system comprises:
acquiring networking information of a preset ES cluster;
loading the pre-constructed virtual ES instance on the preset Hadoop system according to the networking information of the ES cluster;
activating the virtual ES instance.
3. The Hadoop-based ES data synchronization method of claim 1, wherein the generating the index file of the data to be synchronized by using the virtual ES instance comprises:
re-partitioning the data to be synchronized to obtain partition data with a preset number;
creating an index for each of the partition data one by one using the virtual ES instance;
and associating each partition data with the index corresponding to the partition data to generate an index file corresponding to the partition data.
4. The Hadoop-based ES data synchronization method of claim 3, wherein the re-partitioning the data to be synchronized to obtain a preset number of partitioned data comprises:
acquiring the size of the data to be synchronized;
calculating the size of each partition according to the size of the data to be synchronized and the preset number;
and partitioning the data to be synchronized according to the size of each partition by using a preset repartitioning operator to obtain the partition data with the preset quantity.
5. The Hadoop-based ES data synchronization method of claim 3 wherein said creating an index for each of said partitioned data one by one using said virtual ES instance comprises:
acquiring preset index global configuration information;
acquiring preset configuration content corresponding to each partition data, and configuring an index field according to the preset configuration content;
and constructing an index corresponding to each partition data according to the preset index global configuration information and the index field.
6. The Hadoop-based ES data synchronization method as claimed in claim 3, wherein the associating each of the partition data with the index corresponding to the partition data to generate the index file corresponding to the partition data comprises:
writing each partition data into a corresponding index one by one;
and performing segment refresh, segment commit, and segment merge operations on the index after the data is written by using the virtual ES instance to obtain an index file corresponding to the partition data.
7. The Hadoop-based ES data synchronization method according to any one of claims 1 to 6, wherein before sending the index file to a preset ES cluster for loading the data to be synchronized, the method further comprises:
and synchronizing the index file to the preset file system of the Hadoop system for data backup.
8. A Hadoop-based ES data synchronization apparatus, the apparatus comprising:
the virtual ES instance deployment module is used for deploying a pre-constructed virtual ES instance on a preset Hadoop system;
the index file generation module is used for acquiring data to be synchronized and generating an index file of the data to be synchronized by using the virtual ES instance;
and the index file sending module is used for sending the index file to a preset ES cluster to load the data to be synchronized.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program being executed by the at least one processor to enable the at least one processor to perform the Hadoop-based ES data synchronization method according to any one of claims 1 to 7.
10. A computer-readable storage medium, storing a computer program, which when executed by a processor implements the Hadoop-based ES data synchronization method according to any one of claims 1 to 7.
CN202210583140.0A 2022-05-25 2022-05-25 Hadoop-based ES data synchronization method, device, equipment and medium Pending CN114925138A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210583140.0A CN114925138A (en) 2022-05-25 2022-05-25 Hadoop-based ES data synchronization method, device, equipment and medium


Publications (1)

Publication Number Publication Date
CN114925138A true CN114925138A (en) 2022-08-19

Family

ID=82810658

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210583140.0A Pending CN114925138A (en) 2022-05-25 2022-05-25 Hadoop-based ES data synchronization method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN114925138A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463692A (en) * 2017-08-11 2017-12-12 山东合天智汇信息技术有限公司 Super large text data is synchronized to the method and system of search engine
CN111506646A (en) * 2020-03-16 2020-08-07 阿里巴巴集团控股有限公司 Data synchronization method, device, system, storage medium and processor
CN113407634A (en) * 2021-07-05 2021-09-17 挂号网(杭州)科技有限公司 Data synchronization method, device, system, server and storage medium
CN113590703A (en) * 2021-08-10 2021-11-02 平安银行股份有限公司 ES data importing method and device, electronic equipment and readable storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination