CN112699118A

CN112699118A - Data synchronization method, corresponding device, system and storage medium

Info

Publication number: CN112699118A
Application number: CN202011565310.XA
Authority: CN
Inventors: 徐辉
Original assignee: BOE Technology Group Co Ltd
Current assignee: BOE Technology Group Co Ltd
Priority date: 2020-12-25
Filing date: 2020-12-25
Publication date: 2021-04-23

Abstract

The application provides a data synchronization method and a corresponding device, system and storage medium. The data synchronization method comprises the following steps: the distributed file system sends source data layer data to a Hive server, Hive being a data warehouse tool; the distributed file system receives unified data warehouse layer data that the Hive server generates from the source data layer data and feeds back; the distributed file system sends the source data layer data and the unified data warehouse layer data to a data center station system; and the data center station system receives and stores the source data layer data and the unified data warehouse layer data. In this way, the source data layer data and the unified data warehouse layer data in the distributed file system are synchronized to the data center station system, which provides a safe backup of the data in the distributed file system and effectively guards against accidental data deletion and data being transmitted to the wrong directory.

Description

Data synchronization method, corresponding device, system and storage medium

Technical Field

The present application relates to the field of data analysis technologies, and in particular, to a data synchronization method, and a corresponding apparatus, system, and storage medium.

Background

With the continuous enrichment and improvement of data collection means, more and more industry data is being accumulated, and data volumes have grown to a scale that traditional software cannot handle. When massive data is analyzed, the processing capacity of a single server is limited, so data readers and data analysts usually adopt distributed computing. In such a big data environment, data is easily deleted by mistake or transmitted to the wrong directory.

Disclosure of Invention

In view of the shortcomings of existing approaches, the application provides a data synchronization method and a corresponding device, system and storage medium, to solve the technical problem in the prior art that data is easily deleted by mistake or transmitted to the wrong directory in a big data environment.

In a first aspect, an embodiment of the present application provides a data synchronization method, which is applied to a distributed file system, and includes:

sending source data layer data to a Hive server of a data warehouse tool;

receiving unified data warehouse layer data which is generated and fed back by the Hive server based on the source data layer data;

and sending the source data layer data and the unified data warehouse layer data to a data center station system.

In a second aspect, an embodiment of the present application provides a distributed file system, including:

a memory;

a processor electrically connected to the memory;

the memory stores a computer program, and the computer program is executed by the processor to implement the data synchronization method provided by the first aspect of the embodiments of the present application.

In a third aspect, an embodiment of the present application provides a data synchronization method, which is applied to a data center station system, and includes:

receiving and storing source data layer data and unified data warehouse layer data sent by a distributed file system; the unified data warehouse layer data is generated by a Hive server based on the source data layer data and fed back to the distributed file system.

In a fourth aspect, an embodiment of the present application provides a data center station system, including: a data center station main server;

the data center station main server comprises: a memory and a processor electrically connected;

the memory stores a computer program that is executed by the processor to implement the data synchronization method provided by the third aspect of the embodiments of the present application.

In a fifth aspect, an embodiment of the present application provides a data synchronization system, including: a columnar storage server, a Hive server, the distributed file system of the second aspect of the embodiments of the present application, and the data center station system of the fourth aspect of the embodiments of the present application;

the data center station system and the Hive server are each in communication connection with the distributed file system;

the columnar storage server is used for: sending unified data warehouse layer data in a columnar storage database to the Hive server;

the Hive server is used for: receiving the source data layer data sent by the distributed file system; generating unified data warehouse layer data based on the source data layer data; and sending the unified data warehouse layer data to the distributed file system.

In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements: the data synchronization method provided by the first aspect of the embodiment of the present application or the data synchronization method provided by the third aspect of the embodiment of the present application.

The technical scheme provided by the embodiment of the application at least has the following beneficial effects:

according to the embodiments of the present application, unified data warehouse layer data can be generated based on the source data layer data in the distributed file system, and the source data layer data and the unified data warehouse layer data are synchronized to the data center station system. This provides a safe backup of the data in the distributed file system and effectively guards against accidental data deletion and data being transmitted to the wrong directory. Therefore, when an accident (such as an unstable network or a network failure) occurs, a user (such as an enterprise user) can still read the data normally to perform data processing (such as data analysis, data cleaning and the like).

Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.

Drawings

The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a schematic structural framework diagram of a data synchronization system according to an embodiment of the present application;

FIG. 2 is a schematic structural framework diagram of another data synchronization system, and of its connection to a big data platform, according to an embodiment of the present application;

FIG. 3 is a data flow diagram of data transmission and application in an embodiment of the present application;

FIG. 4 is a schematic structural framework diagram of an electronic device according to an embodiment of the present application;

FIG. 5 is a schematic flowchart of a data synchronization method according to an embodiment of the present application;

FIG. 6 is a partial schematic flowchart of another data synchronization method according to an embodiment of the present application;

FIG. 7 is a schematic flowchart of another data synchronization method according to an embodiment of the present application;

FIG. 8 is a schematic structural framework diagram of a data synchronization apparatus according to an embodiment of the present application;

FIG. 9 is a schematic structural framework diagram of another data synchronization apparatus according to an embodiment of the present application.

Detailed Description

Reference will now be made in detail to the present application, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar parts or parts having the same or similar functions throughout. In addition, if a detailed description of the known art is not necessary for illustrating the features of the present application, it is omitted. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.

It will be understood by those within the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

As used herein, the singular forms "a", "an", "said" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.

The terms referred to in this application will first be introduced and explained:

Data asset: a data resource, such as file data or electronic data, that is owned or controlled by an enterprise, can bring future economic benefits to the enterprise, and is recorded in physical or electronic form. Characteristics of data assets may include: global data coverage, a clear hierarchical structure, accurate and consistent data, improved performance, reduced cost, and ease of use. Data assets are the core product of the data center station: the data of all products on every service line in the enterprise is connected, so that data needs to be processed only once to be reused and shared. This replaces the siloed ("chimney-style") development and storage of each individual service and reduces development cost, manpower and material resources.

The data assets may include data in all or some of the following data layers and dimension tables: ODS (Operational Data Store), DW (Data Warehouse), TDM (Tag Data Model, the tag data layer), ADS (Application Data Store), and DIM (Dimension, the dimension table).

ODS layer: the data tables of the source systems of the data warehouse are usually stored as-is; the layer holding them is called the ODS layer, also known as the staging area, and is the source of the data processed by the subsequent data warehouse layers.

DW layer: comprises a DWD (Data Warehouse Detail) layer and a DWS (Data Warehouse Summary) layer. The DWD layer isolates the service layer from the data warehouse and is the layer closest to the data in the data source: data from the data source passes through ETL (Extract-Transform-Load) and is then loaded into this layer. This layer mainly addresses data quality and data integrity. Tables of the DW layer generally fall into two types: one stores the data that currently needs to be loaded, and the other stores historical data after processing.
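As a rough illustration of the two sub-layers, the following Python sketch derives a detail table (DWD) by dropping incomplete ODS rows and a summary table (DWS) by aggregating the detail rows. The field names, the sample rows and the cleaning rule are assumptions chosen for the example and are not taken from the application.

```python
from collections import Counter

# Hypothetical ODS rows; the field names are illustrative only.
ods_rows = [
    {"reporter": "tv-01", "event_type": "power_on", "event_time": "2020-12-25 08:00:00"},
    {"reporter": "tv-01", "event_type": "power_on", "event_time": "2020-12-25 09:30:00"},
    {"reporter": None, "event_type": "power_on", "event_time": "2020-12-25 10:00:00"},
    {"reporter": "tv-02", "event_type": "power_off", "event_time": "2020-12-25 11:00:00"},
]


def build_dwd(rows):
    """Detail layer: keep only complete rows (a stand-in for data-quality rules)."""
    return [row for row in rows if all(value is not None for value in row.values())]


def build_dws(detail_rows):
    """Summary layer: count events per (reporter, event_type)."""
    return Counter((row["reporter"], row["event_type"]) for row in detail_rows)


dwd = build_dwd(ods_rows)
dws = build_dws(dwd)
```

The row with a missing reporter is filtered out at the detail layer, so it never reaches the summary counts.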

Hive: a basic data warehouse tool for processing structured data in Hadoop, a distributed system infrastructure. Hive is built on top of Hadoop, which makes query and analysis convenient.

Hive provides a simple SQL query function and can convert an SQL (Structured Query Language) statement into a MapReduce task (MapReduce being a programming model for parallel operations on large-scale data sets) for execution. Hive is built on Hadoop's static batch processing, which generally has high latency and incurs considerable overhead when a job is submitted and scheduled. Hive therefore cannot provide low-latency, fast queries on large-scale data sets; for example, a Hive query on a data set of a few hundred MB (megabytes) typically takes on the order of minutes. Hive is thus not suitable for applications that need low latency.
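To make the SQL-to-MapReduce idea concrete, the sketch below mimics in plain Python how a query such as `SELECT event_type, COUNT(*) FROM events GROUP BY event_type` can be expressed as map, shuffle and reduce phases. It is a single-process illustration with hypothetical data and function names, not a description of how Hive compiles queries internally.

```python
from collections import defaultdict

# Hypothetical input rows standing in for a table named "events".
events = [{"event_type": "click"}, {"event_type": "view"}, {"event_type": "click"}]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for row in rows:
        yield row["event_type"], 1


def shuffle_phase(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Aggregate each group; summing the ones implements COUNT(*)."""
    return {key: sum(values) for key, values in groups.items()}


counts = reduce_phase(shuffle_phase(map_phase(events)))
```

In a real cluster the map and reduce phases run on many machines, and the job submission and scheduling around them is where the latency described above comes from.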

HDFS: short for Hadoop Distributed File System. HDFS is highly fault-tolerant, is suitable for deployment on inexpensive machines, and is well suited to persistently storing very large data sets; HDFS also provides high-throughput data access and is well suited to applications on large data sets.

SVN: short for Subversion, an open-source version control system that uses branch management for efficient management. In short, it allows multiple people to develop the same project together, sharing resources under centralized management.

Whether a file is text or binary, SVN internally represents the changed part of the file using a binary differencing algorithm, which means that all files are stored in the repository as differences, and only the small changed parts of files are transmitted over the network. Operations such as creating branches, tagging and merging can therefore be completed very quickly, and managing the directory organization is more convenient. SVN tracks versions not only of files but also of directories, so the directory structure can be modified at any time according to the needs of the project and an existing directory can be moved to a new location, while the integrity of the commit operation is ensured. SVN handles a commit much like a database transaction: it either succeeds completely or has no effect at all, which ensures atomicity.

Kafka: a message engine that supports a cluster mode, in which multiple identical replicas together provide redundancy to ensure high availability of the system. Data can be stored persistently (the retention period is generally set to 7 days), meaning that data sent to Kafka is kept on disk and is not deleted within 7 days. In the core Kafka architecture, producers send messages to Kafka and consumers read messages from the Kafka server. Kafka supports both of the two most common message engine paradigms: the message queue model and the publish/subscribe model.
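The difference between the two paradigms can be sketched with a toy in-memory log in Python. The `MiniBroker` class below is hypothetical and is not the Kafka client API; it only assumes, as Kafka does with consumer groups, that a read position is tracked per group of consumers.

```python
class MiniBroker:
    """Toy in-memory message log (illustrative only, not the Kafka API)."""

    def __init__(self):
        self.log = []      # append-only message log
        self.offsets = {}  # next offset to read, tracked per consumer group

    def send(self, message):
        """Producer side: append a message to the log."""
        self.log.append(message)

    def poll(self, group):
        """Consumer side: return the next unread message for this group, or None."""
        offset = self.offsets.get(group, 0)
        if offset >= len(self.log):
            return None
        self.offsets[group] = offset + 1
        return self.log[offset]


broker = MiniBroker()
for message in ["m1", "m2"]:
    broker.send(message)

# Message queue model: two workers share the group "workers",
# so each message is handled by exactly one of them.
worker_a = broker.poll("workers")
worker_b = broker.poll("workers")

# Publish/subscribe model: a separate group "audit" still receives every message.
audit = [broker.poll("audit"), broker.poll("audit")]
```

Because the messages stay in the log after being read, adding another group later does not take messages away from existing consumers.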

ClickHouse: a column-oriented DBMS (Database Management System), in which the values of each column are stored separately. It allows tables and databases to be created, data to be loaded, and queries to be run at runtime without reconfiguring or restarting the server; it uses data compression to improve performance; it supports storing data on disk, which increases the amount of data that can be stored; and it supports real-time data updates, so data can be continuously added to a table without locking. Its greatest advantages are high query speed and large capacity, up to the PB (petabyte) level. It supports SQL-like queries and cross-region replica deployment, and is well suited as a data source database for data service applications.
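The idea of storing each column separately can be illustrated with a small Python sketch that turns row-oriented records into per-column lists. The sample data and the helper name `to_columns` are assumptions for illustration; the sketch says nothing about ClickHouse's actual on-disk format.

```python
# Hypothetical row-oriented records.
rows = [
    {"device": "tv-01", "resolution": "3840x2160", "size_inch": 65},
    {"device": "tv-02", "resolution": "1920x1080", "size_inch": 55},
    {"device": "tv-03", "resolution": "3840x2160", "size_inch": 75},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one column touches only that column's values.
average_size = sum(columns["size_inch"]) / len(columns["size_inch"])
```

An analytical query that needs one column out of many reads only that list, which is the main reason column-oriented storage suits aggregation-heavy workloads.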

Data center station: uses data technology to acquire, compute, store and process massive data while unifying data standards and calibers; it also covers the model services, algorithm services, organization, processes, standards, specifications, management systems and the like involved in building the data center station. After unifying the data, the data center station forms standard data, stores it as data assets, and provides data service capability through data mining and analysis tools, thereby serving enterprises and customers efficiently.

The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments.

An embodiment of the present application provides a data synchronization system. As shown in fig. 1, the data synchronization system 100 includes: a Hive server 110, a distributed file system (HDFS) 120, and a data center station system 130; both the data center station system 130 and the Hive server 110 are communicatively connected to the distributed file system 120.

The Hive server 110 is used for: receiving the source data layer data sent by the distributed file system 120; generating unified data warehouse layer data based on the source data layer data; and sending the unified data warehouse layer data to the distributed file system 120.

In one embodiment, there may be one Hive server 110. The Hive server 110 may store the source data layer data and the unified data warehouse layer data in a Hive database, create external tables for the two layers of data after sending the unified data warehouse layer data to the distributed file system 120, and use a storage path of the HDFS as the read-write path of each external table.

In another embodiment, there may be two Hive servers 110. On receiving the source data layer data, one Hive server 110 may store it in the Hive database of that server and create an external table ODS_Hive for the data, where the read-write path of the external table ODS_Hive is a storage path of the HDFS. When that Hive server 110 generates the unified data warehouse layer data, the data may be stored in the Hive database of the other Hive server 110; after the unified data warehouse layer data is sent to the distributed file system 120, an external table DW_Hive is created for the data, where the read-write path of the external table DW_Hive is a storage path of the HDFS.

Depending on the size of the data volume, more Hive servers 110 can be arranged for storing data and constructing corresponding external tables according to actual needs.
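An external table whose read-write path is an HDFS storage path could be declared roughly as follows. The Python helper below only assembles a HiveQL `CREATE EXTERNAL TABLE ... LOCATION` statement as text; the column list and the HDFS path are hypothetical, since the application does not give them, and the storage format is likewise an assumption.

```python
def external_table_ddl(table, columns, hdfs_path):
    """Build a HiveQL statement for an external table stored at an HDFS path."""
    column_sql = ",\n  ".join(f"{name} {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {column_sql}\n)\n"
        "STORED AS TEXTFILE\n"  # storage format is an assumption
        f"LOCATION '{hdfs_path}'"
    )


# Hypothetical columns and path, for illustration only.
ods_columns = [("from_system", "STRING"), ("reporter", "STRING"), ("event_time", "STRING")]
ddl = external_table_ddl("ODS_Hive", ods_columns, "hdfs:///warehouse/ods")
```

Because the table is external, dropping it in Hive removes only the table definition and leaves the files under the HDFS path in place.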

Optionally, the Hive server 110 in this embodiment of the application may further generate tag data layer data and application data layer data based on the unified data warehouse layer data, and create a dimension table DIM.

Optionally, as shown in fig. 2, the data synchronization system 100 provided in the embodiment of the present application further includes: a ClickHouse server 140 (a server storing a columnar storage database ClickHouse) and a backend server 150.

The ClickHouse server 140 and the backend server 150 are both in communication connection with the Hive server 110. The ClickHouse server 140 is used for receiving the unified data warehouse layer data sent by the Hive server 110 and storing it in the columnar storage database ClickHouse; the backend server 150 is used for receiving data of at least one layer of the data assets sent by the Hive server 110 and storing the data as metadata of the data assets.

Metadata of the data assets includes hardware data (for example, from terminal devices such as displays, mobile phones and computers) and software data. As shown in fig. 3, the hardware data includes hardware event-tracking ("buried point") data (real-time data) and pre-stored database data (offline data), and may specifically include attribute information of the terminal device itself, such as supported resolution and device size. The software data includes client data and server data, each of which includes software event-tracking data (real-time data) of the corresponding software and pre-stored database software data (offline data), and may specifically include the records and stored files generated by the corresponding client and server software during use.

Referring to fig. 2 and 3, the hardware data and the software data are transmitted to the big data platform 160 through a universal data interface (an API, Application Programming Interface). The big data platform 160 cleans the received hardware data and software data and then sends the cleaned data to the HDFS as metadata of the data assets. The attribute information carried by the universal data interface may include a data source (from system), a data reporter (reporter), a data version number (version), a data sending place (location), a data occurrence timestamp (event time), a data event type (event type), and data detail information (attributes).

So that one interface can be reused to collect real-time data streams from various sources, the universal data interface may define a common standard data format, such as, but not limited to, JSON or XML.
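A record in such a common format might look like the following Python sketch, which models the seven attributes as a dataclass and serializes it to JSON. The underscore-style key names (for example `from_system` for "from system"), the timestamp unit and all sample values are assumptions; the application lists the attributes but does not fix their exact spelling or types.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Event:
    """One record of the universal data interface (key names are assumed)."""

    from_system: str  # data source
    reporter: str     # data reporter
    version: str      # data version number
    location: str     # data sending place
    event_time: int   # data occurrence timestamp (assumed: Unix seconds)
    event_type: str   # data event type
    attributes: dict = field(default_factory=dict)  # data detail information


event = Event(
    from_system="display-firmware",
    reporter="tv-01",
    version="1.0",
    location="lab-a",
    event_time=1608858000,
    event_type="power_on",
    attributes={"resolution": "3840x2160"},
)

payload = json.dumps(asdict(event), sort_keys=True)  # what would travel over the interface
restored = Event(**json.loads(payload))              # what the receiver reconstructs
```

Keeping service-specific details inside `attributes` lets every source share the same outer structure while still carrying its own fields.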

The big data platform 160 includes HDFS and Spark (a fast, general-purpose computing engine designed for large-scale data processing) and has a data consumer installed (which can be implemented with the message engine Kafka). The HDFS in the big data platform 160 may receive and store the hardware data and the software data; Spark may clean the hardware data and the software data; and Kafka may send the cleaned hardware data and software data to the HDFS in the data synchronization system provided in the embodiments of the present application, which may construct the ODS layer data based on the received hardware data and software data.

Both the hardware data and the software data sent by Kafka to the HDFS in the data synchronization system may include a data source, a data reporter, a data version number, a data sending place, a data occurrence timestamp, a data event type, and data detail information.

The hardware data and software data received by the HDFS in the data synchronization system come through the API and share a uniform data format comprising the data source, data reporter, data version number, data sending place, data occurrence timestamp, data event type and data detail information. This simplifies subsequent offline development: whatever the initial data format of each service, once the format is unified a single program can process the data, which reduces the workload and spares developers from having to understand the metadata of each service's data.

An embodiment of the present application provides a distributed file system 120, including: a memory and a processor electrically connected.

The memory stores a computer program executed by the processor to implement a data synchronization method, the details of which will be described later.

Referring to fig. 2, an embodiment of the present application provides a data center station system 130, including: a data center station main server 131.

The data center station main server 131 includes: a memory and a processor electrically connected; the memory stores a computer program that is executed by the processor to implement a data synchronization method, the details of which will be described later.

Optionally, referring to fig. 2, the data center station system 130 provided in the embodiment of the present application further includes: a plurality of data center station sub-servers 132.

Each of the data center station sub-servers 132 is communicatively connected to the data center station main server 131. The data center station main server 131 is used for: sending the requested data to a data center station sub-server 132 in response to a second data call request from that sub-server. Each data center station sub-server 132 is used for: receiving and storing the data sent by the data center station main server 131; and, in response to a third data call request from a user terminal, sending the requested data to the user terminal.

The data center station sub-server 132 in the embodiment of the present application is a local server. The second data call request may be generated by an update operation on the data center station sub-server 132; at each update, the sub-server may send a second data call request to the data center station main server 131 to obtain the data that has been updated on the main server.

Optionally, each data center station sub-server 132 includes: a memory and a processor electrically connected; the memory stores a computer program that is executed by the processor to implement the following data transmission method: receiving and storing the data sent by the data center station main server 131; and, in response to a third data call request from a user terminal, sending the requested data to the user terminal.
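The interaction between the main server, a sub-server and a user terminal can be sketched in Python as two small in-memory classes. The class and method names, and the use of dictionaries as storage, are hypothetical simplifications; in particular, the sketch ignores networking and assumes a request simply names the data it wants.

```python
class MainServer:
    """Stand-in for the data center station main server 131."""

    def __init__(self):
        self.store = {}

    def receive(self, name, data):
        """Store data synchronized from the distributed file system."""
        self.store[name] = data

    def handle_second_call(self, names):
        """Answer a sub-server's (second) data call request."""
        return {name: self.store[name] for name in names if name in self.store}


class SubServer:
    """Stand-in for a data center station sub-server 132 (a local server)."""

    def __init__(self, main):
        self.main = main
        self.store = {}

    def update(self, names):
        """An update operation issues the second data call request and stores the reply."""
        self.store.update(self.main.handle_second_call(names))

    def handle_third_call(self, name):
        """Answer a user terminal's (third) data call request from local storage."""
        return self.store.get(name)


main = MainServer()
main.receive("dw/orders", "v1")
sub = SubServer(main)
sub.update(["dw/orders"])
first_read = sub.handle_third_call("dw/orders")

main.receive("dw/orders", "v2")                  # main server gets newer data
stale_read = sub.handle_third_call("dw/orders")  # sub-server not yet updated
sub.update(["dw/orders"])
fresh_read = sub.handle_third_call("dw/orders")
```

User terminals only ever talk to the local sub-server, so they see new data after the sub-server's next update rather than immediately.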

Developers can access a data center station sub-server 132 through a user terminal to read the uploaded data and apply it, for example in data services such as data development, operation, analysis, query and prediction, as shown in fig. 3.

In the embodiment of the application, the data flow from hardware data and software data to the data application of the developer is shown in fig. 3.

Optionally, the data center station main server 131 and the data center station sub-servers 132 are both SVN servers, i.e., servers on which SVN is installed.

SVN offers comparatively strong security. It combines system and control, so that the security functions of the overall system can be effectively distributed among the branch systems. This ensures the normal operation of each branch system and lets the branch systems complement one another, which ultimately safeguards the integrity of the system as a whole and achieves the security goal through a principle of balance.

Those skilled in the art will appreciate that the distributed file system 120, the data center main server 131 and the data center sub-server 132 provided in the embodiments of the present application may each be specially designed and manufactured for the required purposes, or may also include known devices in general-purpose computers. These devices have stored therein computer programs that are selectively activated or reconfigured. Such a computer program may be stored in a device (e.g., computer) readable medium or in any type of medium suitable for storing electronic instructions and respectively coupled to a bus.

In an alternative embodiment, the present application provides an electronic device that serves as any one of the distributed file system 120, the data center station main server 131, and the data center station sub-server 132. As shown in fig. 4, the electronic device 400 includes a memory 401 and a processor 402 that are electrically connected, for example by a bus 403.

The memory 401 may be a ROM (Read-Only Memory) or another type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or another type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.

The processor 402 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array), or another programmable logic device, transistor logic device, hardware component, or any combination thereof, and may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the present disclosure. The processor 402 may also be a combination that implements computing functions, e.g., a combination of one or more microprocessors, a combination of a DSP and a microprocessor, or the like.

Bus 403 may include a path that transfers information between the above components. The bus may be a PCI (Peripheral Component Interconnect) bus or an EISA (Extended Industry Standard Architecture) bus. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 4, but this does not indicate only one bus or one type of bus.

Optionally, the electronic device 400 may also include a transceiver 404. The transceiver 404 may be used for reception and transmission of signals. The transceiver 404 may allow the electronic device 400 to communicate with other devices, wirelessly or by wire, to exchange data. It should be noted that, in practical applications, the number of transceivers 404 is not limited to one.

Optionally, the electronic device 400 may further include an input unit 405. The input unit 405 may be used to receive input numeric, character, image, and/or sound information or to generate key signal inputs related to user settings and function control of the electronic device 400. The input unit 405 may include, but is not limited to, one or more of a touch screen, a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, a camera, a microphone, and the like.

Optionally, the electronic device 400 may further include an output unit 406. Output unit 406 may be used to output or present information processed by processor 402. The output unit 406 may include, but is not limited to, one or more of a display device, a speaker, a vibration device, and the like.

While fig. 4 illustrates an electronic device 400 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.

Based on the same inventive concept, the embodiment of the present application provides a data synchronization method, which can be applied to the foregoing data synchronization system 100, as shown in fig. 5, the data synchronization method includes:

S510, the distributed file system 120 sends the source data layer data to the Hive server 110.

Optionally, the source data layer may be constructed by:

receiving hardware data and software data sent by the message engine Kafka; creating a source data table comprising a data source, a data reporter, a data version number, a data sending place, a data occurrence timestamp, a data event type and data detail information; and writing the hardware data and the software data into the position of the source data table that corresponds to the current date, i.e. the date given by the data occurrence timestamp of the hardware data and the software data, to obtain the source data layer.
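The date-based placement described above resembles writing each record into a per-day partition. The Python sketch below groups incoming records by the calendar date of their occurrence timestamp. It assumes the timestamp is in Unix seconds and is interpreted in UTC, and the record keys and sample values are hypothetical; none of this is specified in the application.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_by_date(records):
    """Group records under the date (YYYY-MM-DD, UTC) of their event_time."""
    table = defaultdict(list)
    for record in records:
        moment = datetime.fromtimestamp(record["event_time"], tz=timezone.utc)
        table[moment.strftime("%Y-%m-%d")].append(record)
    return dict(table)


# Hypothetical hardware and software records received from the message engine.
incoming = [
    {"from_system": "hardware", "event_type": "power_on", "event_time": 1608858000},
    {"from_system": "software", "event_type": "login", "event_time": 1608861600},
    {"from_system": "software", "event_type": "logout", "event_time": 1608944400},
]

ods_table = partition_by_date(incoming)
```

Here the first two records fall on 2020-12-25 and the third on 2020-12-26, so they end up in two separate date positions of the source data table.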

S520, the Hive server 110 receives the source data layer data, generates unified data warehouse layer data based on the source data layer data, and sends the unified data warehouse layer data to the distributed file system 120.

Optionally, the unified data warehouse layer may be constructed based on Kimball's dimensional modeling theory.

S530, the distributed file system 120 receives the unified data warehouse layer data fed back by the Hive server 110.

S540, the distributed file system 120 sends the source data layer data and the unified data warehouse layer data to the data center system 130.

Specifically, the distributed file system 120 sends the source data layer data and the unified data warehouse layer data to the data center main server 131.

S550, the data center system 130 receives and stores the source data layer data and the unified data warehouse layer data sent by the distributed file system 120.

Optionally, the data center main server 131 receives and stores the source data layer data and the unified data warehouse layer data sent by the distributed file system 120.

Optionally, the data synchronization method provided in the embodiment of the present application further includes:

the data center main server 131 sends the requested data to the data center sub-server 132 in response to a second data call request from the data center sub-server 132; the data center sub-server 132 receives and stores the data sent by the data center main server 131, and sends the requested data to a user terminal in response to a third data call request from the user terminal.

Optionally, as shown in fig. 6, when the distributed file system 120 sends the source data layer data and the unified data warehouse layer data to the data center system 130, the following steps S541-S544 are specifically performed:

S541, traverse the files in the present system and obtain the message digest value of each traversed file.

The message digest value may be any one of an MD5 (MD5 Message-Digest Algorithm) value, an MD4 (MD4 Message-Digest Algorithm) value, and the like.

Optionally, a Python (a cross-platform computer programming language) script may be used to traverse the files in the present system.

Optionally, the files in the present system may be traversed recursively. Specifically, it is determined whether the current file is a folder; if the current file is a folder, the files in the folder are traversed; if the current file is not a folder, traversal continues with the next file.
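
The recursive traversal and digest computation of step S541 might look like the following Python sketch. It assumes a local directory tree rather than a real distributed file system (which would need its own client library), uses MD5 as one of the permitted digest algorithms, and uses the hypothetical function names `file_md5` and `walk_digests`.

```python
import hashlib
import os
from typing import Iterator, Tuple


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def walk_digests(root: str) -> Iterator[Tuple[str, str]]:
    """Recursively yield (path, digest) for every regular file under root."""
    for entry in sorted(os.scandir(root), key=lambda e: e.name):
        if entry.is_dir(follow_symlinks=False):
            # The current entry is a folder: traverse the files inside it.
            yield from walk_digests(entry.path)
        elif entry.is_file(follow_symlinks=False):
            # The current entry is a file: emit its digest, then move on.
            yield entry.path, file_md5(entry.path)
```

Reading in chunks keeps memory use bounded for large files; sorting the entries only makes the traversal order deterministic and is not required by the method.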

Optionally, traversing the files in the present system includes either of the following two ways: first, traversing the files stored in the present system itself, namely the distributed file system 120; second, traversing the data directory of the external tables of the Hive database.

S542, determine whether the message digest value already exists in the stored message digest list; if so, perform step S543; if not, perform step S544.

S543, refrain from sending the data of the file to which the message digest value belongs to the data center system 130.

S544, send the data of the file to which the message digest value belongs to the data center system 130, and add the message digest value to the message digest list.

Traversing and sending data in this way achieves deduplication: the message digest value of each piece of sent data is added to the message digest list, so whether data to be sent is a duplicate can subsequently be judged against the values in the list, and new data is sent to the data center system 130 only when it is not a duplicate. This reduces invalid data transmission and improves data transmission efficiency.
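
Steps S542 to S544 can be sketched in Python as below. The sketch assumes that the message digest list is held as an in-memory set and that transmission is represented by a caller-supplied `send` callback; both are assumptions, as is the function name `sync_new_files`.

```python
from typing import Callable, Iterable, List, Set, Tuple


def sync_new_files(
    files: Iterable[Tuple[str, str]],
    sent_digests: Set[str],
    send: Callable[[str], None],
) -> List[str]:
    """Send only files whose digest is not yet in the digest list.

    files        -- (path, digest) pairs, e.g. from a traversal step
    sent_digests -- the stored message digest list, updated in place
    send         -- hypothetical callback that transmits one file
    """
    sent_now = []
    for path, digest in files:
        if digest in sent_digests:
            # S543: the digest is already listed, so the file is not sent.
            continue
        # S544: send the file, then record its digest in the list.
        send(path)
        sent_digests.add(digest)
        sent_now.append(path)
    return sent_now
```

Because the digest is recorded immediately after each send, two files with identical content in the same pass are transmitted only once. Persisting the digest list between runs is left out of the sketch.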

Optionally, as shown in fig. 7, the data synchronization method provided in the embodiment of the present application further includes, on the basis of the above steps S510 to S550, the following steps S560 to S590:

S560, the data center system 130 traverses the files in the present system and determines whether the present system has added or reduced data relative to the distributed file system 120; if data has been added, step S570 is performed; if data has been reduced, step S580 is performed.

S570, the data center system 130 deletes the added data from the present system.

S580, the data center system 130 sends a first data call request to the distributed file system 120 based on the reduced data.

The first data call request may be generated by an update operation on the data center main server 131; at each update, the data center main server 131 may send the first data call request to the distributed file system 120 to obtain the updated data in the distributed file system 120.

S590, the distributed file system 120 sends the requested data to the data center system 130 in response to the first data call request sent by the data center system 130.

The names of the added data or the reduced data may be recorded in a file, and the name of the data to be deleted or to be added may then be looked up in that file so that the corresponding data can be deleted or requested. In this way the data in the data center system 130 is made identical to the data in the distributed file system 120, and data synchronization between the data center system 130 and the distributed file system 120 is achieved.
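
A minimal Python sketch of steps S560 to S590 follows. It assumes that both systems can be modelled as dictionaries mapping a data name to its content, that the comparison is by name only, and that the first data call request is represented by a direct lookup in the source dictionary; the names `diff_listing` and `reconcile` are hypothetical.

```python
from typing import Dict, Set, Tuple


def diff_listing(backup_names: Set[str], source_names: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (added, reduced) names of the backup relative to the source."""
    added = backup_names - source_names      # present only in the backup
    reduced = source_names - backup_names    # absent from the backup
    return added, reduced


def reconcile(backup: Dict[str, bytes], source: Dict[str, bytes]) -> Dict[str, bytes]:
    """Make the backup hold exactly the names that the source holds."""
    added, reduced = diff_listing(set(backup), set(source))
    for name in added:
        # S570: delete data that exists only in the backup.
        del backup[name]
    for name in reduced:
        # S580/S590: request the reduced data; modelled here as a lookup.
        backup[name] = source[name]
    return backup
```

A name-only comparison does not detect files whose content changed under the same name; comparing message digest values as well would be one way to cover that case, though the application does not state this.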

Based on the same inventive concept, the embodiment of the present application provides a data synchronization apparatus, which can be applied to the distributed file system 120. As shown in fig. 8, the data synchronization apparatus 800 includes: a first data sending module 801, a data receiving module 802, and a second data sending module 803.

The first data sending module 801 is configured to: send the source data layer data to the Hive server 110 of the data warehouse tool.

The data receiving module 802 is configured to: receive the unified data warehouse layer data that the Hive server 110 generates based on the source data layer data and feeds back.

The second data sending module 803 is configured to: send the source data layer data and the unified data warehouse layer data to the data center system 130.

Optionally, the second data sending module 803 is specifically configured to: send the source data layer data and the unified data warehouse layer data to the data center main server in the data center system.

Optionally, the second data sending module 803 is specifically configured to: traverse the files in the distributed file system 120 and obtain the message digest value of each traversed file; determine whether the message digest value already exists in a stored message digest list; if the message digest value already exists in the stored message digest list, refrain from sending the data of the file to which the message digest value belongs to the data center system 130; if the message digest value does not exist in the stored message digest list, send the data of the file to which the message digest value belongs to the data center system 130, and add the message digest value to the message digest list.

Optionally, the second data sending module 803 is further configured to: in response to the first data call request sent by the data center system 130 after traversing its files, send the requested data to the data center system 130.

Based on the same inventive concept, the present application provides another data synchronization apparatus, which can be applied to the data center system 130. As shown in fig. 9, the data synchronization apparatus 900 includes: a data receiving module 901.

The data receiving module 901 is configured to: receive and store the source data layer data and the unified data warehouse layer data sent by the distributed file system 120; the unified data warehouse layer data is generated by the Hive server 110 based on the source data layer data and fed back to the distributed file system 120.

Optionally, the data synchronization apparatus applied to the data center system 130 according to the embodiment of the present application further includes: a file traversal module, a data deletion module and a data sending module.

The file traversal module is configured to: traverse the files in the data center system 130 to determine whether the data center system 130 has added or reduced data relative to the distributed file system 120. The data deletion module is configured to: upon determining that the data center system 130 has added data relative to the distributed file system 120, delete the added data from the data center system 130. The data sending module is configured to: upon determining that the data center system 130 has reduced data relative to the distributed file system 120, send a first data call request to the distributed file system 120 based on the reduced data.

The data receiving module 901 is further configured to receive the data sent by the distributed file system 120 in response to the first data call request.

The data synchronization apparatus of the embodiments of the present application can execute any data synchronization method provided by the embodiments of the present application, and its implementation principle is similar. For content not described in detail in this embodiment, reference may be made to the foregoing embodiments, and details are not repeated here.

Based on the same inventive concept, embodiments of the present application provide a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements any one of the data synchronization methods provided by the embodiments of the present application.

The computer-readable medium includes, but is not limited to, any type of disk (including floppy disks, hard disks, optical disks, CD-ROMs, and magneto-optical disks), ROMs, RAMs, EPROMs (Erasable Programmable Read-Only Memory), EEPROMs, flash memory, magnetic cards, or optical cards. That is, a readable medium includes any medium that stores or transmits information in a form readable by a device (e.g., a computer).

The embodiment of the present application provides a computer-readable storage medium suitable for any one of the above data synchronization methods, which is not described herein again.

By applying the technical scheme of the embodiment of the application, at least the following beneficial effects can be realized:

1) According to the embodiments of the present application, unified data warehouse layer data can be generated based on the source data layer data in the distributed file system, and the source data layer data and the unified data warehouse layer data can be synchronized to the data center system. This provides a safe backup of the data in the distributed file system and effectively prevents accidental deletion of data and data being placed in the wrong directory. Therefore, when an accident occurs (such as an unstable network or a network failure), a user (such as an enterprise user) can still read the data normally for data processing (such as data analysis, data cleaning and the like).

2) When data is synchronized and backed up to the data center system, the files of the distributed file system can be traversed and the message digest value of each file obtained; whether the data of the current file has already been synchronized and backed up to the data center system is determined based on the message digest value, and only data that has not yet been synchronized and backed up is sent to the data center system. This deduplicates the data and improves the efficiency of data synchronization and backup.

3) According to the embodiments of the present application, after the source data layer data and the unified data warehouse layer data generated from it have been synchronized and backed up to the data center system, the files and data of the data center system can be traversed to determine the difference between the data backed up in the data center system and the data stored in the distributed file system. By deleting the data added in the data center system, or by calling the reduced data from the distributed file system, the backed-up data in the data center system is kept consistent with the data stored in the distributed file system, which further improves the data synchronization and backup of the data center system and further improves data security.

4) The data center main server in the embodiments of the present application can communicate with the distributed file system to back up the data in the distributed file system; a local data center sub-server obtains the backup data from the data center main server and communicates directly with the user terminal, so that a user (such as an enterprise user) can conveniently access the data center sub-server through the user terminal to read the required data. Thus, when an accident occurs (such as an unstable or unavailable network), the user can still read the data normally for data processing (such as data analysis, data cleaning and the like), and the situation in which data cannot be accessed or is lost is avoided.

Those of skill in the art will appreciate that the various operations, methods, steps in the processes, acts, or solutions discussed in this application can be interchanged, modified, combined, or eliminated. Further, other steps, measures, or schemes in various operations, methods, or flows that have been discussed in this application can be alternated, altered, rearranged, broken down, combined, or deleted. Further, steps, measures, schemes in the prior art having various operations, methods, procedures disclosed in the present application may also be alternated, modified, rearranged, decomposed, combined, or deleted.

In the description of the present application, it is to be understood that the terms "first", "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the present application, "a plurality" means two or more unless otherwise specified.

It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, the steps are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different times, and which are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

The foregoing describes only some of the embodiments of the present application. It should be noted that those skilled in the art can make several improvements and refinements without departing from the principle of the present application, and these improvements and refinements should also be regarded as falling within the protection scope of the present application.

Claims

1. A data synchronization method, applied to a distributed file system, comprising:

sending source data layer data to a Hive server of a data warehouse tool;

receiving unified data warehouse layer data generated and fed back by the Hive server based on the source data layer data;

and sending the source data layer data and the unified data warehouse layer data to a data center system.

2. The data synchronization method of claim 1, wherein the sending the source data layer data and the unified data warehouse layer data to a data center system comprises:

traversing files in the distributed file system, and obtaining the message digest value of each traversed file;

determining whether the message digest value already exists in a stored message digest list;

if the message digest value already exists in the stored message digest list, refraining from sending the data of the file to which the message digest value belongs to the data center system;

and if the message digest value does not exist in the stored message digest list, sending the data of the file to which the message digest value belongs to the data center system, and adding the message digest value to the message digest list.

3. The data synchronization method according to claim 1 or 2, wherein after the sending the source data layer data and the unified data warehouse layer data to the data center system, the method further comprises:

in response to a first data call request sent by the data center system after traversing its files, sending the requested data to the data center system.

4. The data synchronization method according to claim 1 or 2, wherein the data center system comprises: a data center main server;

the sending the source data layer data and the unified data warehouse layer data to a data center system comprises:

sending the source data layer data and the unified data warehouse layer data to the data center main server in the data center system;

and the data center main server is a version control system SVN server.

5. A distributed file system, comprising: a memory and a processor electrically connected;

the memory stores a computer program for execution by the processor to implement the data synchronization method of any of claims 1-4.

6. A data synchronization method, applied to a data center system, comprising:

receiving and storing source data layer data and unified data warehouse layer data sent by a distributed file system; the unified data warehouse layer data is generated by a Hive server based on the source data layer data and fed back to the distributed file system.

7. The data synchronization method according to claim 6, wherein after receiving the source data layer data and the unified data warehouse layer data sent by the distributed file system, the method further comprises:

traversing files in the data center system, and determining whether the data center system has added or reduced data relative to the distributed file system;

if the data center system has added data relative to the distributed file system, deleting the added data from the data center system;

if the data center system has reduced data relative to the distributed file system, sending a first data call request to the distributed file system based on the reduced data.

8. The data synchronization method according to claim 6, wherein the receiving and storing the source data layer data and the unified data warehouse layer data sent by the distributed file system comprises:

a data center main server in the data center system receives and stores the source data layer data and the unified data warehouse layer data sent by the distributed file system;

and the data center main server is an SVN server.

9. The data synchronization method of claim 6, further comprising:

the data center main server, in response to a second data call request from a data center sub-server in the data center system, sends the requested data to the data center sub-server;

the data center sub-server receives and stores the data sent by the data center main server and, in response to a third data call request from a user terminal, sends the requested data to the user terminal;

and the data center sub-server is an SVN server.

10. A data center system, comprising: a data center main server;

the memory stores a computer program for execution by the processor to implement the data synchronization method of any of claims 6-9.

11. The data center system of claim 10, further comprising: a plurality of data center sub-servers;

each data center sub-server is in communication connection with the data center main server;

the data center main server is configured to: in response to a second data call request from a data center sub-server, send the requested data to the data center sub-server;

the data center sub-server is configured to: receive and store the data sent by the data center main server; and, in response to a third data call request from a user terminal, send the requested data to the user terminal.

12. The data center system according to claim 11, wherein the data center main server and the data center sub-servers are version control system SVN servers.

13. A data synchronization system, comprising: a Hive server, a distributed file system as claimed in claim 5 and a data center system as claimed in any one of claims 10 to 12;

the Hive server is configured to: receive source data layer data sent by the distributed file system, generate unified data warehouse layer data based on the source data layer data, and send the unified data warehouse layer data to the distributed file system.

14. A computer-readable storage medium, in which a computer program is stored, which computer program, when executed by a processor, implements: the data synchronization method of any one of claims 1-4 or the data synchronization method of any one of claims 6-9.