WO2022021710A1 - Data dump method and apparatus, device, and storage medium - Google Patents

Data dump method and apparatus, device, and storage medium

Info

Publication number
WO2022021710A1
Authority
WO
WIPO (PCT)
Prior art keywords
target data
data
map
hierarchical structure
shuffled
Prior art date
Application number
PCT/CN2020/132377
Other languages
French (fr)
Chinese (zh)
Inventor
宋大伟
丁静
Original Assignee
苏州亿歌网络科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州亿歌网络科技有限公司
Publication of WO2022021710A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/221 Column-oriented storage; Management thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2255 Hash tables

Definitions

  • the embodiments of the present application relate to the technical field of databases, and in particular, to a data dumping method, apparatus, device, and storage medium based on a Map/Reduce architecture.
  • a database is a "warehouse that organizes, stores and manages data according to the data structure". It has a large storage space and can store millions, tens of millions, or hundreds of millions of pieces of data. It is suitable for various types of business data, such as game platform data and medical system data. However, as the amount of data stored in the database keeps increasing, how to improve the efficiency of data query and use is an urgent problem to be solved.
  • Embodiments of the present application provide a data dumping method, apparatus, device, and storage medium based on a Map/Reduce architecture, so as to improve data query and use efficiency.
  • an embodiment of the present application provides a data dump method based on a Map/Reduce architecture, including:
  • the Map terminal obtains the target data, and shuffles the target data according to the target data grading framework; wherein, the target data is stored in the form of row storage;
  • the Reduce side aggregates and stores the shuffled target data in the form of columnar storage according to the hierarchical structure of the target data.
  • the Map terminal shuffles the target data according to the target data grading framework, including:
  • the Map terminal parses the key of the target data
  • the Map terminal sorts the keys of the target data according to the target data grading framework to obtain the shuffled target data.
  • the Map terminal sorts the keys of the target data according to the target data grading framework, including:
  • the Map terminal sorts the hash values corresponding to the keys of the target data according to the target data grading framework.
  • the Reduce side aggregates and stores the shuffled target data in the form of columnar storage according to the target data hierarchical structure, including:
  • the Reduce side obtains a preconfigured data hierarchy as the target data hierarchy
  • the Reduce side aggregates and stores the shuffled target data in the form of columnar storage according to the target data hierarchical structure.
  • the columnar storage form includes parquet columnar storage.
  • the Reduce side aggregates and stores the shuffled target data in the form of columnar storage according to the target data hierarchical structure, including:
  • the Reduce side aggregates and stores the shuffled target data according to the hierarchical structure of the target data and the specification of parquet columnar storage.
  • an embodiment of the present application further provides a data dump device based on a Map/Reduce architecture, including:
  • the Map-side processing module is used for the Map-side to obtain target data, and shuffle the target data according to the target data grading framework; wherein, the target data is stored in the form of row storage;
  • the processing module on the Reduce side is used for the Reduce side to aggregate and store the shuffled target data in the form of columnar storage according to the hierarchical structure of the target data.
  • the Map-side processing module includes: a key parsing unit and a data shuffling unit, wherein,
  • the key parsing unit is used for the Map end to parse the key of the target data;
  • the data shuffling unit is used for the Map end to sort the keys of the target data according to the target data grading framework to obtain the shuffled target data.
  • the data shuffling unit is specifically configured for the Map side to sort the hash values corresponding to the keys of the target data according to the target data grading framework.
  • the Reduce side processing module includes: a target data hierarchical structure acquisition unit and a data summary storage unit, wherein,
  • the target data hierarchical structure acquisition unit is used for the Reduce side to acquire a pre-configured data hierarchical structure as the target data hierarchical structure;
  • the data summary storage unit is used for the Reduce side to summarize and store the shuffled target data in the form of columnar storage according to the hierarchical structure of the target data.
  • the columnar storage form includes parquet columnar storage.
  • the Reduce side processing module is specifically used for the Reduce side to aggregate and store the shuffled target data according to the target data hierarchical structure and the specification of parquet columnar storage.
  • an embodiment of the present application further provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the data dump method based on the Map/Reduce architecture described in any embodiment of the present application.
  • an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the data dump method based on the Map/Reduce architecture described in any embodiment of the present application.
  • in the embodiments of the present application, after the Map side obtains the data stored in the form of row storage, it shuffles the data according to the target data grading framework; after the Reduce side reads the shuffled data, it aggregates and stores the data in the form of columnar storage according to the target data hierarchical structure. The dumped data is therefore not only stored in the form of columnar storage but also achieves a multi-output effect that matches the target data hierarchical structure, which improves data query and usage efficiency.
  • FIG. 1 is a flowchart of a data dump method based on Map/Reduce architecture in Embodiment 1 of the present application;
  • FIG. 2 is a flowchart of a data dump method based on the Map/Reduce architecture in Embodiment 2 of the present application;
  • FIG. 3 is a flowchart of a data dump method based on the Map/Reduce architecture in Embodiment 3 of the present application;
  • FIG. 4 is a schematic structural diagram of a data dump device based on the Map/Reduce architecture in Embodiment 4 of the present application;
  • FIG. 5 is a schematic diagram of a hardware structure of a computer device in Embodiment 5 of the present application.
  • FIG. 1 is a flowchart of a data dump method based on the Map/Reduce architecture provided in the first embodiment of the present application, which can be applied to dump specific business data (for example, game platform data) to improve its query and use efficiency.
  • the method may be executed by the data dump device based on the Map/Reduce architecture provided by the embodiments of the present application, and the device may be implemented in software and/or hardware, and may generally be integrated in a processor.
  • the data dump method based on the Map/Reduce architecture provided in this embodiment specifically includes:
  • the Map terminal acquires the target data, and shuffles the target data according to the target data grading framework; wherein, the target data is stored in the form of row storage.
  • Target data refers to the data stored in the row database that needs to be dumped to improve data query and usage efficiency.
  • the target data may be offline data stored in a row database within a set period of time (for example, within a month).
  • a row-based database stores data row by row; it is good at random read operations but is not well suited to big data analysis.
  • Traditional databases such as SQL Server, Oracle, and MySQL all belong to the category of row-based databases.
  • logically, such a database presents data as two-dimensional tables of rows and columns, but physically it stores the data as one-dimensional strings.
  • the following table 1 is taken as an example for simplified explanation.
  • a row-based database stores the data values in one row together, and then stores the data in the next row, and so on, that is:
  • the target data grading framework can refer to a general data grading framework under the Map/Reduce architecture based on HDFS (Hadoop Distributed File System), or it can refer to a custom data grading framework under the HDFS-based Map/Reduce architecture.
  • a custom data grading framework can match the target data hierarchy described below, so that the ordering of the shuffled data matches the target data hierarchy. Specifically, the shuffling of the target data is carried out under the target data grading framework.
  • Map tasks and Reduce tasks are executed on different nodes. After each Map task is executed, an output file is generated and the data is pulled by the Reduce task, that is, the shuffled target data is pulled.
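  • as an illustration only, the hand-off described above (each Map task produces a sorted output, and the Reduce task pulls and merges those outputs) can be sketched in plain Python. The record fields, the composite key (game, event, date), and the function names `run_map_task` and `reduce_pull` are assumptions made for this sketch; they are not taken from the application and do not use the Hadoop API.

```python
import heapq

def run_map_task(rows):
    """One Map task: emit (key, value) pairs and sort them by key.

    The sorted list stands in for the output file of the Map task.
    """
    pairs = [((game, event, date), user_id) for game, event, date, user_id in rows]
    return sorted(pairs)

def reduce_pull(map_outputs):
    """The Reduce task pulls every Map output and merges the sorted runs."""
    return list(heapq.merge(*map_outputs))

# Two hypothetical input splits of row-oriented records.
split_1 = [("game_b", "login", "2020-07-01", 7), ("game_a", "consume", "2020-07-01", 3)]
split_2 = [("game_a", "login", "2020-07-02", 3), ("game_a", "consume", "2020-07-01", 9)]

map_outputs = [run_map_task(split_1), run_map_task(split_2)]
merged = reduce_pull(map_outputs)
```

  • because every Map output is already sorted, the merge on the Reduce side only has to interleave the runs, which is why the data arrives at the Reduce side in key order.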
  • the Reduce side aggregates and stores the shuffled target data in a columnar storage form according to the target data hierarchical structure.
  • columnar storage stores the data values in one column together, and then stores the data in the next column, and so on, that is:
  • in a column-based database, data is stored with the column as the basic logical storage unit, and the data in a column is stored contiguously on the storage medium.
  • the query response time is significantly reduced, and data can be searched efficiently within a data column without maintaining an index (any column can be used as an index), which minimizes irrelevant input and output during the query process and avoids a full table scan.
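  • the difference between the two layouts can be sketched with a small in-memory example. The table contents and column names below are hypothetical and are not the Table 1 of the application; the sketch only shows the order in which values are laid out and why a single column can be read as one contiguous slice.

```python
# Hypothetical three-row table; column names and values are illustrative only.
COLUMNS = ("user_id", "event", "amount")
TABLE = [
    (1, "login", 0),
    (2, "consume", 30),
    (3, "consume", 12),
]

def row_layout(table):
    """Row storage: all values of row 1, then all values of row 2, and so on."""
    return [value for row in table for value in row]

def column_layout(table):
    """Columnar storage: all values of column 1, then column 2, and so on."""
    return [value for column in zip(*table) for value in column]

def read_column(flat_columnar, name, n_rows):
    """Read one column as a contiguous slice without touching other columns."""
    start = COLUMNS.index(name) * n_rows
    return flat_columnar[start:start + n_rows]

rows_flat = row_layout(TABLE)
cols_flat = column_layout(TABLE)
amounts = read_column(cols_flat, "amount", len(TABLE))
```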
  • the target data hierarchical structure refers to a data hierarchical structure that matches the business analysis of target business data (for example, game platform business data, medical system data).
  • for example, the first data layer is the game name, the second data layer is the event type (such as login events, consumption events, interactive events, etc.), and the third data layer is the date.
  • as another example, the first data layer is the game name, the second data layer is the user ID, the third data layer is the event type, and the fourth data layer is the date.
  • the data hierarchy is matched to business analysis and can be configured in real time.
  • the user can configure the data hierarchy in real time according to actual big data analysis requirements or business analysis requirements, so that the data aggregated and output by the Reduce side matches the configured data hierarchy.
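  • one way to picture a configurable data hierarchy is as an ordered list of field names from which a grouping key is built. The sketch below assumes a dictionary of named configurations (`HIERARCHIES`) and a helper `hierarchy_key`; both names, and the record fields, are hypothetical and only mirror the layer examples given in the text.

```python
# Two hypothetical hierarchy configurations, mirroring the layer examples in the text.
HIERARCHIES = {
    "by_event": ("game", "event_type", "date"),
    "by_user": ("game", "user_id", "event_type", "date"),
}

def hierarchy_key(record, hierarchy_name):
    """Build the grouping key of a record from the selected hierarchy."""
    layers = HIERARCHIES[hierarchy_name]
    return tuple(str(record[layer]) for layer in layers)

record = {"game": "game_a", "user_id": 3, "event_type": "login", "date": "2020-07-02"}
key_by_event = hierarchy_key(record, "by_event")
key_by_user = hierarchy_key(record, "by_user")
```

  • switching the selected configuration changes only the key, so the same records can be aggregated under a different hierarchy without changing the rest of the pipeline.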
  • the Reduce side can aggregate and store the shuffled target data in the form of columnar storage according to the target data hierarchical structure, specifically:
  • the Reduce side obtains a preconfigured data hierarchy as the target data hierarchy
  • the Reduce side aggregates and stores the shuffled target data in the form of columnar storage according to the target data hierarchical structure.
  • multiple data hierarchical structures can be pre-configured and stored; the Reduce side obtains the data hierarchical structure selected by the user as the target data hierarchical structure, and then aggregates and stores the shuffled target data in the form of columnar storage according to the target data hierarchical structure.
  • the columnar storage form may be parquet columnar storage.
  • Parquet is a columnar storage format capable of efficiently storing nested data.
  • a Parquet file consists of a header, one or more blocks that follow it, and a footer at the end.
  • the file header contains only a magic number that identifies the file as a Parquet file. Each file block is responsible for storing a row group; a row group consists of column chunks, and each column chunk is responsible for storing one column of data. The data in each column chunk is organized in units of pages.
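  • the nesting of row groups, column chunks, and pages can be modelled with ordinary lists. This is a simplified model under stated assumptions, not a Parquet writer: real Parquet sizes row groups and pages in bytes and adds encoding, compression, and footer metadata, whereas the hypothetical `to_row_groups` helper below simply counts rows and values.

```python
def to_row_groups(rows, columns, rows_per_group, values_per_page):
    """Model the nesting block (row group) -> column chunk -> pages."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        group_rows = rows[start:start + rows_per_group]
        chunks = {}
        for index, name in enumerate(columns):
            values = [row[index] for row in group_rows]
            # Pages are modelled as fixed-size slices of the column chunk.
            chunks[name] = [
                values[i:i + values_per_page]
                for i in range(0, len(values), values_per_page)
            ]
        groups.append(chunks)
    return groups

# Hypothetical columns and rows.
COLUMNS = ("user_id", "amount")
ROWS = [(1, 0), (2, 30), (3, 12), (4, 5), (5, 8)]
layout = to_row_groups(ROWS, COLUMNS, rows_per_group=4, values_per_page=2)
```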
  • the Reduce side aggregates and stores the shuffled target data in the form of columnar storage according to the hierarchical structure of the target data, which may be specifically:
  • the Reduce side aggregates and stores the shuffled target data according to the hierarchical structure of the target data and the specification of parquet columnar storage.
  • the Reduce side collects and stores the shuffled target data according to the hierarchical structure of the target data that matches the business analysis and according to the parquet columnar storage specification (schema). That is, the data at each level is specified and stored, and the target data can be displayed to the user in the form of multiple outputs. For example, the data of a certain event of a certain game is designated to be stored, the data of a certain event of a certain game of a certain day is designated to be stored, and so on. Furthermore, when performing big data analysis, for example, when analyzing the data of a certain event of a certain game on a certain day, the data can be quickly obtained at the designated storage location, thereby improving the efficiency of data query and use.
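  • the multiple-output idea (one designated storage location per hierarchy level) can be sketched as follows. The path layout, the `part.parquet` file name, and the value columns are assumptions for illustration, and the in-memory dictionaries stand in for columnar files; no actual Parquet file is written.

```python
from collections import defaultdict

COLUMNS = ("user_id", "amount")  # hypothetical value columns

def output_paths(key, min_depth=2):
    """One output location per hierarchy level, e.g. game/event and game/event/date."""
    return [
        "/".join(key[:depth]) + "/part.parquet"
        for depth in range(min_depth, len(key) + 1)
    ]

def reduce_multi_output(shuffled):
    """Collect values column by column under every matching output location."""
    outputs = defaultdict(lambda: {name: [] for name in COLUMNS})
    for key, values in shuffled:
        for path in output_paths(key):
            for row in values:
                for name, value in zip(COLUMNS, row):
                    outputs[path][name].append(value)
    return dict(outputs)

# Hypothetical shuffled input: keys follow the hierarchy game/event/date.
shuffled = [
    (("game_a", "consume", "2020-07-01"), [(3, 30), (9, 12)]),
    (("game_a", "consume", "2020-07-02"), [(3, 5)]),
]
outputs = reduce_multi_output(shuffled)
```

  • a query for one event of one game on one day then reads only the location for that level, while a query across all days reads the shallower location.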
  • when the Reduce side aggregates and stores the shuffled target data according to the parquet columnar storage specification, it specifically sets the suffix of the generated data file to the Parquet file suffix, compresses the data according to the compression method of Parquet columnar storage (such as the compression ratio, etc.), and so on.
  • in this embodiment, after the Map side obtains the data stored in the form of row storage, it shuffles the data according to the target data grading framework; after the Reduce side reads the shuffled data, it aggregates and stores the data in the form of columnar storage according to the target data hierarchical structure. The dumped data is therefore not only stored in the form of columnar storage but also achieves a multi-dimensional output effect that matches the target data hierarchical structure, which improves data query and usage efficiency.
  • FIG. 2 is a flowchart of a data dump method based on a Map/Reduce architecture provided in Embodiment 2 of the present application. This embodiment is optimized on the basis of the above-mentioned embodiment, wherein the Map terminal shuffling the target data according to the target data grading framework is specifically:
  • the Map terminal parses the keys of the target data; the Map terminal sorts the keys of the target data according to the target data grading framework to obtain the shuffled target data.
  • the data dump method based on the Map/Reduce architecture provided by this embodiment specifically includes:
  • the Map terminal obtains target data, where the target data is stored in a row-based storage format.
  • the Map terminal parses the key of the target data.
  • a key/value pair store is the simplest form of organization for a database. After the Map task runs, the target data is parsed into the form of key-value pairs, and the key and value values of the data can be obtained.
  • the Map terminal sorts the keys of the target data according to the target data grading framework.
  • the value corresponding to a key can be quickly found through the key, so the shuffling and sorting of the target data can be realized by sorting the keys according to the target data grading framework. As a result, before the data reaches the Reduce side, it has already been sorted by key based on the target data grading framework.
  • the key ordering can be related to the encapsulation type of the key. For example, if the key is of the IntWritable type, which encapsulates an int, the keys can be sorted by numerical value according to the target data grading framework; if the key is of the Text type, which encapsulates a String, the keys can be sorted in lexicographical order according to the target data grading framework.
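  • the effect of the key type on ordering can be seen in a short Python sketch, in which plain integers stand in for IntWritable keys and plain strings stand in for Text keys (an analogy only, not the Hadoop classes themselves).

```python
int_keys = [10, 9, 100]           # stand-in for IntWritable keys
text_keys = ["10", "9", "100"]    # stand-in for Text keys

sorted_ints = sorted(int_keys)    # numeric order: 9, 10, 100
sorted_text = sorted(text_keys)   # lexicographic order: "10", "100", "9"
```

  • the same three values therefore reach the Reduce side in a different order depending on how the key is encapsulated, which matters when the key encodes a date or a numeric identifier.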
  • the Map side sorting the keys of the target data according to the target data grading framework may also specifically be: the Map side sorts the hash values corresponding to the keys of the target data according to the target data grading framework.
  • the key of the data is calculated by the hash function, and the hash value corresponding to the key of the data is obtained.
  • tasks with the same hash value can be assigned to the same node for processing. That is, hash value sorting groups together key-value pairs that have the same key, and ensures that all key-value pairs of a single key are sent to the same Reducer and processed in one Reduce function call. Alternatively, the key-value pairs (<key, value_1>, <key, value_2>, ..., <key, value_n>) of a single key can be combined into <key, (value_1, value_2, ..., value_n)> and passed as the input parameter of one Reduce function call.
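  • a minimal sketch of hash-based routing and grouping is given below. It assumes CRC32 as a stable stand-in hash and a modulo over the number of reducers; the function names `partition` and `shuffle_by_hash` and the sample keys are hypothetical, and the hash function actually used by a given Map/Reduce deployment may differ.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Pick a reducer from a stable hash of the key (CRC32 is an assumed stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle_by_hash(pairs, num_reducers):
    """Route each pair to a reducer, then group the values under their key."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        reducers[partition(key, num_reducers)][key].append(value)
    # Sort keys inside each reducer so that its input arrives in key order.
    return [sorted(groups.items()) for groups in reducers]

pairs = [("game_a", 1), ("game_b", 2), ("game_a", 3), ("game_c", 4), ("game_a", 5)]
reducer_inputs = shuffle_by_hash(pairs, num_reducers=2)
```

  • every pair with the key `game_a` lands on the same reducer as a single entry of the form <key, (value_1, ..., value_n)>, whichever reducer index the hash happens to select.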
  • the Reduce side pulls the sorted target data, and acquires a preconfigured data hierarchy as the target data hierarchy.
  • the Reduce end aggregates and stores the sorted target data in a columnar storage form according to the target data hierarchical structure.
  • in this embodiment, data dumping is implemented based on the Map/Reduce architecture, so that data originally stored in the form of row storage is converted into data stored in the form of columnar storage, and the dumped data supports output in multiple forms that match the pre-configured data hierarchical structure, which is convenient for users performing big data analysis and achieves the beneficial effect of improving data query and use efficiency.
  • the data dump is realized based on the Map/Reduce architecture, which improves the stability of data dump processing.
  • FIG. 3 is a flowchart of a data dump method based on a Map/Reduce architecture provided by Embodiment 3 of the present application. On the basis of the foregoing embodiment, this embodiment provides a specific implementation manner.
  • the data dump method based on the Map/Reduce architecture provided by this embodiment specifically includes:
  • the Map terminal obtains target data, where the target data is stored in a row storage format.
  • the Map terminal parses the key of the target data.
  • the Map terminal sorts the hash values corresponding to the keys of the target data according to the target data grading framework.
  • the Reduce side pulls the sorted target data, and acquires a preconfigured data hierarchy as the target data hierarchy.
  • the Reduce side aggregates and stores the sorted target data according to the hierarchical structure of the target data and the specification of parquet columnar storage.
  • in this embodiment, data dumping is implemented based on the Map/Reduce architecture, so that data originally stored in the form of row storage is converted into data stored in the form of columnar storage, and the dumped data supports output in multiple forms that match the pre-configured data hierarchical structure, which is convenient for users performing big data analysis and achieves the beneficial effect of improving data query and use efficiency.
  • the data dump is realized based on the Map/Reduce architecture, which improves the stability of data dump processing.
  • FIG. 4 is a schematic structural diagram of a data dump device based on the Map/Reduce architecture provided in Embodiment 4 of the present application, which can be applied to dump specific business data (for example, game platform data) to improve its query and use efficiency.
  • the device can be implemented in software and/or hardware, and can generally be integrated in a processor.
  • the data dumping device based on the Map/Reduce architecture specifically includes: a Map-side processing module 410 and a Reduce-side processing module 420, wherein,
  • the Map side processing module 410 is used for the Map side to obtain the target data and shuffle the target data according to the target data grading framework; wherein, the target data is stored in the form of row storage;
  • the Reduce side processing module 420 is used for the Reduce side to aggregate and store the shuffled target data in the form of columnar storage according to the hierarchical structure of the target data.
  • in this embodiment, after the Map side obtains the data stored in the form of row storage, it shuffles the data according to the target data grading framework; after the Reduce side reads the shuffled data, it aggregates and stores the data in the form of columnar storage according to the target data hierarchical structure. The dumped data is therefore not only stored in the form of columnar storage but also achieves a multi-output effect that matches the target data hierarchical structure, which improves data query and usage efficiency.
  • the Map-side processing module includes: a key parsing unit and a data shuffling unit, wherein,
  • the key parsing unit is used for the Map end to parse the key of the target data;
  • the data shuffling unit is used for the Map end to sort the keys of the target data according to the target data grading framework to obtain the shuffled target data.
  • the data shuffling unit is specifically configured for the Map side to sort the hash values corresponding to the keys of the target data according to the target data grading framework.
  • the Reduce side processing module includes: a target data hierarchical structure acquisition unit and a data summary storage unit, wherein,
  • the target data hierarchical structure acquisition unit is used for the Reduce side to acquire a pre-configured data hierarchical structure as the target data hierarchical structure;
  • the data summary storage unit is used for the Reduce side to summarize and store the shuffled target data in the form of columnar storage according to the hierarchical structure of the target data.
  • the columnar storage form includes parquet columnar storage.
  • the Reduce side processing module is specifically used for the Reduce side to aggregate and store the shuffled target data according to the target data hierarchical structure and the specification of parquet columnar storage.
  • the above-mentioned data dumping device based on the Map/Reduce architecture can execute the data dumping method based on the Map/Reduce architecture provided by any embodiment of the present application, and has the functional modules and beneficial effects corresponding to executing that method.
  • FIG. 5 is a schematic structural diagram of a computer device according to Embodiment 5 of the present application.
  • Figure 5 shows a block diagram of an exemplary computer device 12 suitable for use in implementing embodiments of the present application.
  • the computer device 12 shown in FIG. 5 is only an example, and should not impose any limitations on the functions and scope of use of the embodiments of the present application.
  • computer device 12 takes the form of a general-purpose computing device.
  • Components of computer device 12 may include, but are not limited to, one or more processors or processing units 16 , system memory 28 , and a bus 18 connecting various system components including system memory 28 and processing unit 16 .
  • Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a graphics acceleration port, a processor, or a local bus using any of a variety of bus structures.
  • these architectures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
  • Computer device 12 typically includes a variety of computer system readable media. These media can be any available media that can be accessed by computer device 12, including both volatile and nonvolatile media, removable and non-removable media.
  • System memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32 .
  • Computer device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media.
  • storage system 34 may be used to read and write to non-removable, non-volatile magnetic media (not shown in FIG. 5, commonly referred to as a "hard disk drive").
  • a magnetic disk drive may be provided for reading and writing to removable non-volatile magnetic disks (e.g., "floppy disks"), and an optical disk drive may be provided for reading and writing to removable non-volatile optical disks (e.g., CD-ROM, DVD-ROM, or other optical media).
  • each drive may be connected to bus 18 through one or more data media interfaces.
  • System memory 28 may include at least one program product having a set (eg, at least one) of program modules configured to perform the functions of various embodiments of the present application.
  • a program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in system memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each or some combination of these examples may include an implementation of a network environment.
  • Program modules 42 generally perform the functions and/or methods of the embodiments described herein.
  • Computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with computer device 12, and/or with any device (e.g., network card, modem, etc.) that enables the computer device 12 to communicate with one or more other computing devices. Such communication may take place through input/output (I/O) interface 22. Also, the computer device 12 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 20. As shown, network adapter 20 communicates with other modules of computer device 12 via bus 18. It should be understood that, although not shown in FIG. 5, other hardware and/or software modules may be used in conjunction with computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, etc.
  • the processing unit 16 executes various functional applications and data processing by running the programs stored in the system memory 28, for example, implementing the data dump method based on the Map/Reduce architecture provided by the embodiments of the present application. That is, when the processing unit executes the program, it realizes: the Map side obtains the target data and shuffles the target data according to the target data grading framework, wherein the target data is stored in the form of row storage; the Reduce side aggregates and stores the shuffled target data in the form of columnar storage according to the hierarchical structure of the target data.
  • the sixth embodiment of the present application provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, it implements the data dump method based on the Map/Reduce architecture as provided by all the embodiments of the present application. That is, when the program is executed by the processor, it realizes: the Map side obtains the target data and shuffles the target data according to the target data grading framework, wherein the target data is stored in the form of row storage; the Reduce side aggregates and stores the shuffled target data in the form of columnar storage according to the hierarchical structure of the target data.
  • the computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above. More specific examples (a non-exhaustive list) of computer readable storage media include: electrical connections having one or more wires, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), Erasable Programmable Read Only Memory (EPROM or Flash), fiber optics, portable compact disk read only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • a computer-readable storage medium can be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a propagated data signal in baseband or as part of a carrier wave, with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted using any suitable medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • Computer program code for carrying out the operations of the present application may be written in one or more programming languages, including object-oriented programming languages (such as Java, Smalltalk, C++) and conventional procedural programming languages (such as the "C" language or similar programming languages).
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider).

Abstract

A data dump method and apparatus, a device, and a storage medium based on a Map/Reduce architecture. The method comprises: a Map side obtains target data and shuffles the target data according to a target data grading framework, wherein the target data is stored in a row-oriented storage form (S110); a Reduce side aggregates and stores the shuffled target data in a columnar storage form according to a target data hierarchical structure (S120). With this method, the dumped data is stored in a columnar storage form, and multiple data outputs corresponding to the target data grading framework can be produced, which improves the efficiency of data query and use.

Description

Data dump method, apparatus, device, and storage medium
This application claims priority to Chinese patent application No. 202010740395.4, filed with the China Patent Office on July 28, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the technical field of databases, and in particular to a data dump method, apparatus, device, and storage medium based on a Map/Reduce architecture.
Background
A database is "a warehouse that organizes, stores, and manages data according to a data structure". It offers a large storage space, can hold millions, tens of millions, or hundreds of millions of records, and is suitable for many types of business data, such as game platform data and medical system data. However, as the amount of data stored in databases continues to grow, improving the efficiency of data query and use has become an urgent problem.
Summary
The embodiments of the present application provide a data dump method, apparatus, device, and storage medium based on a Map/Reduce architecture, so as to improve the efficiency of data query and use.
In a first aspect, an embodiment of the present application provides a data dump method based on a Map/Reduce architecture, including:
a Map side obtains target data and shuffles the target data according to a target data grading framework, wherein the target data is stored in a row-oriented storage form;
a Reduce side aggregates and stores the shuffled target data in a columnar storage form according to a target data hierarchical structure.
Further, the Map side shuffling the target data according to the target data grading framework includes:
the Map side parses the keys of the target data;
the Map side sorts the keys of the target data according to the target data grading framework to obtain the shuffled target data.
Further, the Map side sorting the keys of the target data according to the target data grading framework includes:
the Map side sorts the hash values corresponding to the keys of the target data according to the target data grading framework.
Further, the Reduce side aggregating and storing the shuffled target data in a columnar storage form according to the target data hierarchical structure includes:
the Reduce side obtains a preconfigured data hierarchical structure as the target data hierarchical structure;
the Reduce side aggregates and stores the shuffled target data in a columnar storage form according to the target data hierarchical structure.
Optionally, the columnar storage form includes Parquet columnar storage.
Further, the Reduce side aggregating and storing the shuffled target data in a columnar storage form according to the target data hierarchical structure includes:
the Reduce side aggregates and stores the shuffled target data according to the target data hierarchical structure and the Parquet columnar storage specification.
In a second aspect, an embodiment of the present application further provides a data dump apparatus based on a Map/Reduce architecture, including:
a Map-side processing module, configured for the Map side to obtain target data and shuffle the target data according to a target data grading framework, wherein the target data is stored in a row-oriented storage form;
a Reduce-side processing module, configured for the Reduce side to aggregate and store the shuffled target data in a columnar storage form according to a target data hierarchical structure.
Further, the Map-side processing module includes a key parsing unit and a data shuffling unit, wherein
the key parsing unit is configured for the Map side to parse the keys of the target data;
the data shuffling unit is configured for the Map side to sort the keys of the target data according to the target data grading framework to obtain the shuffled target data.
Further, the data shuffling unit is specifically configured for the Map side to sort the hash values corresponding to the keys of the target data according to the target data grading framework.
Further, the Reduce-side processing module includes a target data hierarchical structure acquisition unit and a data aggregation storage unit, wherein
the target data hierarchical structure acquisition unit is configured for the Reduce side to obtain a preconfigured data hierarchical structure as the target data hierarchical structure;
the data aggregation storage unit is configured for the Reduce side to aggregate and store the shuffled target data in a columnar storage form according to the target data hierarchical structure.
Optionally, the columnar storage form includes Parquet columnar storage.
Further, the Reduce-side processing module is specifically configured for the Reduce side to aggregate and store the shuffled target data according to the target data hierarchical structure and the Parquet columnar storage specification.
In a third aspect, an embodiment of the present application further provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the data dump method based on the Map/Reduce architecture according to any embodiment of the present application.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the data dump method based on the Map/Reduce architecture according to any embodiment of the present application.
In the technical solution provided by the embodiments of the present application, the Map side obtains data stored in a row-oriented storage form and shuffles it according to the target data grading framework. After reading the shuffled data, the Reduce side aggregates and stores it in a columnar storage form according to the target data hierarchical structure. The dumped data is therefore not only stored in a columnar storage form but also delivered as multiple outputs that match the target data hierarchical structure, which improves the efficiency of data query and use.
Brief Description of the Drawings
FIG. 1 is a flowchart of a data dump method based on a Map/Reduce architecture in Embodiment 1 of the present application;
FIG. 2 is a flowchart of a data dump method based on a Map/Reduce architecture in Embodiment 2 of the present application;
FIG. 3 is a flowchart of a data dump method based on a Map/Reduce architecture in Embodiment 3 of the present application;
FIG. 4 is a schematic structural diagram of a data dump apparatus based on a Map/Reduce architecture in Embodiment 4 of the present application;
FIG. 5 is a schematic diagram of the hardware structure of a computer device in Embodiment 5 of the present application.
Detailed Description
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present application rather than the entire structure.
Before the exemplary embodiments are discussed in greater detail, it should be mentioned that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart describes the operations (or steps) as sequential processing, many of the operations may be performed in parallel or concurrently. In addition, the order of the operations may be rearranged. A process may be terminated when its operations are completed, but it may also have additional steps not included in the drawings. A process may correspond to a method, function, procedure, subroutine, subprogram, and so on.
Embodiment 1
FIG. 1 is a flowchart of a data dump method based on a Map/Reduce architecture provided in Embodiment 1 of the present application. The method is applicable to dumping specific business data (for example, game platform data) to improve the efficiency of querying and using it. The method may be executed by the data dump apparatus based on the Map/Reduce architecture provided in the embodiments of the present application; the apparatus may be implemented in software and/or hardware and can generally be integrated in a processor.
As shown in FIG. 1, the data dump method based on the Map/Reduce architecture provided in this embodiment specifically includes:
S110. The Map side obtains target data and shuffles the target data according to a target data grading framework, wherein the target data is stored in a row-oriented storage form.
Target data refers to data stored in a row-oriented database that needs to be dumped to improve the efficiency of data query and use. Typically, the target data may be offline data stored in a row-oriented database within a set period (for example, within the last month). A row-oriented database stores data by row; it is good at random read operations but is not well suited to big data analysis. Traditional databases such as SQL Server, Oracle, and MySQL all fall into the category of row-oriented databases.
A database presents data as a two-dimensional table of rows and columns, but physically stores it as a one-dimensional string. Table 1 below is used as a simplified example.
Table 1
EmpId    Lastname    Firstname    Salary
1        Smith       Joe          40000
2        Jones       Mary         50000
3        Johnson     Cathy        44000
A row-oriented database stores the data values of one row strung together, then stores the data of the next row, and so on, that is:
1,Smith,Joe,40000;2,Jones,Mary,50000;3,Johnson,Cathy,44000;
Suppose the salaries of all personnel are to be analyzed. Every row must be read, the data corresponding to the field "Salary" must be picked out of each row, and only then can the analysis be performed. When Table 1 holds a very large amount of data and the analysis needs only the "Salary" column, reading all the data in Table 1 seriously degrades query efficiency.
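By way of illustration only, the row-oriented layout above can be sketched in Python with the values of Table 1. The function names and the use of "," and ";" as delimiters are assumptions taken from the simplified string shown above, not the on-disk format of any real database. The sketch counts how many fields must be touched to extract the "Salary" column.

```python
# Table 1 from the description, as (EmpId, Lastname, Firstname, Salary) rows.
COLUMNS = ["EmpId", "Lastname", "Firstname", "Salary"]
ROWS = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
    (3, "Johnson", "Cathy", 44000),
]


def serialize_rows(rows):
    """Concatenate the values of each row, one row after another."""
    return "".join(",".join(str(v) for v in row) + ";" for row in rows)


def read_column_from_row_store(blob, column):
    """Scan every record, pick one field, and count the fields touched."""
    index = COLUMNS.index(column)
    values, fields_read = [], 0
    for record in blob.split(";"):
        if not record:  # skip the empty piece after the final ';'
            continue
        fields = record.split(",")
        fields_read += len(fields)
        values.append(fields[index])
    return values, fields_read


row_blob = serialize_rows(ROWS)
salaries, fields_read = read_column_from_row_store(row_blob, "Salary")
print(row_blob)
print(salaries, fields_read)
```

In this toy layout all twelve fields are read in order to return three salary values, which is the overhead the paragraph above describes.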
The target data grading framework may refer to a general data grading framework under the Map/Reduce architecture based on HDFS (Hadoop Distributed File System), or to a custom data grading framework under the HDFS-based Map/Reduce architecture. Typically, a custom data grading framework can match the target data hierarchical structure described below, that is, the ordering of the shuffled data matches the target data hierarchical structure. Specifically, the shuffle operation on the target data is implemented under the target data grading framework.
In a cluster environment such as Hadoop, most Map tasks and Reduce tasks execute on different nodes. After each Map task finishes, it generates an output file and waits for the Reduce tasks to pull the data, that is, to pull the shuffled target data.
S120. The Reduce side aggregates and stores the shuffled target data in a columnar storage form according to a target data hierarchical structure.
Taking Table 1 above as an example, columnar storage stores the data values of one column strung together, then stores the data of the next column, and so on, that is:
1,2,3;Smith,Jones,Johnson;Joe,Mary,Cathy;40000,50000,44000;
In a database based on columnar storage, the column is the basic logical storage unit, and the data of one column is stored contiguously on the storage medium. Suppose the salaries of all personnel are to be analyzed: only one column of data, namely the data corresponding to the field "Salary", needs to be read before the analysis is performed. When Table 1 holds a very large amount of data, this markedly reduces query response time. Data can also be looked up efficiently within a column without maintaining an index (any column can serve as an index), irrelevant input and output is minimized during the query, and a full table scan is avoided.
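As a rough counterpart for the columnar case, the following Python sketch serializes Table 1 column by column and reads back only the "Salary" segment. The function names and delimiters are illustrative assumptions; in particular, the `split(";")` call merely stands in for the offset lookup a real columnar format would use to jump to a column.

```python
# Table 1 from the description, as (EmpId, Lastname, Firstname, Salary) rows.
COLUMNS = ["EmpId", "Lastname", "Firstname", "Salary"]
ROWS = [
    (1, "Smith", "Joe", 40000),
    (2, "Jones", "Mary", 50000),
    (3, "Johnson", "Cathy", 44000),
]


def serialize_columns(rows):
    """Concatenate the values of each column, one column after another."""
    return "".join(",".join(str(v) for v in col) + ";" for col in zip(*rows))


def read_column_from_column_store(blob, column):
    """Read only the contiguous segment that holds the requested column."""
    segment = blob.split(";")[COLUMNS.index(column)]
    values = segment.split(",")
    return values, len(values)


column_blob = serialize_columns(ROWS)
salaries, fields_read = read_column_from_column_store(column_blob, "Salary")
print(column_blob)
print(salaries, fields_read)
```

Here only the three salary values are decoded, and the other nine fields of Table 1 are never parsed.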
The target data hierarchical structure refers to a data hierarchical structure that matches the business analysis of the target business data (for example, game platform business data or medical system data). Taking game platform business data as an example, in one structure the first data level is the game name, the second data level is the event type (for example, login events, consumption events, or interaction events), and the third data level is the date; in another, the first data level is the game name, the second is the user ID, the third is the event type, and the fourth is the date.
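One possible way to picture such a hierarchical structure is as an ordered list of field names, from which each record derives a path. The Python sketch below assumes this representation; the field names (`game`, `user_id`, `event_type`, `date`), the sample record, and the "/"-joined path are all hypothetical and are not prescribed by the present application.

```python
# The two example hierarchies from the game-platform case (field names assumed).
HIERARCHY_A = ["game", "event_type", "date"]
HIERARCHY_B = ["game", "user_id", "event_type", "date"]


def hierarchy_path(record, hierarchy):
    """Return the record's position in the hierarchy as a '/'-joined path."""
    return "/".join(str(record[level]) for level in hierarchy)


record = {
    "game": "game_a",
    "user_id": 42,
    "event_type": "login",
    "date": "2020-07-28",
}
path_a = hierarchy_path(record, HIERARCHY_A)
path_b = hierarchy_path(record, HIERARCHY_B)
print(path_a)
print(path_b)
```

The same record lands at a different path under each hierarchy, which is why the choice of hierarchy determines how the dumped data is grouped.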
Typically, the data hierarchical structure matches the business analysis and can be configured in real time. A user can configure the data hierarchical structure in real time according to actual big data analysis or business analysis requirements, so that the data aggregated and output by the Reduce side matches the configured data hierarchical structure.
Accordingly, the Reduce side aggregating and storing the shuffled target data in a columnar storage form according to the target data hierarchical structure may specifically be:
the Reduce side obtains a preconfigured data hierarchical structure as the target data hierarchical structure;
the Reduce side aggregates and stores the shuffled target data in a columnar storage form according to the target data hierarchical structure.
Specifically, multiple data hierarchical structures may be preconfigured and stored. The Reduce side obtains the one selected by the user as the target data hierarchical structure, and then aggregates and stores the shuffled target data in a columnar storage form according to that structure.
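The selection step can be sketched as a lookup in a small registry of preconfigured hierarchies. In the Python sketch below, the registry contents, the names `by_event` and `by_user`, and the function `get_target_hierarchy` are hypothetical; how the configuration is actually stored and selected is not specified in the present application.

```python
# Hypothetical registry of preconfigured data hierarchical structures.
PRECONFIGURED_HIERARCHIES = {
    "by_event": ["game", "event_type", "date"],
    "by_user": ["game", "user_id", "event_type", "date"],
}


def get_target_hierarchy(selected_name, configured=PRECONFIGURED_HIERARCHIES):
    """Return a copy of the hierarchy the user selected, or fail if unknown."""
    try:
        return list(configured[selected_name])
    except KeyError:
        raise ValueError(f"unknown data hierarchy: {selected_name!r}") from None


target_hierarchy = get_target_hierarchy("by_user")
print(target_hierarchy)
```

Returning a copy keeps a caller from accidentally modifying the stored configuration, and failing on an unknown name avoids silently dumping data under an unintended hierarchy.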
As a specific implementation, the columnar storage form may be Parquet columnar storage.
Parquet is a columnar storage format that can store nested data efficiently. A Parquet file consists of a header, one or more blocks that follow it, and a footer at the end. The header contains only the magic number that identifies the file as a Parquet file. Each block stores one row group; a row group is made up of column chunks, and each column chunk stores the data of one column. The data in each column chunk is organized in pages.
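The nesting described above (file, row groups, column chunks, pages) can be pictured with a purely conceptual Python model. The class names, the `build_file` helper, and the tiny group and page sizes are assumptions made for illustration; this is neither the Parquet binary encoding nor the API of any Parquet library, and the footer shown is only a stand-in for the real file metadata.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ColumnChunk:
    name: str
    pages: List[List[Any]] = field(default_factory=list)  # each page is a list of values


@dataclass
class RowGroup:
    column_chunks: List[ColumnChunk]


@dataclass
class ParquetLikeFile:
    header: bytes            # real Parquet files begin with the magic bytes b"PAR1"
    row_groups: List[RowGroup]
    footer: Dict[str, Any]   # simplified stand-in for the file metadata


def build_file(columns, rows, rows_per_group=2, values_per_page=1):
    """Split rows into row groups, each holding one paged chunk per column."""
    row_groups = []
    for start in range(0, len(rows), rows_per_group):
        group_rows = rows[start:start + rows_per_group]
        chunks = []
        for i, name in enumerate(columns):
            values = [row[i] for row in group_rows]
            pages = [values[p:p + values_per_page]
                     for p in range(0, len(values), values_per_page)]
            chunks.append(ColumnChunk(name, pages))
        row_groups.append(RowGroup(chunks))
    footer = {"columns": list(columns), "num_rows": len(rows),
              "num_row_groups": len(row_groups)}
    return ParquetLikeFile(header=b"PAR1", row_groups=row_groups, footer=footer)


COLUMNS = ["EmpId", "Lastname", "Firstname", "Salary"]
ROWS = [(1, "Smith", "Joe", 40000), (2, "Jones", "Mary", 50000),
        (3, "Johnson", "Cathy", 44000)]
parquet_like = build_file(COLUMNS, ROWS)
print(parquet_like.footer)
```

With three rows and two rows per group, Table 1 is split into two row groups, and the "Salary" chunk of the first group holds two one-value pages.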
Further, the Reduce side aggregating and storing the shuffled target data in a columnar storage form according to the target data hierarchical structure may specifically be:
the Reduce side aggregates and stores the shuffled target data according to the target data hierarchical structure and the Parquet columnar storage specification.
After pulling the shuffled data, the Reduce side aggregates and stores the shuffled target data according to the target data hierarchical structure that matches the business analysis and according to the Parquet columnar storage specification (schema). That is, the data of each level is stored at a designated location, so that the target data can be presented to the user as multiple outputs. For example, the data of a certain event of a certain game is stored at a designated location, the data of a certain event of a certain game on a certain day is stored at a designated location, and so on. Then, during big data analysis, for example when analyzing the data of a certain event of a certain game on a certain day, the data can be obtained quickly from its designated storage location, which improves the efficiency of data query and use.
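The idea of designated storage locations and multiple outputs can be sketched as grouping records by their hierarchy path and keeping each group column by column. In the Python sketch below, a dictionary keyed by path stands in for the storage locations; the function name, field names, and sample records are assumptions, and no actual Parquet file is written.

```python
from collections import defaultdict


def reduce_to_columnar_outputs(records, hierarchy, value_columns):
    """Group records by hierarchy path and store each group column by column."""
    outputs = defaultdict(lambda: {col: [] for col in value_columns})
    for record in records:
        path = "/".join(str(record[level]) for level in hierarchy)
        for col in value_columns:
            outputs[path][col].append(record[col])
    return dict(outputs)


# Hypothetical shuffled game-platform records.
records = [
    {"game": "game_a", "event_type": "login", "date": "2020-07-28",
     "user_id": 1, "amount": 0},
    {"game": "game_a", "event_type": "purchase", "date": "2020-07-28",
     "user_id": 1, "amount": 30},
    {"game": "game_a", "event_type": "login", "date": "2020-07-28",
     "user_id": 2, "amount": 0},
]
outputs = reduce_to_columnar_outputs(
    records, ["game", "event_type", "date"], ["user_id", "amount"])
for path, columns in outputs.items():
    print(path, columns)
```

An analysis of the login events of `game_a` on one day then reads a single entry, `outputs["game_a/login/2020-07-28"]`, rather than filtering all records.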
When aggregating and storing the shuffled target data according to the Parquet columnar storage specification, the Reduce side specifically sets the suffix of the generated data files to the Parquet file suffix, compresses the data using a compression method of Parquet columnar storage (for example, its compression ratio), and so on.
In the technical solution provided by the embodiments of the present application, the Map side obtains data stored in a row-oriented storage form and shuffles it according to the target data grading framework. After reading the shuffled data, the Reduce side aggregates and stores it in a columnar storage form according to the target data hierarchical structure. The dumped data is therefore not only stored in a columnar storage form but also delivered as multiple outputs that match the target data hierarchical structure, which improves the efficiency of data query and use.
Embodiment 2
FIG. 2 is a flowchart of a data dump method based on a Map/Reduce architecture provided in Embodiment 2 of the present application. This embodiment is optimized on the basis of the above embodiment, wherein the Map side shuffling the target data according to the target data grading framework is specifically:
the Map side parses the keys of the target data; the Map side sorts the keys of the target data according to the target data grading framework to obtain the shuffled target data.
As shown in FIG. 2, the data dump method based on the Map/Reduce architecture provided in this embodiment specifically includes:
S210. The Map side obtains target data, wherein the target data is stored in a row-oriented storage form.
S220. The Map side parses the keys of the target data.
Key/value pair storage is the simplest form of organization for a database. After a Map task runs, it parses the target data into key/value pairs, from which the key and the value of each piece of data can be obtained.
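A minimal sketch of this parsing step in Python is given below. It assumes, purely for illustration, that each row arrives as a delimited text line and that the key is built from the fields named in the hierarchy; the function name, field names, and record layout are hypothetical.

```python
def parse_key_value(line, key_fields, field_names, sep=","):
    """Split one row-oriented record and build a (key, value) pair."""
    fields = dict(zip(field_names, line.rstrip(";\n").split(sep)))
    key = tuple(fields[name] for name in key_fields)
    return key, fields


FIELD_NAMES = ["game", "event_type", "date", "user_id"]  # assumed row schema
KEY_FIELDS = ["game", "event_type", "date"]              # assumed key fields

key, value = parse_key_value("game_a,login,2020-07-28,42", KEY_FIELDS, FIELD_NAMES)
print(key)
print(value)
```

Keeping the whole record as the value means no column is lost before the Reduce side lays the data out in columnar form.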
S230. The Map side sorts the keys of the target data according to the target data grading framework.
A lookup based on a key can quickly find the value corresponding to that key, so sorting the keys according to the target data grading framework achieves the shuffle ordering of the target data. Consequently, before the data reaches the Reduce side, it has already been sorted by key based on the target data grading framework.
Specifically, the key ordering may depend on the wrapper type of the key. For example, if the key is of type IntWritable, which wraps an int, the keys can be sorted numerically according to the target data grading framework; if the key is of type Text, which wraps a String, the keys can be sorted in lexicographic order of their characters.
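The difference between the two orderings can be seen with a small Python analogue, in which Python `int` and `str` keys stand in for IntWritable and Text keys. This is only an analogy: Hadoop compares Text keys by their encoded bytes, which agrees with Python's string ordering for the ASCII keys used here.

```python
# The same three key values, once as numbers and once as text.
int_keys = [10, 9, 100]
text_keys = ["10", "9", "100"]

sorted_int = sorted(int_keys)    # numeric order: [9, 10, 100]
sorted_text = sorted(text_keys)  # lexicographic order: ['10', '100', '9']

print(sorted_int)
print(sorted_text)
```

The text ordering places "100" before "9", so the choice of key type directly affects the order in which data reaches the Reduce side.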
As a specific implementation, the Map side sorting the keys of the target data according to the target data grading framework may also specifically be: the Map side sorts the hash values corresponding to the keys of the target data according to the target data grading framework.
A hash function is applied to the key of each piece of data to obtain the hash value corresponding to the key. After the hash values are sorted, tasks with the same hash value can be assigned to the same node for processing. In other words, sorting by hash value groups key/value pairs that share a key and guarantees that all key/value pairs of a single key are sent to the same Reducer and processed in a single call to the Reduce function. Equivalently, the key/value pairs of a single key (<key,value_1>, <key,value_2>, ..., <key,value_n>) can be merged into <key,(value_1,value_2,...,value_n)> and passed as the input of a single Reduce function call.
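The grouping behavior described above can be sketched in Python as follows. The sketch assumes string keys and uses `zlib.crc32` as a stand-in for the key's hash function (Python's built-in `hash` is randomized per process for strings, so it is not used); the function names and the number of reducers are illustrative assumptions.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_reducers):
    """Deterministic hash partitioning: equal keys always map to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    """Return {reducer_index: {key: [values...]}} with keys sorted per reducer."""
    buckets = defaultdict(lambda: defaultdict(list))
    for key, value in pairs:
        buckets[partition_for(key, num_reducers)][key].append(value)
    return {r: dict(sorted(groups.items())) for r, groups in sorted(buckets.items())}


pairs = [("game_a", 1), ("game_b", 2), ("game_a", 3), ("game_c", 4), ("game_b", 5)]
shuffled = shuffle(pairs, num_reducers=2)
for reducer, groups in shuffled.items():
    print(reducer, groups)
```

Each key appears under exactly one reducer, and its values are already merged into a list, which corresponds to the <key,(value_1,...,value_n)> input of a single Reduce call.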
S240. The Reduce side pulls the sorted target data and obtains a preconfigured data hierarchical structure as the target data hierarchical structure.
S250. The Reduce side aggregates and stores the sorted target data in a columnar storage form according to the target data hierarchical structure.
For details not explained in this embodiment, refer to the foregoing embodiments; they are not repeated here.
In the above technical solution, the data dump is implemented on the Map/Reduce architecture, converting data originally stored in a row-oriented storage form into data stored in a columnar storage form. The dumped data is output in multiple forms that match a data hierarchical structure that supports preconfiguration, which makes big data analysis easier for users and improves the efficiency of data query and use. Implementing the data dump on the Map/Reduce architecture also improves the stability of dump processing.
Embodiment 3
FIG. 3 is a flowchart of a data dump method based on a Map/Reduce architecture provided in Embodiment 3 of the present application. On the basis of the above embodiments, this embodiment provides a specific implementation.
As shown in FIG. 3, the data dump method based on the Map/Reduce architecture provided in this embodiment specifically includes:
S310. The Map side obtains target data, wherein the target data is stored in a row-oriented storage form.
S320. The Map side parses the keys of the target data.
S330. The Map side sorts the hash values corresponding to the keys of the target data according to the target data grading framework.
S340. The Reduce side pulls the sorted target data and obtains a preconfigured data hierarchical structure as the target data hierarchical structure.
S350. The Reduce side aggregates and stores the sorted target data according to the target data hierarchical structure and the Parquet columnar storage specification.
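Steps S310 to S350 can be strung together in a compact, single-process Python sketch. It is only an illustration under stated assumptions: the row schema, the hierarchy, and the function names are hypothetical, `zlib.crc32` stands in for the key hash, and the final step produces in-memory column lists instead of writing an actual Parquet file.

```python
import zlib
from collections import defaultdict

FIELDS = ["game", "event_type", "date", "user_id"]  # hypothetical row schema
HIERARCHY = ["game", "event_type", "date"]          # hypothetical preconfigured hierarchy


def map_side(lines):
    """S310-S330: parse each row into (key, record) and sort by the key's hash."""
    pairs = []
    for line in lines:
        record = dict(zip(FIELDS, line.split(",")))
        key = "/".join(record[level] for level in HIERARCHY)
        pairs.append((key, record))
    # crc32 stands in for the key hash; the key itself breaks hash ties.
    return sorted(pairs, key=lambda kv: (zlib.crc32(kv[0].encode("utf-8")), kv[0]))


def reduce_side(sorted_pairs):
    """S340-S350: group by key and lay each group out column by column."""
    outputs = defaultdict(lambda: {name: [] for name in FIELDS})
    for key, record in sorted_pairs:
        for name in FIELDS:
            outputs[key][name].append(record[name])
    return dict(outputs)


lines = [
    "game_a,login,2020-07-28,1",
    "game_b,login,2020-07-28,7",
    "game_a,login,2020-07-28,2",
]
result = reduce_side(map_side(lines))
for path, columns in result.items():
    print(path, columns)
```

Because Python's sort is stable, records that share a key keep their input order within each output group.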
For details not explained in this embodiment, refer to the foregoing embodiments; they are not repeated here.
In the above technical solution, the data dump is implemented on the Map/Reduce architecture, converting data originally stored in a row-oriented storage form into data stored in a columnar storage form. The dumped data is output in multiple forms that match a data hierarchical structure that supports preconfiguration, which makes big data analysis easier for users and improves the efficiency of data query and use. Implementing the data dump on the Map/Reduce architecture also improves the stability of dump processing.
实施例四Embodiment 4
图4是本申请实施例四提供的一种基于Map/Reduce架构的数据转储装置的结构示意图,可适用于对特定业务数据(例如是游戏平台数据)进行转储以提高其查询和使用效率的情况,该装置可采用软件和/或硬件的方式实现,并一般可集成在处理器中。4 is a schematic structural diagram of a data dump device based on Map/Reduce architecture provided in Embodiment 4 of the present application, which can be applied to dump specific business data (for example, game platform data) to improve its query and use efficiency In the case of the above, the device can be implemented in software and/or hardware, and can generally be integrated in a processor.
As shown in FIG. 4, the data dump apparatus based on the Map/Reduce architecture specifically includes a Map-side processing module 410 and a Reduce-side processing module 420, wherein:
The Map-side processing module 410 is configured for the Map side to acquire target data and shuffle the target data according to the target data grading framework, wherein the target data is stored in row-oriented form;
The Reduce-side processing module 420 is configured for the Reduce side to aggregate and store the shuffled target data in columnar form according to the target data hierarchical structure.
In the technical solution provided by this embodiment of the present application, after acquiring data stored in row-oriented form, the Map side shuffles the data according to the target data grading framework. After reading the shuffled data, the Reduce side aggregates and stores it in columnar form according to the target data hierarchical structure. The dumped data is therefore not only stored in columnar form but also achieves a multiple-output effect matching the target data hierarchical structure, which improves the efficiency of data query and use.
Further, the Map-side processing module includes a key parsing unit and a data shuffling unit, wherein:
The key parsing unit is configured for the Map side to parse the keys of the target data;
The data shuffling unit is configured for the Map side to sort the keys of the target data according to the target data grading framework to obtain the shuffled target data.
Further, the data shuffling unit is specifically configured for the Map side to sort the hash values corresponding to the keys of the target data according to the target data grading framework.
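A minimal Python sketch of the key parsing and hash-based sorting performed by these two units follows. It rests on several assumptions that are not stated in this application: the grading framework is modeled as an ordered list of field names (`key_fields`), keys are joined with `|`, and MD5 is used only because it gives the same hash value on every run. The helper names `parse_key`, `stable_hash`, and `shuffle` are hypothetical.

```python
import hashlib


def parse_key(row: dict, key_fields: list[str]) -> str:
    """Build the record key from the fields named by the (assumed) grading framework."""
    return "|".join(str(row[field]) for field in key_fields)


def stable_hash(key: str) -> int:
    """Hash a key deterministically; Python's built-in hash() of str varies per process."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def shuffle(rows: list[dict], key_fields: list[str]) -> list[tuple[str, dict]]:
    """Pair each row with its key, then sort the pairs by the hash of the key."""
    keyed = [(parse_key(row, key_fields), row) for row in rows]
    # The key itself breaks ties, so two keys with colliding hashes never interleave.
    keyed.sort(key=lambda pair: (stable_hash(pair[0]), pair[0]))
    return keyed


# Hypothetical row-oriented records.
rows = [
    {"game": "g2", "date": "2020-07-28", "user": "u3"},
    {"game": "g1", "date": "2020-07-28", "user": "u1"},
    {"game": "g2", "date": "2020-07-28", "user": "u4"},
    {"game": "g1", "date": "2020-07-29", "user": "u2"},
]

shuffled = shuffle(rows, ["game", "date"])
```

After the sort, all records that share a key are adjacent, so a downstream Reduce step can process each key as one contiguous run.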
Further, the Reduce-side processing module includes a target data hierarchical structure acquisition unit and a data aggregation storage unit, wherein:
The target data hierarchical structure acquisition unit is configured for the Reduce side to acquire a preconfigured data hierarchical structure as the target data hierarchical structure;
The data aggregation storage unit is configured for the Reduce side to aggregate and store the shuffled target data in columnar form according to the target data hierarchical structure.
Optionally, the columnar storage form includes Parquet columnar storage.
Further, the Reduce-side processing module is specifically configured for the Reduce side to aggregate and store the shuffled target data according to the target data hierarchical structure and the Parquet columnar storage specification.
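The Reduce-side aggregation described above might be sketched in Python as follows. This is an illustration under stated assumptions rather than the method of this application: the preconfigured hierarchical structure is modeled as an ordered list of field names (`hierarchy`), each level becomes a `field=value` path segment, and the columnar result is kept in memory. Writing real Parquet files would require a Parquet library and is omitted. The function name `reduce_to_columnar` is hypothetical.

```python
from itertools import groupby


def reduce_to_columnar(shuffled, hierarchy):
    """Group adjacent records by key and emit one columnar block per hierarchy path."""
    outputs = {}
    # groupby only merges neighbors, so this relies on the shuffle having
    # placed all records with the same key next to each other.
    for _key, group in groupby(shuffled, key=lambda pair: pair[0]):
        rows = [row for _, row in group]
        # One path segment per hierarchy level, e.g. "game=g1/date=2020-07-28".
        path = "/".join(f"{field}={rows[0][field]}" for field in hierarchy)
        # Assumes every row in a group has the same fields as the first row.
        outputs[path] = {name: [row[name] for row in rows] for name in rows[0]}
    return outputs


# Hypothetical shuffled input: (key, row) pairs with equal keys adjacent.
shuffled = [
    (("g1", "2020-07-28"), {"game": "g1", "date": "2020-07-28", "user": "u1", "score": 10}),
    (("g1", "2020-07-28"), {"game": "g1", "date": "2020-07-28", "user": "u2", "score": 25}),
    (("g2", "2020-07-28"), {"game": "g2", "date": "2020-07-28", "user": "u3", "score": 7}),
]

outputs = reduce_to_columnar(shuffled, ["game", "date"])
```

Each entry in `outputs` corresponds to one output location in the hierarchy, which is one way to picture the multiple-output effect matching the target data hierarchical structure.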
The above data dump apparatus based on the Map/Reduce architecture can execute the data dump method based on the Map/Reduce architecture provided by any embodiment of the present application, and has the functional modules and beneficial effects corresponding to executing that method.
Embodiment 5
FIG. 5 is a schematic structural diagram of a computer device provided in Embodiment 5 of the present application. FIG. 5 shows a block diagram of an exemplary computer device 12 suitable for implementing embodiments of the present application. The computer device 12 shown in FIG. 5 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application.
As shown in FIG. 5, the computer device 12 takes the form of a general-purpose computing device. Its components may include, but are not limited to, one or more processors or processing units 16, a system memory 28, and a bus 18 connecting the different system components (including the system memory 28 and the processing unit 16).
The bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The computer device 12 typically includes a variety of computer-system-readable media. These media may be any available media that the computer device 12 can access, including volatile and non-volatile media and removable and non-removable media.
The system memory 28 may include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, a storage system 34 may be used to read from and write to a non-removable, non-volatile magnetic medium (not shown in FIG. 5, commonly called a "hard disk drive"). Although not shown in FIG. 5, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (for example, a "floppy disk") and an optical disk drive for reading from and writing to a removable non-volatile optical disk (for example, a CD-ROM, DVD-ROM, or other optical medium) may be provided. In these cases, each drive may be connected to the bus 18 through one or more data media interfaces. The system memory 28 may include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of the present application.
A program/utility 40 having a set of (at least one) program modules 42 may be stored, for example, in the system memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods of the embodiments described in the present application.
The computer device 12 may also communicate with one or more external devices 14 (for example, a keyboard, a pointing device, or a display 24), with one or more devices that enable a user to interact with the computer device 12, and/or with any device (for example, a network card or a modem) that enables the computer device 12 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 22. The computer device 12 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 20. As shown in the figure, the network adapter 20 communicates with the other modules of the computer device 12 through the bus 18. It should be understood that, although not shown in FIG. 5, other hardware and/or software modules may be used in conjunction with the computer device 12, including but not limited to microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
By running programs stored in the system memory 28, the processing unit 16 executes various functional applications and data processing, for example implementing the data dump method based on the Map/Reduce architecture provided by the embodiments of the present application. That is, when executing the program, the processing unit implements the following: the Map side acquires target data and shuffles the target data according to the target data grading framework, wherein the target data is stored in row-oriented form; and the Reduce side aggregates and stores the shuffled target data in columnar form according to the target data hierarchical structure.
Embodiment 6
Embodiment 6 of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the data dump method based on the Map/Reduce architecture provided by the embodiments of the present application. That is, when executed by the processor, the program implements the following: the Map side acquires target data and shuffles the target data according to the target data grading framework, wherein the target data is stored in row-oriented form; and the Reduce side aggregates and stores the shuffled target data in columnar form according to the target data hierarchical structure.
Any combination of one or more computer-readable media may be used. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by, or in combination with, an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by, or in combination with, an instruction execution system, apparatus, or device.
Program code contained on a computer-readable medium may be transmitted over any appropriate medium, including but not limited to wireless, wire, optical cable, RF, or any suitable combination of the above.
Computer program code for carrying out the operations of the present application may be written in one or more programming languages or a combination thereof, including object-oriented programming languages (such as Java, Smalltalk, and C++) and conventional procedural programming languages (such as the "C" language or similar languages). The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

Claims (12)

  1. A data dump method based on a Map/Reduce architecture, comprising:
    a Map side acquiring target data and shuffling the target data according to a target data grading framework, wherein the target data is stored in row-oriented form; and
    a Reduce side aggregating and storing the shuffled target data in columnar form according to a target data hierarchical structure.
  2. The method according to claim 1, wherein the Map side shuffling the target data according to the target data grading framework comprises:
    the Map side parsing keys of the target data; and
    the Map side sorting the keys of the target data according to the target data grading framework to obtain the shuffled target data.
  3. The method according to claim 2, wherein the Map side sorting the keys of the target data according to the target data grading framework comprises:
    the Map side sorting hash values corresponding to the keys of the target data according to the target data grading framework.
  4. The method according to claim 1, wherein the Reduce side aggregating and storing the shuffled target data in columnar form according to the target data hierarchical structure comprises:
    the Reduce side acquiring a preconfigured data hierarchical structure as the target data hierarchical structure; and
    the Reduce side aggregating and storing the shuffled target data in columnar form according to the target data hierarchical structure.
  5. The method according to any one of claims 1-4, wherein the columnar storage form comprises Parquet columnar storage.
  6. The method according to claim 5, wherein the Reduce side aggregating and storing the shuffled target data in columnar form according to the target data hierarchical structure comprises:
    the Reduce side aggregating and storing the shuffled target data according to the target data hierarchical structure and the Parquet columnar storage specification.
  7. A data dump apparatus based on a Map/Reduce architecture, comprising:
    a Map-side processing module, configured for a Map side to acquire target data and shuffle the target data according to a target data grading framework, wherein the target data is stored in row-oriented form; and
    a Reduce-side processing module, configured for a Reduce side to aggregate and store the shuffled target data in columnar form according to a target data hierarchical structure.
  8. The apparatus according to claim 7, wherein the Map-side processing module comprises a key parsing unit and a data shuffling unit, wherein:
    the key parsing unit is configured for the Map side to parse keys of the target data; and
    the data shuffling unit is configured for the Map side to sort the keys of the target data according to the target data grading framework to obtain the shuffled target data.
  9. The apparatus according to claim 8, wherein the data shuffling unit is specifically configured for the Map side to sort hash values corresponding to the keys of the target data according to the target data grading framework.
  10. The apparatus according to claim 7, wherein the Reduce-side processing module comprises a target data hierarchical structure acquisition unit and a data aggregation storage unit, wherein:
    the target data hierarchical structure acquisition unit is configured for the Reduce side to acquire a preconfigured data hierarchical structure as the target data hierarchical structure; and
    the data aggregation storage unit is configured for the Reduce side to aggregate and store the shuffled target data in columnar form according to the target data hierarchical structure.
  11. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method according to any one of claims 1-6.
  12. A computer-readable storage medium storing a computer program, wherein the program, when executed by a processor, implements the method according to any one of claims 1-6.
PCT/CN2020/132377 2020-07-28 2020-11-27 Data dump method and apparatus, device, and storage medium WO2022021710A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010740395.4 2020-07-28
CN202010740395.4A CN111930731A (en) 2020-07-28 2020-07-28 Data dump method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2022021710A1 true WO2022021710A1 (en) 2022-02-03

Family

ID=73314777

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/132377 WO2022021710A1 (en) 2020-07-28 2020-11-27 Data dump method and apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN111930731A (en)
WO (1) WO2022021710A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111930731A (en) * 2020-07-28 2020-11-13 苏州亿歌网络科技有限公司 Data dump method, device, equipment and storage medium
CN112559462A (en) * 2020-12-14 2021-03-26 深圳供电局有限公司 Data compression method and device, computer equipment and storage medium
CN112650720A (en) * 2020-12-18 2021-04-13 深圳市佳创视讯技术股份有限公司 Cache system management method and device and computer readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130132371A1 (en) * 2011-11-23 2013-05-23 Infosys Technologies Limited Methods, systems, and computer-readable media for providing a query layer for cloud databases
CN105912636A (en) * 2016-04-08 2016-08-31 金蝶软件(中国)有限公司 Map/Reduce based ETL data processing method and device
CN108170535A (en) * 2017-12-30 2018-06-15 北京工业大学 A kind of method of the promotion table joint efficiency based on MapReduce model
CN108415912A (en) * 2017-02-09 2018-08-17 阿里巴巴集团控股有限公司 Data processing method based on MapReduce model and equipment
CN111930731A (en) * 2020-07-28 2020-11-13 苏州亿歌网络科技有限公司 Data dump method, device, equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9495427B2 (en) * 2010-06-04 2016-11-15 Yale University Processing of data using a database system in communication with a data processing framework
CN103023805A (en) * 2012-11-22 2013-04-03 北京航空航天大学 MapReduce system
CN103440244A (en) * 2013-07-12 2013-12-11 广东电子工业研究院有限公司 Large-data storage and optimization method
CN106126727A (en) * 2016-07-01 2016-11-16 中国传媒大学 A kind of big data processing method of commending system
CN108153585B (en) * 2017-12-01 2021-08-20 北京大学 Method and device for optimizing operation efficiency of MapReduce framework based on locality expression function
CN110162528A (en) * 2019-05-24 2019-08-23 安徽芃睿科技有限公司 Magnanimity big data search method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ANONYMOUS: " MapReduce Design Patterns (chapter 4 (part 1))", CSDN, 7 January 2013 (2013-01-07), XP055890569, Retrieved from the Internet <URL:https://blog.csdn.net/cuirong1986/article/details/8476368> [retrieved on 20220211] *
SONG AIBO, WAN YUTONG;GONG HUAN;XUE YINGYING: "Research on Storage and Query of Large-Scale Multidimensional Data", COMPUTER ENGINEERING AND APPLICATIONS, HUABEI JISUAN JISHU YANJIUSUO, CN, vol. 52, no. 13, 5 April 2016 (2016-04-05), CN , pages 25 - 31, XP055890564, ISSN: 1002-8331, DOI: 10.3778/j.issn.1002-8331.1601-0357 *

Also Published As

Publication number Publication date
CN111930731A (en) 2020-11-13

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20946965

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20946965

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 03.08.2023)