CN111291040B

CN111291040B - Data processing method, device, equipment and medium

Info

Publication number: CN111291040B
Application number: CN201811502713.2A
Authority: CN
Inventors: 黎亚龙
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Group Sichuan Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Group Sichuan Co Ltd
Priority date: 2018-12-10
Filing date: 2018-12-10
Publication date: 2022-10-18
Anticipated expiration: 2038-12-10
Also published as: CN111291040A

Abstract

The embodiment of the invention provides a data processing method, a device, equipment and a computer storage medium. The data processing method comprises the following steps: receiving a Structured Query Language (SQL) request; acquiring, according to the SQL request, access data of Hive to HDFS; computing, according to the access data, an expected score for access to a specified HDFS directory of a Hive table in the next period; and migrating data within heterogeneous storage according to the expected score. The method and the device address the poor flexibility and low efficiency of data processing in the prior art.

Description

Data processing method, device, equipment and medium

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a data processing method, apparatus, device, and medium.

Background

The Hadoop Distributed File System (HDFS) is a distributed file system that runs on commodity hardware. It offers high fault tolerance and high throughput, is suited to large-scale data sets, and is a basic component of the big data ecosystem.

HDFS provides a heterogeneous storage function: different data can be stored in different areas, and files are placed on the corresponding SSD or HDD by setting a storage policy. Compared with an HDD, an SSD has faster read and write speeds and lower read and write latency. Data from Intel's tests are shown in Table 1 below:

TABLE 1

The storage policy may be set with hdfs dfsadmin -setStoragePolicy <path> <policyName>. For blocks whose storage policy has changed, the hdfs mover tool migrates the corresponding blocks. Block migration is the process of moving data from SSD to HDD or from HDD to SSD. The storage structure of heterogeneous HDFS is shown in FIG. 1.

Hive is a data warehouse infrastructure built on Hadoop. It provides a set of tools for data extraction, transformation and loading (ETL), and a mechanism for storing, querying and analyzing large-scale data held in Hadoop. Hive defines a simple SQL-like query language, called HQL, which allows users familiar with SQL to query data. In essence, Hive maps structured data files to database tables and provides SQL query functionality; during a query, Hive reads data blocks from HDFS. Hive data storage depends on the underlying HDFS: when Hive runs a job, it reads data from disk through the read method defined in HDFS, so the read speed of HDFS directly affects how efficiently the program runs.

In the prior art, storage is inflexible. The official distribution provides only the hdfs mover tool, and Hive only optimizes temporary tables by placing them on SSD or in RAM. A cluster sees a large number of ad hoc read requests, for which frequent data migration is impractical, while periodic reports can only be identified by observation over a period of time. File storage policies are marked manually, so storage blocks cannot be migrated according to the current state and usage of the cluster.

In summary, in the prior art, the data processing method has poor flexibility and low efficiency.

Disclosure of Invention

The embodiment of the invention provides a data processing method, a data processing device, data processing equipment and a computer storage medium, which are used for solving the problems of poor flexibility and low efficiency of a data processing mode in the prior art.

In a first aspect, an embodiment of the present invention provides a data processing method, where the method includes: receiving a Structured Query Language (SQL) request;

acquiring access data of Hive to HDFS according to the SQL request; computing, according to the access data, an expected score for access to a specified HDFS directory of a Hive table in the next period; and migrating data within heterogeneous storage according to the expected score.

In a second aspect, an embodiment of the present invention provides a data processing apparatus, where the apparatus includes: a receiving module, configured to receive a Structured Query Language (SQL) request and to acquire access data of Hive to HDFS according to the SQL request; a processing module, configured to compute, according to the access data, an expected score for access to a specified HDFS directory of a Hive table in the next period; and an execution module, configured to migrate data within heterogeneous storage according to the expected score.

In a third aspect, an embodiment of the present invention provides a data processing device, including: at least one processor, at least one memory, and computer program instructions stored in the memory, which, when executed by the processor, implement the method of the first aspect of the embodiments described above.

In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which computer program instructions are stored, which, when executed by a processor, implement the method of the first aspect in the foregoing embodiments.

The data processing method, the device, the equipment and the medium provided by the embodiment of the invention receive a Structured Query Language (SQL) request; acquire access data of Hive to HDFS according to the SQL request; compute, according to the access data, an expected score for access to a specified HDFS directory of a Hive table in the next period; and migrate data within heterogeneous storage according to the expected score. The scheme makes full use of the different read characteristics of the SSDs and HDDs underlying HDFS and, by combining Hive's storage rule on HDFS with its access pattern, derives the access pattern through a periodic algorithm. Within a regular access period, data about to be read is migrated from low-speed storage to high-speed storage by an automatic migration tool, and data outside that period is migrated from high-speed storage back to low-speed storage. This improves Hive processing efficiency, makes full use of the heterogeneous storage characteristics, and raises disk utilization.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 shows a storage structure diagram of a heterogeneous HDFS;

FIG. 2 illustrates logical storage rules of Hive files in HDFS;

FIG. 3 is a schematic diagram of a data processing method proposed by the present invention;

FIG. 4 is a diagram illustrating a table structure corresponding to a directory and a storage policy in a relational database;

FIG. 5 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention;

FIG. 6 is a schematic structural diagram of a data processing device according to an embodiment of the present invention.

Detailed Description

Features and exemplary embodiments of various aspects of the present invention will be described in detail below, and in order to make objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It will be apparent to one skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the present invention by illustrating examples of the present invention.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of another like element in a process, method, article, or apparatus that comprises the element.

Hive is a data warehouse infrastructure built on Hadoop. It provides a series of tools for data extraction, transformation and loading (ETL), and a mechanism for storing, querying and analyzing large-scale data held in Hadoop. Hive defines a simple SQL-like query language, called HQL, which allows users familiar with SQL to query data. In essence, Hive maps structured data files to database tables and provides SQL query functionality; during a query, Hive reads data blocks from HDFS.

Hive data storage depends on the underlying HDFS: when Hive runs a job, it reads data from disk through the read method defined in HDFS, so the read speed of HDFS directly affects how efficiently the program runs. The logical storage rule of Hive files in HDFS is shown in FIG. 2.

As shown in FIG. 3, the specific processing flow of the data processing method provided in the embodiment of the present invention is as follows:

Step 31: receive a Structured Query Language (SQL) request.

In a specific implementation, the Hive Client submits an SQL request to the big data cluster.

The SQL request can be received through access interfaces such as Beeline and/or Hive JDBC.

Step 32: parse and store the received SQL request.

In a specific implementation, the SQL request is parsed and then submitted to the big data cluster. Hive data is stored in HDFS according to the following rule:

hive.metastore.warehouse.dir/dbname.db/tablename/partition

where tablename and partition correspond to the table name and partition in Hive, respectively, so the access pattern of a Hive table can be derived by counting accesses to HDFS.
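
The storage rule above means that an HDFS path can be mapped back to the Hive database, table and partition it belongs to. The following Python sketch illustrates one way to do that mapping. The names parse_hive_path and HiveLocation are hypothetical, and the sketch assumes that partition directories use the key=value form; it is an illustration, not the implementation described in this embodiment.

```python
from typing import NamedTuple, Optional


class HiveLocation(NamedTuple):
    database: str
    table: str
    partition: Optional[str]  # e.g. "dt=2018-12-10", or None if no partition directory


def parse_hive_path(path: str, warehouse_dir: str) -> Optional[HiveLocation]:
    """Map an HDFS path under the warehouse directory to (database, table, partition)."""
    root = warehouse_dir.rstrip("/") + "/"
    if not path.startswith(root):
        return None  # not under the Hive warehouse directory
    parts = [p for p in path[len(root):].split("/") if p]
    if len(parts) < 2 or not parts[0].endswith(".db"):
        return None  # does not follow dbname.db/tablename/...
    database = parts[0][: -len(".db")]
    table = parts[1]
    # Assumption: partition directories look like key=value; data files are ignored.
    partition_dirs = [p for p in parts[2:] if "=" in p]
    partition = "/".join(partition_dirs) or None
    return HiveLocation(database, table, partition)
```

Paths that do not match the dbname.db/tablename layout return None, so a caller can simply skip accesses that do not belong to a table stored under this rule.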

Step 33: acquire the access data of Hive to HDFS according to the received SQL request.

Hive's access to HDFS can be intercepted by modifying the code through which Hive calls HDFS, thereby obtaining the access data.

During execution of Hive SQL, the HDFS directory corresponding to the table is accessed according to the storage rule above. Specifically, data in HDFS is read through the FileSystem.open() method, so all HDFS read requests can be captured by modifying the FileSystem.open() method in the HDFS source code.

For each incoming read request, a single thread is started asynchronously to record the access, inserting the request path, request time and request size in bytes into a database. In one possible implementation, the interception time and directory address are recorded in the database after interception, while Hive continues with its normal access logic.

Embedding this code provides access monitoring of the access path.
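
The asynchronous recording of read requests can be sketched as follows. The embodiment modifies Java code inside HDFS; the Python sketch below only illustrates the pattern of handing the record to a background thread so that the read itself is not delayed. It assumes a single long-lived worker thread fed by a queue and an SQLite database; the class name AccessRecorder and the table name access_log are hypothetical.

```python
import queue
import sqlite3
import threading
import time


class AccessRecorder:
    """Writes (path, time, bytes) rows on one background thread so callers never wait on the database."""

    def __init__(self, db_path: str):
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, args=(db_path,), daemon=True)
        self._worker.start()

    def record(self, path: str, size_bytes: int) -> None:
        # Called from the intercepted read path: enqueue and return immediately.
        self._queue.put((path, time.time(), size_bytes))

    def close(self) -> None:
        self._queue.put(None)  # sentinel that tells the worker to stop
        self._worker.join()

    def _run(self, db_path: str) -> None:
        # The connection is created and used only on the worker thread.
        conn = sqlite3.connect(db_path)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS access_log "
            "(path TEXT, access_time REAL, size_bytes INTEGER)"
        )
        while True:
            item = self._queue.get()
            if item is None:
                break
            conn.execute("INSERT INTO access_log VALUES (?, ?, ?)", item)
            conn.commit()
        conn.close()
```

A queue with one worker is a design choice made here for simplicity: it keeps database writes off the read path and avoids creating a new thread per request.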

In the technical solution provided by the embodiment of the invention, a table mapping directories to storage policies is established in a relational database. Storage policies are scanned on a schedule, once every hour: a program scans them with the hdfs storagepolicies command and updates the results in the database. The table design proposed by the embodiment of the present invention is shown in FIG. 4.
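
A table that maps directories to storage policies might look like the following sketch. The actual design is the one in FIG. 4, which is not reproduced in this text, so the table name dir_storage_policy and its columns are assumptions made purely for illustration, using SQLite as a stand-in relational database.

```python
import sqlite3
import time
from typing import Optional


def init_policy_table(conn: sqlite3.Connection) -> None:
    # Hypothetical schema: one row per directory with its last scanned policy.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS dir_storage_policy ("
        " directory TEXT PRIMARY KEY,"
        " policy TEXT NOT NULL,"
        " scanned_at REAL NOT NULL)"
    )


def upsert_policy(conn: sqlite3.Connection, directory: str, policy: str,
                  scanned_at: Optional[float] = None) -> None:
    """Insert or refresh the policy recorded for one directory."""
    ts = time.time() if scanned_at is None else scanned_at
    conn.execute(
        "INSERT OR REPLACE INTO dir_storage_policy (directory, policy, scanned_at) "
        "VALUES (?, ?, ?)",
        (directory, policy, ts),
    )
    conn.commit()
```

An hourly scan would call upsert_policy once per directory, so the table always holds the most recently observed policy for each path.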

Step 34: compute, according to the access data, an expected score for access to a specified HDFS directory of the Hive table in the next period.

In a specific implementation, the expected score for access to a given HDFS directory of the Hive table in the next period can be computed from the access data obtained in the preceding steps, that is, the stored interception times and directory addresses.

The access time frequency is counted to obtain the expected value for the next period. The algorithm is as follows:

where:

T = the statistical period of the probability, which can be set to a day, a month or a year;

E(x) = the expected value obtained;

t_1 ... t_n = the 1st to nth time periods;

t_i = the ith time period;

i = the total number of time periods that have occurred;

j = the total number of directories;

k_j = the number of accesses to the jth directory in a single time period;

k_x = the number of file accesses that occur for the current directory.
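
The formula itself is not reproduced in this text, so it cannot be restated here. As a stand-in only, the sketch below estimates the expected number of accesses to a directory in each time period of the next statistical period as the plain empirical mean over the periods already observed. This is a simplification chosen for illustration and is not the formula of this embodiment; the function name expected_accesses_per_slot and the input layout are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def expected_accesses_per_slot(
    records: Iterable[Tuple[str, int]], periods_observed: int
) -> Dict[Tuple[str, int], float]:
    """records holds one (directory, slot) pair per access, pooled over all observed periods.

    slot is the index of the time period within the statistical period T, for
    example the hour of the day when T is one day. The result maps each
    (directory, slot) to the mean number of accesses per statistical period.
    """
    if periods_observed <= 0:
        raise ValueError("periods_observed must be positive")
    counts = Counter(records)
    return {key: n / periods_observed for key, n in counts.items()}
```

For instance, a directory read six times at hour 9 over three observed days gets an expected value of 2.0 accesses for hour 9 of the next day.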

This step yields, for the files in a single directory, the expectation that they will be accessed in the next period. If the expectation is lower than a set value, the storage is downgraded; if it is greater than the set value, the storage is upgraded.
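
The upgrade and downgrade decision can be sketched as a simple threshold test, reading the rule as: a score above the set value moves the directory to faster storage, otherwise to slower storage. The function name plan_migration is hypothetical. ALL_SSD and HOT are existing HDFS storage policy names, but choosing them as the fast and slow tiers, and treating a score equal to the threshold as "slow", are assumptions of this sketch.

```python
from typing import Optional


def plan_migration(score: float, threshold: float, current_policy: str,
                   fast_policy: str = "ALL_SSD", slow_policy: str = "HOT") -> Optional[str]:
    """Return the storage policy to switch to, or None when no change is needed."""
    # Assumption: only scores strictly above the threshold earn the fast tier.
    target = fast_policy if score > threshold else slow_policy
    return None if target == current_policy else target
```

Returning None when the directory already has the target policy avoids scheduling a migration that would move nothing.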

The scheme migrates data on the basis of different time periods, unlike migration criteria that are set manually or based on a single period, and can therefore improve Hive processing efficiency more fully.

The scheme exploits the way Hive stores data on HDFS: Hive's requests are intercepted at the HDFS access interface and recorded in a database, and the records are written on a concurrent thread to reduce the impact of the embedded code on HDFS access.

When Hive is used for OLAP analysis, the cluster comes under heavy pressure as the number of Hive tasks grows with the business, and Hive is an IO-intensive workload. Disk access speed directly determines when results become available; when the reports of some departments cannot be completed, normal business operation is seriously affected. After the technical solution provided by the embodiment of the invention is adopted, daily analysis tasks finish much earlier, and service stability is maintained even when data volume grows rapidly.

Step 35: migrate the data within heterogeneous storage according to the expected score.

The data is migrated within heterogeneous storage with the HDFS mover tool according to the next-period score of the Hive table.

For example, hdfs mover [-p <files/dirs>] is run periodically to perform disk migration.
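
A periodic job could assemble the migration commands for one directory as in the sketch below. It only builds the argument lists and does not run them. The hdfs mover -p form is the one given in the text; the hdfs storagepolicies -setStoragePolicy -path ... -policy ... form is the one found in recent Hadoop releases and is an assumption about the cluster version. The function name build_migration_commands is hypothetical.

```python
from typing import List


def build_migration_commands(path: str, policy: str) -> List[List[str]]:
    """Return the commands that first retag a directory and then move its blocks."""
    return [
        # Assumption: a Hadoop release that provides the storagepolicies subcommand.
        ["hdfs", "storagepolicies", "-setStoragePolicy", "-path", path, "-policy", policy],
        # The mover relocates blocks that no longer satisfy the directory's policy.
        ["hdfs", "mover", "-p", path],
    ]
```

On a node with the Hadoop client installed, each list could be passed to subprocess.run(cmd, check=True) from a scheduler such as cron.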

The technical solution provided by the embodiment of the invention makes full use of the different read characteristics of the SSDs and HDDs underlying HDFS and, by combining Hive's storage rule on HDFS with its access pattern, derives the access pattern through a periodic algorithm. Within a regular access period, data about to be read is migrated from low-speed storage to high-speed storage by an automatic migration tool, and data outside that period is migrated from high-speed storage back to low-speed storage. This improves Hive processing efficiency, makes full use of the heterogeneous storage characteristics, and raises disk utilization.

An embodiment of the present invention further provides a data processing apparatus, as shown in FIG. 5, including:

a receiving module 501, configured to receive a Structured Query Language (SQL) request and to acquire access data of Hive to HDFS according to the SQL request;

a processing module 502, configured to compute, according to the access data, an expected score for access to a specified HDFS directory of a Hive table in the next period; and

an execution module 503, configured to migrate data within heterogeneous storage according to the expected score.

Specifically, the receiving module 501 is configured to receive the SQL request through a Beeline and/or Hive JDBC access interface.

Specifically, the receiving module 501 is configured to intercept Hive's access to HDFS by modifying the code through which Hive calls HDFS, and to obtain the access data.

Specifically, the receiving module 501 is configured to determine HDFS read requests by modifying the FileSystem.open() method in the HDFS source code and, for each read request, to asynchronously start a single thread to record the access data corresponding to the read request.

The access data includes an access time frequency.

Specifically, the processing module 502 is configured to count the access time frequency according to the following formula to obtain an expected value for the next period, the expected value being used as the expected score for access to the specified HDFS directory of the Hive table in the next period:

where E(x) is the expected value to be found; t_1 ... t_n are the 1st to nth time periods; t_i is the ith time period; i is the total number of time periods that have occurred; j is the total number of directories; k_j is the number of accesses to the jth directory in a single time period; and k_x is the number of file accesses that occur for the current directory.

Optionally, the apparatus is further configured to parse and store the received SQL request, wherein the data storage satisfies the following rule:

hive.metastore.warehouse.dir/dbname.db/tablename/partition

where tablename and partition correspond to the table name and partition in Hive, respectively.

Specifically, the apparatus may further include:

a storage module, configured to establish a correspondence between directories and storage policies in a relational database, and to store the access data according to the correspondence.

Specifically, the execution module is configured to migrate the data within heterogeneous storage using the HDFS mover according to the expected score of the Hive table for the next period.

Specifically, the execution module is configured to periodically use the HDFS mover to perform disk migration.

In addition, the data processing method of the embodiment of the present invention described in conjunction with fig. 3 may be implemented by a data processing apparatus. Fig. 6 is a schematic diagram illustrating a hardware structure of a data processing apparatus according to an embodiment of the present invention.

The data processing device may comprise a processor 601 and a memory 602 in which computer program instructions are stored.

Specifically, the processor 601 may include a Central Processing Unit (CPU), or an Application Specific Integrated Circuit (ASIC), or may be configured as one or more Integrated circuits implementing embodiments of the present invention.

Memory 602 may include mass storage for data or instructions. By way of example, and not limitation, memory 602 may include a hard disk drive (HDD), floppy disk drive, flash memory, optical disk, magneto-optical disk, tape, or Universal Serial Bus (USB) drive, or a combination of two or more of these. Memory 602 may include removable or non-removable (or fixed) media, where appropriate. The memory 602 may be internal or external to the data processing device, where appropriate. In a particular embodiment, the memory 602 is a non-volatile solid-state memory. In a particular embodiment, the memory 602 includes read-only memory (ROM). Where appropriate, the ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), electrically alterable ROM (EAROM), or flash memory, or a combination of two or more of these.

The processor 601 realizes any one of the data processing methods in the above embodiments by reading and executing computer program instructions stored in the memory 602.

In one example, the data processing device may also include a communication interface 603 and a bus 610. As shown in fig. 6, the processor 601, the memory 602, and the communication interface 603 are connected via a bus 610 to complete communication therebetween.

The communication interface 603 is mainly used for implementing communication between modules, apparatuses, units and/or devices in the embodiments of the present invention.

Bus 610 includes hardware, software, or both to couple the components of the data processing device to each other. By way of example, and not limitation, a bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCI-X) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), or other suitable bus or a combination of two or more of these. Bus 610 may include one or more buses, where appropriate. Although specific buses have been described and shown in the embodiments of the invention, any suitable buses or interconnects are contemplated by the invention.

In addition, in combination with the data processing method in the foregoing embodiments, the embodiments of the present invention may be implemented by providing a computer-readable storage medium. The computer-readable storage medium has computer program instructions stored thereon; the computer program instructions, when executed by a processor, implement any of the data processing methods in the above embodiments.

It is to be understood that the invention is not limited to the specific arrangements and instrumentality described above and shown in the drawings. A detailed description of known methods is omitted herein for the sake of brevity. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present invention are not limited to the specific steps described and illustrated, and those skilled in the art can make various changes, modifications and additions or change the order between the steps after comprehending the spirit of the present invention.

The functional blocks shown in the above-described structural block diagrams may be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, they may be, for example, electronic circuits, Application Specific Integrated Circuits (ASICs), suitable firmware, plug-ins, function cards, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The programs or code segments can be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, radio frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the Internet or an intranet.

It should also be noted that the exemplary embodiments mentioned in this patent describe some methods or systems based on a series of steps or devices. However, the present invention is not limited to the order of the above-described steps, that is, the steps may be performed in the order mentioned in the embodiments, may be performed in an order different from the order in the embodiments, or may be performed simultaneously.

As described above, only the specific embodiments of the present invention are provided, and it can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working processes of the system, the module and the unit described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. It should be understood that the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the present invention, and these modifications or substitutions should be covered within the scope of the present invention.

Claims

1. A method of data processing, the method comprising:

receiving a Structured Query Language (SQL) request;

acquiring access data of Hive to the HDFS according to the SQL request;

computing, according to the access data, an expected score for access to a specified HDFS directory of a Hive table in the next period;

transferring data in a heterogeneous storage according to the expectation score;

wherein the access data includes an access time frequency, and

computing, according to the access data, the expected score for access to the specified HDFS directory of the Hive table in the next period comprises:

counting the access time frequency according to the following formula to obtain an expected value for the next period, the expected value being used as the expected score for access to the specified HDFS directory of the Hive table in the next period:

where E(x) is the expected value to be found; t_1 ... t_n are the 1st to nth time periods; t_i is the ith time period; i is the total number of time periods that have occurred; j is the total number of directories; k_j is the number of accesses to the jth directory in a single time period; and k_x is the number of file accesses that occur for the current directory.

2. The method of claim 1, wherein receiving an SQL request comprises:

SQL requests are received through Beeline and/or Hive JDBC access interfaces.

3. The method of claim 1, wherein obtaining Hive access data to the HDFS comprises:

intercepting Hive's access to HDFS by modifying the code through which Hive calls HDFS, and acquiring the access data;

wherein intercepting Hive's access to HDFS by modifying the code through which Hive calls HDFS, and acquiring the access data, comprises:

determining HDFS read requests by modifying the FileSystem.open() method in the HDFS source code; and

for each read request, asynchronously starting a single thread to record the access data corresponding to the read request.

4. The method according to claim 1, after receiving the SQL request and before acquiring the access data of Hive to the HDFS according to the SQL request, further comprising:

parsing and storing the received SQL request, wherein the data storage satisfies the following rule:

hive.metastore.warehouse.dir/dbname.db/tablename/partition

5. The method according to claim 1, after acquiring the access data of Hive to the HDFS, and before computing, according to the access data, the expected score for access to the specified HDFS directory of the Hive table in the next period, further comprising:

establishing a corresponding relation between a directory and a storage strategy in a relational database;

and storing the access data according to the corresponding relation.

6. The method of any of claims 1 to 5, wherein transferring data in a heterogeneous storage based on the expectation score comprises:

migrating the data within heterogeneous storage using the HDFS mover according to the expected score of the Hive table for the next period.

7. The method of claim 6, wherein transferring data in a heterogeneous storage based on an HDFS mover comprises:

periodically using the HDFS mover to perform disk migration.

8. A data processing apparatus, characterized in that the apparatus comprises:

a receiving module, configured to receive a Structured Query Language (SQL) request and to acquire access data of Hive to the HDFS according to the SQL request;

a processing module, configured to compute, according to the access data, an expected score for access to a specified HDFS directory of a Hive table in the next period; and

an execution module, configured to migrate data within heterogeneous storage according to the expected score;

wherein the access data includes an access time frequency, and

the processing module is further configured to count the access time frequency according to the following formula to obtain an expected value for the next period, the expected value being used as the expected score for access to the specified HDFS directory of the Hive table in the next period:

9. A data processing apparatus, characterized by comprising: at least one processor, at least one memory, and computer program instructions stored in the memory that, when executed by the processor, implement the method of any of claims 1-7.

10. A computer-readable storage medium having computer program instructions stored thereon, which when executed by a processor implement the method of any one of claims 1-7.