CN107092676A

CN107092676A - Data processing method and device

Info

Publication number: CN107092676A
Application number: CN201710253998.XA
Authority: CN
Inventors: 丁良奎
Original assignee: Guangdong Inspur Big Data Research Co Ltd
Current assignee: Guangdong Inspur Smart Computing Technology Co Ltd
Priority date: 2017-04-18
Filing date: 2017-04-18
Publication date: 2017-08-25

Abstract

The invention discloses a data processing method and device. The method includes: converting the data to be processed, stored in a distributed computing platform, into the columnar Parquet format; and using Spark to load the Parquet-format data into memory by row group, decompressing and deserializing the data in memory, and then sending it to an FPGA heterogeneous computing device for the corresponding computation. In the disclosed technical solution, the data to be processed is first converted into the columnar Parquet format, then loaded into memory one row group at a time and, after this processing, sent to the FPGA heterogeneous computing device for computation. Because the Parquet-format data is read by row group, the data processing speed of the FPGA heterogeneous computing device is substantially improved.

Description

A data processing method and device

Technical field

The present invention relates to the field of heterogeneous acceleration for deep learning, and more specifically to a data processing method and device.

Background

The development of big data and high-performance computing platforms has greatly accelerated the progress of deep learning. In deep learning, when the data set and the parameter model are very large, a distributed computing framework is needed to train large-scale models. Apache Spark offers outstanding in-memory computing and scalable distributed computing performance, and can effectively improve hyperparameter tuning and the deployment of large-scale models in deep learning.

The two key activities of deep learning, classification and convolution, require a very high degree of inherent parallelism and a large amount of floating-point and matrix computation. GPUs and FPGAs, with their strong heterogeneous acceleration and floating-point capability, greatly improve the speed and performance of deep learning; compared with CPU processing, they require fewer servers and consume less power.

After parsing the metadata, Spark stores a relative offset and a length in memory for variable-length data, and uses an UnsafeRow instance as a pointer to the row data stored with that length; a row's data is read into memory only when that row is processed. When accelerating data processing, the FPGA can therefore only read in this one row of data at a time. The inventor found that, when an FPGA is used to accelerate Spark computation, this way of reading data clearly fails to make full use of the FPGA's parallelism, so the FPGA reads data slowly, which in turn slows the FPGA's processing of the data.

In summary, how to provide a technical solution that makes full use of the FPGA's parallelism to improve its data processing speed is a problem that those skilled in the art urgently need to solve.

Summary of the invention

An object of the present invention is to provide a data processing method and device that make full use of the FPGA's parallelism to improve its data processing speed.

To achieve the above object, the present invention provides the following technical solution:

A data processing method, including:

converting the data to be processed, stored in a distributed computing platform, into the columnar Parquet format;

using Spark to load the Parquet-format data to be processed into memory by row group, decompressing and deserializing the data in memory, and then sending it to an FPGA heterogeneous computing device for the corresponding computation.

Preferably, converting the data to be processed into the columnar Parquet format includes:

generating a Parquet file that contains all of the data to be processed, and splitting all of the data in that file into row groups, where each row group contains a column chunk for every column of the data to be processed, and each column chunk stores the corresponding portion of the data in its column.

Preferably, splitting all of the data to be processed into row groups in the Parquet file includes:

splitting all of the data to be processed into row groups in the Parquet file, where the size of each row group is the same as the size of a data block in the distributed computing platform.

Preferably, using Spark to load the Parquet-format data to be processed into memory by row group includes:

using functions written in advance in the C language within Spark to load the Parquet-format data to be processed into memory by row group.

Preferably, after the decompressed and deserialized data to be processed is sent from memory to the FPGA heterogeneous computing device, the method further includes:

obtaining the computation result returned by the FPGA heterogeneous computing device and returning it to Spark, the computation result being obtained by the FPGA heterogeneous computing device performing the corresponding computation on the data it received.

A data processing device, including:

a conversion module, configured to convert the data to be processed, stored in a distributed computing platform, into the columnar Parquet format;

a processing module, configured to use Spark to load the Parquet-format data to be processed into memory by row group, decompress and deserialize the data in memory, and then send it to an FPGA heterogeneous computing device for the corresponding computation.

Preferably, the conversion module includes:

a conversion unit, configured to generate a Parquet file that contains all of the data to be processed and to split all of the data in that file into row groups, where each row group contains a column chunk for every column of the data to be processed, and each column chunk stores the corresponding portion of the data in its column.

Preferably, the conversion unit includes:

a conversion subunit, configured to split all of the data to be processed into row groups in the Parquet file, where the size of each row group is the same as the size of a data block in the distributed computing platform.

Preferably, the processing module includes:

a loading unit, configured to use functions written in advance in the C language within Spark to load the Parquet-format data to be processed into memory by row group.

Preferably, the device further includes:

a return module, configured to, after the decompressed and deserialized data to be processed is sent from memory to the FPGA heterogeneous computing device, obtain the computation result returned by the FPGA heterogeneous computing device and return it to Spark, the computation result being obtained by the FPGA heterogeneous computing device performing the corresponding computation on the data it received.

The invention provides a data processing method and device. The method includes: converting the data to be processed, stored in a distributed computing platform, into the columnar Parquet format; and using Spark to load the Parquet-format data into memory by row group, decompressing and deserializing the data in memory, and then sending it to an FPGA heterogeneous computing device for the corresponding computation. In the technical solution disclosed in this application, the data to be processed is first converted into the columnar Parquet format, then loaded into memory by row group, processed there, and sent to the FPGA heterogeneous computing device for the corresponding computation. Unlike the prior-art approach of reading data one row at a time, this application reads the Parquet-format data by row group, which makes full use of the parallelism of the FPGA heterogeneous computing device, so that the device reads data faster and its data processing speed is substantially improved.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are merely embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow chart of a data processing method provided by an embodiment of the present invention;

Fig. 2 is a schematic structural diagram of a data processing device provided by an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present invention.

Referring to Fig. 1, which shows a flow chart of a data processing method provided by an embodiment of the present invention, the method may include:

S11: Convert the data to be processed, stored in the distributed computing platform, into the columnar Parquet format.

In this application, it may first be determined whether the data to be processed, stored in the distributed computing platform, is already in the columnar Parquet format. If it is, no conversion is performed; if not, the data needs to be converted into the columnar Parquet format. Compared with row-based storage, Parquet-format data reduces disk storage space, supports vectorized operations, and gives better scan performance. The executing entity of each step of the data processing method disclosed in this application may be a data processing device, and that device may be integrated into Spark, so the executing entity of each step in this application may be Spark.
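The text does not say how the format check is performed. One possible way, shown here only as a minimal sketch, relies on the fact that an unencrypted Parquet file begins and ends with the 4-byte magic number `PAR1`. The function names `looks_like_parquet` and `needs_conversion` and the temporary files are illustrative assumptions, not part of the disclosed method.

```python
import os
import tempfile

PARQUET_MAGIC = b"PAR1"  # marker at the start and end of an unencrypted Parquet file


def looks_like_parquet(path: str) -> bool:
    """Return True if the file starts and ends with the Parquet magic bytes."""
    if os.path.getsize(path) < 2 * len(PARQUET_MAGIC):
        return False
    with open(path, "rb") as f:
        head = f.read(len(PARQUET_MAGIC))
        f.seek(-len(PARQUET_MAGIC), os.SEEK_END)
        tail = f.read(len(PARQUET_MAGIC))
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC


def needs_conversion(path: str) -> bool:
    """Hypothetical helper for step S11: convert only data that is not yet Parquet."""
    return not looks_like_parquet(path)


# Illustration with two throwaway files (the first is a placeholder, not a valid Parquet file).
with tempfile.TemporaryDirectory() as tmp:
    parquet_like = os.path.join(tmp, "data.parquet")
    csv_like = os.path.join(tmp, "data.csv")
    with open(parquet_like, "wb") as f:
        f.write(PARQUET_MAGIC + b"\x00" * 16 + PARQUET_MAGIC)
    with open(csv_like, "wb") as f:
        f.write(b"id,value\n1,2.5\n")
    parquet_needs_conversion = needs_conversion(parquet_like)
    csv_needs_conversion = needs_conversion(csv_like)
```

A magic-number check only shows that a file claims to be Parquet; a real implementation would presumably also parse the footer metadata before skipping the conversion.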

S12: Use Spark to load the Parquet-format data to be processed into memory by row group, decompress and deserialize the data in memory, and then send it to the FPGA heterogeneous computing device for the corresponding computation.

Once the data to be processed has been converted into the columnar Parquet format, it can be loaded directly into memory by row group using Spark whenever it needs to be loaded. In memory, the corresponding interfaces provided by the Parquet library are then used to perform decompression, deserialization and related processing, after which the data is sent to the FPGA heterogeneous computing device so that the device performs the corresponding computation on the data it receives.

In the technical solution disclosed in this application, the data to be processed, stored in the distributed computing platform, is first converted into the columnar Parquet format, then loaded into memory by row group, processed there, and sent to the FPGA heterogeneous computing device for the corresponding computation. Unlike the prior-art approach of reading data one row at a time, this application reads the Parquet-format data by row group, which makes full use of the parallelism of the FPGA heterogeneous computing device, so that the device reads data faster and its data processing speed is substantially improved.
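The flow of step S12 can be illustrated with a toy sketch in Python. It is not the C implementation described in this application: `zlib` and `struct` stand in for Parquet's compression and encoding, the list `row_groups` stands in for a Parquet file, and `fpga_compute` is a hypothetical placeholder for the FPGA kernel (here a dot product of two columns). The point is only that a whole row group is decoded and handed over in one batch rather than one row at a time.

```python
import struct
import zlib
from typing import Dict, List

RowGroup = Dict[str, bytes]  # column name -> compressed, serialized column chunk


def encode_column_chunk(values: List[float]) -> bytes:
    """Serialize doubles to little-endian bytes, then compress them."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return zlib.compress(raw)


def decode_column_chunk(chunk: bytes) -> List[float]:
    """Decompress a column chunk, then deserialize it back into floats."""
    raw = zlib.decompress(chunk)
    count = len(raw) // 8  # 8 bytes per double
    return list(struct.unpack(f"<{count}d", raw))


def fpga_compute(columns: Dict[str, List[float]]) -> float:
    """Hypothetical stand-in for the FPGA kernel: dot product of columns x and w."""
    return sum(x * w for x, w in zip(columns["x"], columns["w"]))


def process_by_row_group(row_groups: List[RowGroup]) -> List[float]:
    """Load one row group at a time, decode all its columns, and send the batch on."""
    results = []
    for group in row_groups:
        columns = {name: decode_column_chunk(chunk) for name, chunk in group.items()}
        results.append(fpga_compute(columns))
    return results


row_groups = [
    {"x": encode_column_chunk([1.0, 2.0, 3.0]), "w": encode_column_chunk([0.5, 0.5, 0.5])},
    {"x": encode_column_chunk([4.0, 5.0]), "w": encode_column_chunk([2.0, 2.0])},
]
results = process_by_row_group(row_groups)  # one result per row group
```

Because each call to `fpga_compute` receives complete columns for many rows, an accelerator behind that interface would have a full batch to work on in parallel, which is the effect the text attributes to reading by row group.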

In a data processing method provided by an embodiment of the present invention, converting the data to be processed into the columnar Parquet format may include:

generating a Parquet file that contains all of the data to be processed, and splitting all of the data in that file into row groups, where each row group contains a column chunk for every column of the data to be processed, and each column chunk stores the corresponding portion of the data in its column.

It should be noted that converting the data to be processed into the columnar Parquet format specifically means cutting all of the data horizontally into row groups within the Parquet file. A row group contains the column chunk of the corresponding portion of every column, and a column chunk stores that portion of its column's data. In other words, horizontal cuts produce the row groups, and each row group is then cut by column internally, so that every row group contains a partial chunk of every column of the data, and all row groups together make up the complete data. For example, if data with 10 rows and 4 columns is to be cut into two row groups, the first 5 rows together with their data in the 4 columns form the first row group, and the last 5 rows together with their data in the 4 columns form the second, giving two row groups of 5 rows by 4 columns each. In this way the data to be processed can be converted into Parquet format quickly and conveniently, which further ensures the smooth implementation of the data processing method provided by the embodiment of the present invention.
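The 10-row, 4-column example can be reproduced with a short sketch. The table layout (a dictionary of column lists), the column names `c0` to `c3`, and the function `split_into_row_groups` are assumptions made for illustration; an actual Parquet writer also encodes, compresses and adds metadata to each column chunk.

```python
from typing import Dict, List

Table = Dict[str, List[int]]  # column name -> column values


def split_into_row_groups(table: Table, rows_per_group: int) -> List[Table]:
    """Cut the table horizontally; each row group keeps one column chunk per column."""
    if rows_per_group <= 0:
        raise ValueError("rows_per_group must be positive")
    num_rows = len(next(iter(table.values()))) if table else 0
    groups = []
    for start in range(0, num_rows, rows_per_group):
        end = start + rows_per_group
        groups.append({name: values[start:end] for name, values in table.items()})
    return groups


# 10 rows x 4 columns; each cell holds row * 10 + column index so it is easy to trace.
table = {f"c{col}": [row * 10 + col for row in range(10)] for col in range(4)}
row_groups = split_into_row_groups(table, rows_per_group=5)
```

Here `row_groups` has two entries, each with four column chunks of five values, and concatenating the chunks of a column across the row groups restores the original column.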

In a data processing method provided by an embodiment of the present invention, splitting all of the data to be processed into row groups in the Parquet file may include:

splitting all of the data to be processed into row groups in the Parquet file, where the size of each row group is the same as the size of a data block in the distributed computing platform.

It should be noted that the row group size can be configured according to actual needs. In this application it is set equal to the data block size of the distributed computing platform, so that data blocks do not need to be split or combined, which simplifies the operation.
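As a rough illustration of matching the row group size to the block size, the sketch below computes how many fixed-width rows fit in one block. The 128 MiB block size and the 32-byte row width are assumed values chosen for the example; the text specifies neither, and the calculation ignores compression, encoding and metadata overhead.

```python
def rows_per_row_group(block_size_bytes: int, bytes_per_row: int) -> int:
    """Largest number of fixed-width rows whose raw size fits in one data block."""
    if block_size_bytes <= 0 or bytes_per_row <= 0:
        raise ValueError("sizes must be positive")
    return max(1, block_size_bytes // bytes_per_row)


BLOCK_SIZE = 128 * 1024 * 1024  # assumed 128 MiB data block (not given in the text)
BYTES_PER_ROW = 4 * 8           # assumed 4 columns of 8-byte values
rows = rows_per_row_group(BLOCK_SIZE, BYTES_PER_ROW)
```

With one row group per block, a reader that is assigned a block can load exactly one row group without fetching part of a neighbouring block.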

In a data processing method provided by an embodiment of the present invention, using Spark to load the Parquet-format data to be processed into memory by row group may include:

using functions written in advance in the C language within Spark to load the Parquet-format data to be processed into memory by row group.

It should be noted that the technical solution provided by the embodiment of the present invention can be implemented in the C language: the corresponding functions can be written in C in advance to read the data to be processed, load it into memory by row group, and carry out the subsequent processing and transfer of the data. Specifically, this application implements an in-memory data model in C. The model is based on the Parquet storage format and, combined with the Java Native Interface (JNI, which provides communication between Scala and the C language and OpenCL, allowing code and data to interact across languages), implements Parquet file metadata parsing, data reading, decompression and deserialization, and provides an interface for interacting with Scala. The technical solution disclosed in this application can thus be implemented simply and quickly in C.

In a data processing method provided by an embodiment of the present invention, after the decompressed and deserialized data to be processed is sent from memory to the FPGA heterogeneous computing device, the method may further include:

obtaining the computation result returned by the FPGA heterogeneous computing device and returning it to Spark, the computation result being obtained by the FPGA heterogeneous computing device performing the corresponding computation on the data it received.

After the processed data is sent to the FPGA heterogeneous computing device and the algorithm in the device has performed the corresponding data processing, the result is fed back to Spark so that Spark can carry out the next operation based on the computation result.

It should further be noted that parts of the embodiments whose implementation principle is consistent with the corresponding prior-art solutions are not described in detail, to avoid excessive repetition.

An embodiment of the present invention further provides a data processing device which, as shown in Fig. 2, may include:

a conversion module 11, configured to convert the data to be processed, stored in a distributed computing platform, into the columnar Parquet format;

a processing module 12, configured to use Spark to load the Parquet-format data to be processed into memory by row group, decompress and deserialize the data in memory, and then send it to an FPGA heterogeneous computing device for the corresponding computation.

In a data processing device provided by an embodiment of the present invention, the conversion module may include:

a conversion unit, configured to generate a Parquet file that contains all of the data to be processed and to split all of the data in that file into row groups, where each row group contains a column chunk for every column of the data to be processed, and each column chunk stores the corresponding portion of the data in its column.

In a data processing device provided by an embodiment of the present invention, the conversion unit may include:

a conversion subunit, configured to split all of the data to be processed into row groups in the Parquet file, where the size of each row group is the same as the size of a data block in the distributed computing platform.

In a data processing device provided by an embodiment of the present invention, the processing module may include:

a loading unit, configured to use functions written in advance in the C language within Spark to load the Parquet-format data to be processed into memory by row group.

A data processing device provided by an embodiment of the present invention may further include:

a return module, configured to, after the decompressed and deserialized data to be processed is sent from memory to the FPGA heterogeneous computing device, obtain the computation result returned by the FPGA heterogeneous computing device and return it to Spark, the computation result being obtained by the FPGA heterogeneous computing device performing the corresponding computation on the data it received.

For an explanation of the relevant parts of the data processing device provided by the embodiment of the present invention, refer to the detailed description of the corresponding parts of the data processing method provided by the embodiment of the present invention; it is not repeated here.

The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A data processing method, characterized by including:

converting the data to be processed, stored in a distributed computing platform, into the columnar Parquet format;

using Spark to load the Parquet-format data to be processed into memory by row group, decompressing and deserializing the data in memory, and then sending it to an FPGA heterogeneous computing device for the corresponding computation.

2. The method according to claim 1, characterized in that converting the data to be processed into the columnar Parquet format includes:

generating a Parquet file that contains all of the data to be processed, and splitting all of the data in that file into row groups, where each row group contains a column chunk for every column of the data to be processed, and each column chunk stores the corresponding portion of the data in its column.

3. The method according to claim 2, characterized in that splitting all of the data to be processed into row groups in the Parquet file includes:

splitting all of the data to be processed into row groups in the Parquet file, where the size of each row group is the same as the size of a data block in the distributed computing platform.

4. The method according to claim 3, characterized in that using Spark to load the Parquet-format data to be processed into memory by row group includes:

using functions written in advance in the C language within Spark to load the Parquet-format data to be processed into memory by row group.

5. The method according to claim 4, characterized in that, after the decompressed and deserialized data to be processed is sent from memory to the FPGA heterogeneous computing device, the method further includes:

obtaining the computation result returned by the FPGA heterogeneous computing device and returning it to Spark, the computation result being obtained by the FPGA heterogeneous computing device performing the corresponding computation on the data it received.

6. A data processing device, characterized by including:

a conversion module, configured to convert the data to be processed, stored in a distributed computing platform, into the columnar Parquet format;

a processing module, configured to use Spark to load the Parquet-format data to be processed into memory by row group, decompress and deserialize the data in memory, and then send it to an FPGA heterogeneous computing device for the corresponding computation.

7. The device according to claim 6, characterized in that the conversion module includes:

a conversion unit, configured to generate a Parquet file that contains all of the data to be processed and to split all of the data in that file into row groups, where each row group contains a column chunk for every column of the data to be processed, and each column chunk stores the corresponding portion of the data in its column.

8. The device according to claim 7, characterized in that the conversion unit includes:

a conversion subunit, configured to split all of the data to be processed into row groups in the Parquet file, where the size of each row group is the same as the size of a data block in the distributed computing platform.

9. The device according to claim 8, characterized in that the processing module includes:

a loading unit, configured to use functions written in advance in the C language within Spark to load the Parquet-format data to be processed into memory by row group.

10. The device according to claim 9, characterized by further including:

a return module, configured to, after the decompressed and deserialized data to be processed is sent from memory to the FPGA heterogeneous computing device, obtain the computation result returned by the FPGA heterogeneous computing device and return it to Spark, the computation result being obtained by the FPGA heterogeneous computing device performing the corresponding computation on the data it received.