CN110321329A - Data processing method and device based on big data - Google Patents


Info

Publication number
CN110321329A
Authority
CN
China
Prior art keywords
file
data
numerical value
preset reference
data cleansing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910525331.XA
Other languages
Chinese (zh)
Inventor
周朝卫
刘垒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unihub China Information Technology Co Ltd
Zhongying Youchuang Information Technology Co Ltd
Original Assignee
Unihub China Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unihub China Information Technology Co Ltd filed Critical Unihub China Information Technology Co Ltd
Priority to CN201910525331.XA
Publication of CN110321329A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 - File systems; File servers
    • G06F16/11 - File system administration, e.g. details of archiving or snapshots
    • G06F16/119 - Details of migration of file systems
    • G06F16/16 - File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G06F16/162 - Delete operations
    • G06F16/17 - Details of further file system functions
    • G06F16/1737 - Details of further file system functions for reducing power consumption or coping with limited storage space, e.g. in mobile devices
    • G06F16/18 - File system types
    • G06F16/182 - Distributed file systems

Abstract

The embodiments of the present application provide a data processing method and device based on big data. The method includes: performing data cleansing on the original target files under a target directory to obtain at least one first target file; dividing all first target files into multiple combinations according to the file size of each first target file, wherein the sum of the file sizes of the first target files in each combination does not exceed a first preset reference value and the number of combinations is the minimum among all possible numbers; and merging all first target files in each combination to obtain a second target file, wherein each second target file is assigned one data block of a distributed file system. The application makes the storage of massive data more efficient, resource allocation more reasonable, and subsequent computation faster.

Description

Data processing method and device based on big data
Technical field
This application relates to the field of data processing, and in particular to a data processing method and device based on big data.
Background
Hadoop is a distributed system infrastructure developed by the Apache Foundation. Users can develop distributed programs without knowing the underlying details of the distributed system, and can thus make full use of the cluster for high-speed computation and storage. The two most central components of the Hadoop framework are HDFS (Hadoop Distributed File System) and MapReduce: HDFS provides storage for massive data, while MapReduce provides computation over massive data.
In the related art, the storage efficiency of storing data as files is low: a file with a small amount of data still occupies an individual HDFS data block, so resource allocation is unreasonable, the whole HDFS cluster holds an excessive number of data blocks, and the operation efficiency of subsequent data computation suffers.
Therefore, a data processing method and device based on big data are needed to solve the technical problems in the related art of low file storage efficiency, unreasonable resource allocation, and reduced operation efficiency of subsequent data computation.
Summary of the invention
In view of the problems in the prior art, the application provides a data processing method and device based on big data that make the storage of massive data more efficient, resource allocation more reasonable, and subsequent computation faster.
To solve the above technical problems, the application provides the following technical solutions:
In a first aspect, the application provides a data processing method based on big data, comprising:
performing data cleansing on the original target files under a target directory to obtain at least one first target file;
dividing all first target files into multiple combinations according to the file size of each first target file, wherein the sum of the file sizes of all first target files in each combination does not exceed a first preset reference value and the number of combinations is the minimum among all possible numbers;
merging all first target files in each combination to obtain a second target file, wherein each second target file is assigned one data block of a distributed file system.
Further, performing data cleansing on the original target files under the target directory to obtain at least one first target file comprises:
establishing a temporary directory corresponding to the target directory according to the file generation time of the original target files and the file warehousing time at which the original target files entered the target directory;
performing data cleansing on the original target files under the target directory to obtain at least one first target file;
storing the at least one first target file obtained after data cleansing into the temporary directory.
Further, dividing all first target files into multiple combinations according to the file size of each first target file comprises:
creating a storage space whose capacity is the first preset reference value;
storing one first target file into the storage space;
obtaining another first target file and storing it into an existing storage space; if the storage fails, creating a new storage space and storing that first target file into it; and repeating the step of obtaining another first target file until all first target files have been stored into storage spaces;
taking the at least one first target file in each storage space as one combination.
Further, dividing all first target files into multiple combinations according to the file size of each first target file comprises:
judging whether the file size of a first target file is greater than a second preset reference value, wherein the second preset reference value is less than the first preset reference value;
if so, taking each first target file greater than the second preset reference value as one combination on its own;
if not, creating a storage space whose capacity is the first preset reference value;
storing one first target file that is less than or equal to the second preset reference value into the storage space;
obtaining another first target file that is less than or equal to the second preset reference value and storing it into an existing storage space; if the storage fails, creating a new storage space and storing that first target file into it; and repeating the step of obtaining another such first target file until all first target files less than or equal to the second preset reference value have been stored into storage spaces;
taking the at least one first target file in each storage space as one combination.
Further, performing data cleansing on the original target files under the target directory to obtain at least one first target file comprises:
performing data cleansing on the original target files under the target directory, and obtaining the at least one first target file according to the file size of the original target files, the first preset reference value, and the data compression ratio of the file format conversion performed during data cleansing.
Further, obtaining the at least one first target file according to the file size of the original target files, the first preset reference value, and the data compression ratio of the file format conversion performed during data cleansing comprises:
obtaining the compression number of the first target files according to the file size of the original target files, the first preset reference value, and the data compression ratio of the file format conversion performed during data cleansing;
converting the file format of the original target files after data cleansing into a columnar storage format to obtain the at least one first target file, wherein the number of first target files is less than or equal to the compression number.
Further, the method also includes:
including, in the file name of each first target file, the file warehousing time at which the corresponding original target file was stored into the target directory and a time identifier corresponding to the file generation time of that original target file;
obtaining data entry delay information of the original target file according to the file warehousing time and the time identifier.
Further, after performing data cleansing on the original target files under the target directory to obtain at least one first target file, the method includes:
including, in the file name of the second target file, the file warehousing time at which the original target files were stored into the target directory;
judging whether the formal directory that stores the data blocks contains a data block whose name includes that file warehousing time;
if so, deleting all second target files in that data block, and migrating the second target files formed after data cleansing in the corresponding temporary directory into the data block corresponding to the formal directory.
Further, performing data cleansing on the original target files under the target directory comprises:
creating a data cleansing task and submitting it to a dynamic resource queue of the cluster;
obtaining the data cleansing task from the dynamic resource queue and performing data cleansing on the original target files under the target directory according to the task.
In a second aspect, the application provides a data processing device based on big data, comprising:
a data cleansing module, configured to perform data cleansing on the original target files under a target directory to obtain at least one first target file;
a file combination module, configured to divide all first target files into multiple combinations according to the file size of each first target file, wherein the sum of the file sizes of all first target files in each combination does not exceed a first preset reference value and the number of combinations is the minimum among all possible numbers;
a file merging module, configured to merge all first target files in each combination to obtain a second target file, wherein each second target file is assigned one data block of a distributed file system.
Further, the data cleansing module includes:
a temporary directory establishing unit, configured to establish a temporary directory corresponding to the target directory according to the file generation time of the original target files and the file warehousing time at which the original target files entered the target directory;
a data cleansing unit, configured to perform data cleansing on the original target files under the target directory to obtain at least one first target file;
a temporary directory storage unit, configured to store the at least one first target file obtained after data cleansing into the temporary directory.
Further, the file combination module includes:
a reference space creating unit, configured to create a storage space whose capacity is the first preset reference value;
a first target file storage unit, configured to store one first target file into the storage space;
a second target file storage unit, configured to obtain another first target file and store it into an existing storage space; if the storage fails, to create a new storage space and store that first target file into it; and to repeat obtaining another first target file until all first target files have been stored into storage spaces;
a target file combining unit, configured to take the at least one first target file in each storage space as one combination.
Further, the file combination module includes:
a file size judging unit, configured to judge whether the file size of a first target file is greater than a second preset reference value, wherein the second preset reference value is less than the first preset reference value;
a file arrangement and combination unit, configured to take each first target file whose file size is greater than the second preset reference value as one combination on its own, and otherwise to divide all first target files into multiple combinations according to the file size of each first target file.
Further, the data cleansing unit includes:
a data cleansing subunit, configured to perform data cleansing on the original target files under the target directory and to obtain the at least one first target file according to the file size of the original target files, the first preset reference value, and the data compression ratio of the file format conversion performed during data cleansing.
Further, the data cleansing subunit includes:
a compression number determining subunit, configured to obtain the compression number of the first target files according to the file size of the original target files, the first preset reference value, and the data compression ratio of the file format conversion performed during data cleansing;
a data format conversion subunit, configured to convert the file format of the original target files after data cleansing into a columnar storage format to obtain the at least one first target file, wherein the number of first target files is less than or equal to the compression number.
Further, the device also includes:
a first file naming module, configured to include, in the file name of each first target file, the file warehousing time at which the corresponding original target file was stored into the target directory and a time identifier corresponding to the file generation time of that original target file;
a delay monitoring module, configured to obtain data entry delay information of the original target file according to the file warehousing time and the time identifier.
Further, the data cleansing unit includes:
a second file naming subunit, configured to include, in the file name of the second target file, the file warehousing time at which the original target files were stored into the target directory;
a duplicate checking judgment subunit, configured to judge whether the formal directory that stores the data blocks contains a data block whose name includes that file warehousing time;
a duplicate checking processing subunit, configured to, when the formal directory is judged to contain such a data block, delete all second target files in that data block and migrate the second target files formed after data cleansing in the corresponding temporary directory into the data block corresponding to the formal directory.
Further, the data cleansing module includes:
a cleansing task submitting unit, configured to create a data cleansing task and submit it to a dynamic resource queue of the cluster;
a cleansing task executing unit, configured to obtain the data cleansing task from the dynamic resource queue and perform data cleansing on the original target files under the target directory according to the task.
In a third aspect, the application provides an electronic device including a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the program, implements the steps of the data processing method based on big data.
In a fourth aspect, the application provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the data processing method based on big data.
As can be seen from the above technical solutions, the application provides a data processing method and device based on big data. Data cleansing is first performed on the original target files under a target directory in the HDFS file system to obtain first target files, whose file size after cleansing does not exceed the first preset reference value but may still be considerably smaller, which would otherwise waste storage resources. All first target files are then divided into multiple combinations according to their file sizes, so that the sum of the file sizes of the first target files in each combination is as close as possible to the first preset reference value, i.e. the maximum capacity of an HDFS data block. Finally, all first target files in each combination are merged into one second target file, which is assigned one HDFS data block, so that the number of HDFS data blocks used to store the original target files is minimized. This achieves better storage efficiency and more reasonable resource allocation, and in turn improves the operation efficiency of subsequent data computation.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a first flow diagram of the data processing method based on big data in an embodiment of the present application;
Fig. 2 is a second flow diagram of the data processing method based on big data in an embodiment of the present application;
Fig. 3 is a third flow diagram of the data processing method based on big data in an embodiment of the present application;
Fig. 4 is a fourth flow diagram of the data processing method based on big data in an embodiment of the present application;
Fig. 5 is a fifth flow diagram of the data processing method based on big data in an embodiment of the present application;
Fig. 6 is a sixth flow diagram of the data processing method based on big data in an embodiment of the present application;
Fig. 7 is a seventh flow diagram of the data processing method based on big data in an embodiment of the present application;
Fig. 8 is an eighth flow diagram of the data processing method based on big data in an embodiment of the present application;
Fig. 9 is a first module diagram of the data processing device based on big data in an embodiment of the present application;
Figure 10 is a second module diagram of the data processing device based on big data in an embodiment of the present application;
Figure 11 is a third module diagram of the data processing device based on big data in an embodiment of the present application;
Figure 12 is a fourth module diagram of the data processing device based on big data in an embodiment of the present application;
Figure 13 is a fifth module diagram of the data processing device based on big data in an embodiment of the present application;
Figure 14 is a sixth module diagram of the data processing device based on big data in an embodiment of the present application;
Figure 15 is a seventh module diagram of the data processing device based on big data in an embodiment of the present application;
Figure 16 is an eighth module diagram of the data processing device based on big data in an embodiment of the present application;
Figure 17 is a ninth module diagram of the data processing device based on big data in an embodiment of the present application;
Figure 18 is a structural schematic diagram of the electronic device in an embodiment of the present application.
Detailed description of the embodiments
To make the purposes, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described below clearly and completely in conjunction with the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the application. Based on the embodiments in the application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the application.
In view of the problems in the related art that the storage efficiency of storing data as files is low, that a file with a small amount of data still occupies an individual HDFS data block, that resource allocation is unreasonable, that the whole HDFS cluster holds an excessive number of data blocks, and that the operation efficiency of subsequent data computation is affected, the application provides a data processing method and device based on big data. Data cleansing is first performed on the original target files under a target directory in the HDFS file system to obtain first target files, whose file size after cleansing does not exceed the first preset reference value but may still be considerably smaller, which would otherwise waste storage resources. All first target files are then divided into multiple combinations according to their file sizes, so that the sum of the file sizes of the first target files in each combination is as close as possible to the first preset reference value, i.e. the maximum capacity of an HDFS data block. Finally, all first target files in each combination are merged into one second target file, which is assigned one HDFS data block, so that the number of HDFS data blocks used to store the original target files is minimized. This achieves better storage efficiency and more reasonable resource allocation, and in turn improves the operation efficiency of subsequent data computation.
To make the storage of massive data more efficient, resource allocation more reasonable, and subsequent computation faster, the application provides an embodiment of a data processing method based on big data. Referring to Fig. 1, the data processing method based on big data specifically includes the following:
Step S101: performing data cleansing on the original target files under a target directory to obtain at least one first target file.
It can be understood that collected data can first be stored into HDFS, and HDFS establishes a target directory for storing the original target files. Preferably, HDFS establishes one target directory per minute, used to store the original target files warehoused in that minute.
As a preferred embodiment, the directory structure of the target directory may include the time at which the directory was established and the corresponding machine room number, for example "/hadoop/accesslog/${1-minute directory}/${machine room id}", a concrete instance of which is "/hadoop/accesslog/201806280955/1034".
It can be understood that the data cleansing of the original target files in the target directory can be performed by an existing big data computing engine, for example the Spark engine. When the data cleansing task to be executed is submitted to the Spark engine, key parameters that control the cleansing output, such as a convergence parameter, can be input at the same time, so as to determine in advance the rough file size and number of the first target files after cleansing.
Specifically, the purpose of the convergence parameter is to preliminarily mitigate the small-file problem that arises after data cleansing. The convergence parameter is calculated as follows:
(1) determine the desired file size A of an original target file after data cleansing; preferably, this desired file size is the first preset reference value; obtain the sum B of the file sizes of all original target files to be cleansed under the target directory;
(2) determine the data compression ratio C of the file format conversion performed during data cleansing, and compute the convergence parameter with the formula B / (A / C). For example, suppose the desired file size after data cleansing is 120 MB (close to the maximum capacity of one HDFS data block), the sum of the file sizes of all original target files to be cleansed under the target directory is 2400 MB, the file format is converted from TXT to ORC during cleansing, and the known data compression ratio between TXT and ORC is 0.25. The formula then gives a convergence parameter of 2400 / (120 / 0.25) = 5: the original target files can generate 5 first target files after data cleansing, and the file size of each first target file does not exceed 120 MB.
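A small sketch of the convergence-parameter formula B / (A / C) above, using the numbers from the example; the function name and the use of a ceiling for non-integer results are illustrative assumptions.

```python
import math

def convergence_parameter(total_input_mb, desired_file_mb, compression_ratio):
    """Convergence parameter = B / (A / C): how many cleaned files to request
    from the cleansing engine so that each lands near the desired size."""
    return math.ceil(total_input_mb / (desired_file_mb / compression_ratio))

# Example from the text: 2400 MB of TXT, 120 MB desired ORC files, TXT-to-ORC ratio 0.25
print(convergence_parameter(2400, 120, 0.25))  # -> 5
```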
Step S102: dividing all first target files into multiple combinations according to the file size of each first target file, wherein the sum of the file sizes of all first target files in each combination does not exceed the first preset reference value and the number of combinations is the minimum among all possible numbers.
It can be understood that, to further address the small-file problem of the first target files, a bin-packing algorithm can be used to merge small files. A small file can be defined as one smaller than the second preset reference value; for example, with an HDFS block size of 128 MB, the second preset reference value can be 80 MB. Since the capacity limit of each HDFS data block is 128 MB, a file below 128 MB occupies one data block, and a file above 128 MB occupies (file size / 128) + 1 data blocks. Because first target files are indivisible, some data blocks would in practice be left largely idle and the storage of first target files would be unreasonable; the HDFS cluster would then hold more data blocks, and the more data blocks the cluster holds, the more resources are consumed and the more subsequent computing performance suffers. Small files therefore need to be merged effectively to reduce the number of data blocks as far as possible.
It can be understood that the bin-packing algorithm can proceed as follows (a code sketch is given after this list):
(1) Initialization: assume the capacity of one box (storage space) is 120 MB (close to the 128 MB block size; files grow slightly after merging, and the storage space size can be chosen flexibly according to the block size).
(2) Sort the files: take the qualifying first target files A (20 MB), B (30 MB), C (65 MB) and D (90 MB) and sort them by file size in descending order, giving D (90 MB), C (65 MB), B (30 MB) and A (20 MB).
(3) Traverse the files and assign boxes:
a. Take the first file D (90 MB), request a box E, and put the file in; the file size in the box is now 90 MB.
b. Take the second file C (65 MB) and try to put it into the first box E. Putting C into E would raise the file size in E to 90 MB + 65 MB = 155 MB, exceeding the box capacity, so it is not feasible; request a new box F, in which the file size is now 65 MB.
c. Take the third file B (30 MB) and, in the order the boxes were requested, try to put it into the first box E. After B is put into E, the file size in E is 90 MB + 30 MB = 120 MB, which does not exceed the box capacity, so it is feasible. If it were not feasible, B (30 MB) would be tried against the next box F; and if no suitable box existed, a new box G would be requested and B (30 MB) put into it.
d. Traverse all remaining files in the same way, packing them in the order the boxes were requested and requesting a new box when no existing box fits, until all files are packed.
e. Filter out boxes containing only one file; these need no merging.
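A minimal first-fit-decreasing sketch of the bin-packing procedure above, assuming a 120 MB box capacity; the file names and sizes mirror the worked example.

```python
def pack_files(files, box_capacity_mb=120):
    """First-fit-decreasing packing of cleaned files into boxes no larger than
    box_capacity_mb; each multi-file box becomes one merge combination.
    `files` is a list of (name, size_mb) tuples."""
    boxes = []  # each box is {"files": [...], "used": total size so far}
    for name, size in sorted(files, key=lambda f: f[1], reverse=True):
        for box in boxes:                      # try boxes in the order they were requested
            if box["used"] + size <= box_capacity_mb:
                box["files"].append(name)
                box["used"] += size
                break
        else:                                  # no existing box fits: request a new one
            boxes.append({"files": [name], "used": size})
    # boxes holding a single file need no merging
    return [b for b in boxes if len(b["files"]) > 1]

print(pack_files([("A", 20), ("B", 30), ("C", 65), ("D", 90)]))
# D (90) and B (30) share one box; C (65) and A (20) share another
```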
Step S103: merging all first target files in each combination to obtain a second target file, wherein each second target file is assigned one data block of a distributed file system.
It can be understood that the file merging interface provided by the Spark computing engine can be used to merge multiple first target files. Specifically:
(1) Move the files to be merged to an intermediate merging directory with the structure "/intermediate directory/merge batch number/tmp/box id/file".
(2) Load each box with the Spark data source API, one DataFrame per box, with a convergence size (coalesce) of 1, and store the data into the intermediate directory with the structure "/intermediate directory/merge batch number/data/file".
(3) Move the merged data back to the original directory.
(4) Delete the intermediate merging directory.
In other embodiments of the application, the file merging function can also be implemented with other third-party computing engines; the application does not specifically limit the method used for file merging or the specific third-party computing engine used, as long as all first target files in each combination can be merged.
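A minimal PySpark sketch of the merge flow described above (load each box as one DataFrame, coalesce to 1, write to the intermediate data directory); the directory layout, ORC input format and function name are assumptions, not the patent's exact interface.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-small-files").getOrCreate()

def merge_box(box_id, batch_no, staging_root="/intermediate_dir"):
    """Merge every file previously moved into <staging>/<batch>/tmp/<box_id>/
    into a single output file under <staging>/<batch>/data/, ready to be
    moved back to the original directory."""
    src = f"{staging_root}/{batch_no}/tmp/{box_id}"
    dst = f"{staging_root}/{batch_no}/data/{box_id}"
    df = spark.read.orc(src)                           # each box loaded as one DataFrame
    df.coalesce(1).write.mode("overwrite").orc(dst)    # convergence size of 1
```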
As can be seen from the above description, the data processing method based on big data provided by the embodiments of the application first performs data cleansing on the original target files under a target directory in the HDFS file system to obtain first target files, which after cleansing are larger and fewer in number. All first target files are then divided into multiple combinations according to their file sizes, so that the sum of the file sizes of the first target files in each combination is as close as possible to the first preset reference value, i.e. the maximum capacity of an HDFS data block. Finally, all first target files in each combination are merged into one second target file, which is assigned one HDFS data block, so that the number of HDFS data blocks used to store the original target files is minimized. This achieves better storage efficiency and more reasonable resource allocation, and in turn improves the operation efficiency of subsequent data computation.
To effectively reduce the number of target files, increase the file size of the target files and remove irrelevant interfering data, an embodiment of the data processing method based on big data of the application may also include a step of creating a temporary directory corresponding to the target directory for data cleansing. Referring to Fig. 2, this step specifically includes the following:
Step S201: establishing a temporary directory corresponding to the target directory according to the file generation time of the original target files and the file warehousing time at which the original target files entered the target directory.
It can be understood that an original target file carries its file generation time, and when the original target file is entered into HDFS the file warehousing time of its entry is also known. For example, if the generation time of an original target file is 14:03 on November 7, 2018 and the time it was entered into HDFS is 14:29 on November 7, 2018, the corresponding target directory can be "src/201811071429" and the corresponding temporary directory can be "accesslog/temp/201811071429/d=181107/h=14/m5=05", where "accesslog/temp/201811071429" corresponds to the target directory, "d=181107" indicates that the file generation date is November 7, 2018, "h=14" indicates that the file generation hour is 14, and "m5=05" indicates that the file generation time falls in the 5-minute interval from minute 00 to minute 05.
Step S202: performing data cleansing on the original target files under the target directory to obtain at least one first target file.
Step S203: storing the at least one first target file obtained after data cleansing into the temporary directory.
It can be understood that an original target file generates at least one first target file after data cleansing, and the first target files are stored in the temporary directory. Specifically, if the generation time of an original target file is 14:03 on November 7, 2018, the time it was entered into HDFS is 14:29 on November 7, 2018, and 3 first target files are generated after data cleansing, their paths under the temporary directory are respectively "accesslog/temp/201811071429/d=181107/h=14/m5=05/201811071429-0.orc", "accesslog/temp/201811071429/d=181107/h=14/m5=05/201811071429-1.orc" and "accesslog/temp/201811071429/d=181107/h=14/m5=05/201811071429-3.orc".
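A small sketch, assuming the naming convention shown in the example above, of deriving the temporary-directory path from the generation and warehousing times; the helper name is hypothetical and the 5-minute bucket labelling is inferred from that single example.

```python
import math
from datetime import datetime

def temp_dir(generated_at: datetime, warehoused_at: datetime) -> str:
    """Build the temp-directory path from the file generation time and its HDFS
    warehousing time. The m5 bucket label is inferred from the example above:
    minute 03 falls in the 00-05 bucket and is labelled 05."""
    m5 = max(5, math.ceil(generated_at.minute / 5) * 5)
    return ("accesslog/temp/{ware}/d={d}/h={h:02d}/m5={m5:02d}"
            .format(ware=warehoused_at.strftime("%Y%m%d%H%M"),
                    d=generated_at.strftime("%y%m%d"),
                    h=generated_at.hour,
                    m5=m5))

print(temp_dir(datetime(2018, 11, 7, 14, 3), datetime(2018, 11, 7, 14, 29)))
# -> accesslog/temp/201811071429/d=181107/h=14/m5=05
```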
To effectively reduce the number of target files and increase their file size, small files with little data are merged reasonably so that the number of HDFS data blocks finally occupied is minimized and subsequent computation is more efficient. An embodiment of the data processing method based on big data of the application may therefore also include a step of merging small files. Referring to Fig. 3, this step specifically includes the following:
Step S301: creating a storage space whose capacity is the first preset reference value.
Step S302: storing one first target file into the storage space.
Step S303: obtaining another first target file and storing it into an existing storage space; if the storage fails, creating a new storage space and storing that first target file into it; and repeating the step of obtaining another first target file until all first target files have been stored into storage spaces.
Step S304: taking the at least one first target file in each storage space as one combination.
It can be understood that, to further address the small-file problem, a bin-packing algorithm can be used to merge small files; a small file can be defined as a file smaller than 80 MB. Since the maximum capacity of each HDFS data block is 128 MB, a file below 128 MB occupies one data block, and a file above 128 MB occupies (file size / 128) + 1 data blocks. The more data blocks the HDFS cluster holds, the more resources are consumed and the more subsequent computing performance suffers, so small files need to be merged effectively to reduce the number of data blocks as far as possible.
It can be understood that the bin-packing algorithm can proceed as follows:
(1) Initialization: assume the capacity of one box is 120 MB (close to the 128 MB block size; files grow slightly after merging), i.e. the first preset reference value is 120 MB, and the size of a single file to be merged is at most 80 MB.
(2) Sort the files: take the qualifying first target files A (20 MB), B (30 MB), C (65 MB) and D (90 MB) and sort them by file size in descending order, giving D (90 MB), C (65 MB), B (30 MB) and A (20 MB).
(3) Traverse the files and assign boxes:
a. Take the first file D (90 MB), request a box E, and put the file in; the file size in the box is now 90 MB.
b. Take the second file C (65 MB) and try to put it into the first box E. Putting C into E would raise the file size in E to 90 MB + 65 MB = 155 MB, exceeding the box capacity, so it is not feasible; request a new box F, in which the file size is now 65 MB.
c. Take the third file B (30 MB) and, in the order the boxes were requested, try to put it into the first box E. After B is put into E, the file size in E is 90 MB + 30 MB = 120 MB, which does not exceed the box capacity, so it is feasible. If it were not feasible, B (30 MB) would be tried against the next box F; and if no suitable box existed, a new box G would be requested and B (30 MB) put into it.
d. Traverse all remaining files in the same way, packing them in the order the boxes were requested and requesting a new box when no existing box fits, until all files are packed.
e. Filter out boxes containing only one file; these need no merging.
In other embodiments of the application, the bin-packing step can also be implemented as follows:
read the small files under the directory, move them to a temporary directory, set the number of partitions to the sum of the file sizes under the temporary directory divided by 128 MB (rounded up), pass the number of partitions to coalesce, and then merge the small files into large files using the Spark external data source API (a sketch is given after the example below).
For example, suppose there are 4 first target files with file sizes of 30 MB, 60 MB, 65 MB and 90 MB respectively:
total file size: 30 + 60 + 65 + 90 = 245 MB;
number of partitions = 245 / 128, rounded up to 2;
the files produced by merging with coalesce are:
file 1 = (60 MB + 90 MB) = 150 MB, occupying 2 blocks;
file 2 = (30 MB + 65 MB) = 95 MB, occupying 1 block.
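A minimal PySpark sketch of the coalesce-based alternative described above; the directory paths and ORC input format are assumptions.

```python
import math
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coalesce-merge").getOrCreate()

def merge_with_coalesce(temp_dir, out_dir, total_size_mb, block_mb=128):
    """Alternative merge: read all small files moved into temp_dir and rewrite
    them into roughly (total size / block size) partitions via coalesce."""
    num_partitions = max(1, math.ceil(total_size_mb / block_mb))
    (spark.read.orc(temp_dir)
          .coalesce(num_partitions)
          .write.mode("overwrite")
          .orc(out_dir))

# Example from the text: 30 + 60 + 65 + 90 = 245 MB -> 2 partitions
merge_with_coalesce("/accesslog/temp/201811071429", "/accesslog/merged", 245)
```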
To merge small files in a targeted way, i.e. so that files that are already large need not be merged while the file size after merging is as close as possible to, but does not exceed, the maximum capacity of one HDFS data block, the number of HDFS data blocks finally occupied is minimized and subsequent computation is more efficient, an embodiment of the data processing method based on big data of the application may also include a step of first judging the file size before merging small files. Referring to Fig. 4, this step specifically includes the following:
Step S401: judging whether the file size of a first target file is greater than the second preset reference value, wherein the second preset reference value is less than the first preset reference value.
Step S402: if so, taking each first target file greater than the second preset reference value as one combination on its own.
Step S403: if not, creating a storage space whose capacity is the first preset reference value.
Step S404: storing one first target file that is less than or equal to the second preset reference value into the storage space.
Step S405: obtaining another first target file that is less than or equal to the second preset reference value and storing it into an existing storage space; if the storage fails, creating a new storage space and storing that first target file into it; and repeating the step of obtaining another such first target file until all first target files less than or equal to the second preset reference value have been stored into storage spaces.
Step S406: taking the at least one first target file in each storage space as one combination.
It can be understood that the second preset reference value is the criterion for judging small files and can be 80 MB: a first target file whose file size is less than 80 MB is defined as a small file and needs to be merged, while a first target file whose file size exceeds 80 MB is already large enough that the sum of its size and that of other first target files would usually exceed the first preset reference value, so it is not combined with others and is directly taken as an individual combination. In other embodiments of the application, the second preset reference value (i.e. the small-file criterion) can also be another value; the application does not specifically limit its specific size.
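A small sketch of the threshold split in steps S401 to S406, assuming 80 MB as the second preset reference value and reusing the pack_files sketch given earlier for the remaining small files; the helper name is illustrative.

```python
def group_by_threshold(files, small_threshold_mb=80, box_capacity_mb=120):
    """Split files on the second preset reference value: files above it each
    form their own combination, files at or below it are packed into boxes
    with pack_files (which only returns boxes that actually need merging)."""
    large = [[name] for name, size in files if size > small_threshold_mb]
    small = [(name, size) for name, size in files if size <= small_threshold_mb]
    packed = [box["files"] for box in pack_files(small, box_capacity_mb)]
    return large + packed
```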
To make the file size and number of the first target files obtained after data cleansing favourable to the efficiency of the subsequent small-file combination, an embodiment of the data processing method based on big data of the application may also include a step of inputting a dedicated parameter during data cleansing to control the output. This step specifically includes: performing data cleansing on the original target files under the target directory, and obtaining the at least one first target file according to the file size of the original target files, the first preset reference value, and the data compression ratio of the file format conversion performed during data cleansing.
To convert the file format of the original target files into a more compact file format during data cleansing, so that subsequent computation is more efficient, an embodiment of the data processing method based on big data of the application may also include a step of converting the file format of the original target files. Referring to Fig. 5, this step specifically includes the following:
Step S501: obtaining the compression number of the first target files according to the file size of the original target files, the first preset reference value, and the data compression ratio of the file format conversion performed during data cleansing.
Step S502: converting the file format of the original target files after data cleansing into a columnar storage format to obtain the at least one first target file, wherein the number of first target files is less than or equal to the compression number.
It can be understood that the purpose of determining the convergence parameter is to preliminarily mitigate the small-file problem that arises after data cleansing. The convergence parameter is calculated as follows:
(1) determine the desired file size A of an original target file after data cleansing, and obtain the sum B of the file sizes of the original target files to be cleansed under the target directory;
(2) determine the data compression ratio C of the file format conversion performed during data cleansing, and compute the convergence parameter with the formula B / (A / C). For example, suppose the desired file size after data cleansing is 120 MB (close to the maximum capacity of one HDFS data block), the sum of the file sizes of all original target files to be cleansed under the target directory is 2400 MB, the file format is converted from TXT to ORC during cleansing, and the known data compression ratio between TXT and ORC is 0.25. The formula then gives a convergence parameter of 2400 / (120 / 0.25) = 5: the original target files can generate 5 first target files after data cleansing, and the file size of each first target file does not exceed 120 MB.
Specifically, ORC (Optimized Row Columnar) is a columnar storage format in the Hadoop ecosystem. It originated from Apache Hive at the beginning of 2013 and was designed to reduce Hadoop storage space and accelerate Hive queries. Similar to Parquet, it is not a purely column-oriented format: the whole table is first divided into row groups, and storage within each row group is by column. ORC files are self-describing, their metadata is serialized with Protocol Buffers, and the data in the file is compressed to reduce storage consumption as far as possible. ORC is also supported by query engines such as Spark SQL and Presto, although Impala does not yet support ORC and still uses Parquet as its main columnar format. In 2015 the ORC project was promoted by the Apache foundation to a top-level Apache project. ORC has the following advantages:
1. ORC is columnar storage, supports many file compression modes, and has a very high compression ratio.
2. Files can be split. Therefore, using ORC as the file storage format of a table in Hive not only saves HDFS storage resources but also reduces the input data volume of query tasks, so fewer MapTasks are used.
3. It provides multiple kinds of indexes, such as row-group indexes and bloom filter indexes.
4. ORC can support complex data structures (such as Map).
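A minimal PySpark sketch of converting cleansed text data to ORC while capping the number of output files at the compression number computed above; the field delimiter and the placeholder cleansing rule are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("txt-to-orc").getOrCreate()

def clean_and_convert(src_txt_dir, dst_orc_dir, compression_count):
    """Convert cleansed text data to the ORC columnar format, emitting at most
    `compression_count` files (the compression number from the formula above).
    The cleansing itself is application-specific and only hinted at here."""
    df = spark.read.csv(src_txt_dir, sep="|", header=False)  # delimiter is an assumption
    cleaned = df.dropna(how="all")                           # placeholder cleansing rule
    cleaned.coalesce(compression_count).write.mode("overwrite").orc(dst_orc_dir)

# e.g. 2400 MB of TXT with a 0.25 TXT-to-ORC ratio and 120 MB target files -> 5 outputs
clean_and_convert("/hadoop/accesslog/201806280955/1034", "/accesslog/cleaned", 5)
```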
To intuitively monitor the entry time, generation time and corresponding warehousing delay of a file, an embodiment of the data processing method based on big data of the application may also include a step of monitoring the data entry delay according to the file warehousing time and file generation time of the first target file. Referring to Fig. 6, this step specifically includes the following:
Step S601: including, in the file name of each first target file, the file warehousing time at which the corresponding original target file was stored into the target directory and a time identifier corresponding to the file generation time of that original target file.
Step S602: obtaining data entry delay information of the original target file according to the file warehousing time and the time identifier.
It can be understood that an original target file carries its file generation time, and when the original target file is entered into HDFS the file warehousing time of its entry is also known. For example, if the generation time of an original target file is 14:03 on November 7, 2018 and the time it was entered into HDFS is 14:29 on November 7, 2018, the corresponding target directory can be "src/201811071429" and the corresponding temporary directory can be "accesslog/temp/201811071429/d=181107/h=14/m5=05". From this it can be seen that the generation time of the original target file is 14:03 on November 7, 2018 and its data entry time is 14:29 on November 7, 2018, i.e. an entry delay of 26 minutes was produced. The application can accordingly monitor the entry delay of each original target file.
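A small sketch of deriving the entry delay from the two timestamps. The assumed file-name layout (warehousing time, then a generation-time identifier, then a sequence number) is illustrative, since the example file names above carry only the warehousing time; the patent only requires that both identifiers be recoverable.

```python
from datetime import datetime

def entry_delay_minutes(file_name):
    """Parse the warehousing time and the generation-time identifier out of a
    first-target-file name and return the entry delay in minutes."""
    warehoused, generated = file_name.split("-")[:2]
    fmt = "%Y%m%d%H%M"
    delta = datetime.strptime(warehoused, fmt) - datetime.strptime(generated, fmt)
    return int(delta.total_seconds() // 60)

print(entry_delay_minutes("201811071429-201811071403-0.orc"))  # -> 26
```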
To check files for duplicates in advance when migrating the cleansed files in the temporary directory to the target directory, an embodiment of the data processing method based on big data of the application may also include a step of checking for duplicates according to the file names in the target directory. Referring to Fig. 7, this step specifically includes the following:
Step S701: including, in the file name of the second target file, the file warehousing time at which the original target files were stored into the target directory.
Step S702: judging whether the formal directory that stores the data blocks contains a data block whose name includes that file warehousing time.
Step S703: if so, deleting all second target files in that data block, and migrating the second target files formed after data cleansing in the corresponding temporary directory into the formal directory.
It can be understood that before the files in the temporary directory are moved to the formal target directory, i.e. before the file data is moved, it is necessary to judge in advance whether any file name in the formal directory begins with the timestamp of this cleansing run. If such files exist (the reason is that a previous cleansing task failed, possibly terminating abnormally after cleansing only part of the data, and this run is a re-scheduling of it), the files under the formal data directory whose names begin with the timestamp of this cleansing run are deleted, and the files in the temporary directory are then moved to the formal data directory. In this way it is guaranteed that data will not be duplicated even if the task is re-scheduled.
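A small sketch of the pre-migration duplicate check, assuming the hdfs command-line client is available; the helper name and the use of subprocess are illustrative choices, not the patent's interface.

```python
import subprocess

def migrate_batch(temp_dir, formal_dir, warehousing_ts):
    """Before moving merged files out of the temp directory, delete any file in
    the formal directory whose name starts with this batch's warehousing
    timestamp (left over from a failed, rescheduled run), then move the new
    files in."""
    ls = subprocess.run(["hdfs", "dfs", "-ls", "-C", formal_dir],
                        capture_output=True, text=True, check=True)
    for path in ls.stdout.splitlines():
        if path.rsplit("/", 1)[-1].startswith(warehousing_ts):
            subprocess.run(["hdfs", "dfs", "-rm", path], check=True)
    subprocess.run(["hdfs", "dfs", "-mv", f"{temp_dir}/*", formal_dir], check=True)

# e.g. migrate_batch("/accesslog/temp/201811071429", "/accesslog/formal", "201811071429")
```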
To improve the processing efficiency of the cleansing task, an embodiment of the data processing method based on big data of the application may also include a step of submitting the data cleansing task to a dynamic resource queue of the cluster for processing. Referring to Fig. 8, this step specifically includes the following:
Step S801: creating a data cleansing task and submitting it to a dynamic resource queue of the cluster.
Step S802: obtaining the data cleansing task from the dynamic resource queue and performing data cleansing on the original target files under the target directory according to the task.
It can be understood that a dynamic resource queue of the cluster can be used to submit this data cleansing task: memory and CPU resources are allocated to the queue, and the queue name is specified when the task is submitted. The application can thus also achieve the following technical effects: when a resource queue is used, the CPU and memory required by this task are exclusive to it; the resources of other tasks are not preempted; and the total resources required by the cluster can be planned per queue.
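A minimal sketch of starting the cleansing job against a dedicated YARN resource queue; the queue name and resource sizes are assumptions.

```python
from pyspark.sql import SparkSession

# Submit the cleansing work under a named dynamic resource queue with bounded
# executor memory and cores, so its resources stay exclusive to this task.
spark = (SparkSession.builder
         .appName("data-cleansing")
         .master("yarn")
         .config("spark.yarn.queue", "cleansing_queue")  # dedicated resource queue (assumed name)
         .config("spark.executor.memory", "4g")
         .config("spark.executor.cores", "2")
         .getOrCreate())
# ... the cleansing task itself (read, clean, write ORC) then runs under this session
```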
To make the storage of massive data more efficient, resource allocation more reasonable, and subsequent computation faster, the application provides an embodiment of a data processing device based on big data that implements all or part of the content of the data processing method based on big data. Referring to Fig. 9, the data processing device based on big data specifically includes the following:
a data cleansing module 10, configured to perform data cleansing on the original target files under a target directory to obtain at least one first target file;
a file combination module 20, configured to divide all first target files into multiple combinations according to the file size of each first target file, wherein the sum of the file sizes of all first target files in each combination does not exceed a first preset reference value and the number of combinations is the minimum among all possible numbers;
a file merging module 30, configured to merge all first target files in each combination to obtain a second target file, wherein each second target file is assigned one data block of a distributed file system.
As can be seen from the above description, the data processing device based on big data provided by the embodiments of the application first performs data cleansing on the original target files under a target directory in the HDFS file system to obtain first target files, which after cleansing are larger and fewer in number. All first target files are then divided into multiple combinations according to their file sizes, so that the sum of the file sizes of the first target files in each combination is as close as possible to the first preset reference value, i.e. the maximum capacity of an HDFS data block. Finally, all first target files in each combination are merged into one second target file, which is assigned one HDFS data block, so that the number of HDFS data blocks used to store the original target files is minimized. This achieves better storage efficiency and more reasonable resource allocation, and in turn improves the operation efficiency of subsequent data computation.
To effectively reduce the number of target files, increase the file size of the target files and remove irrelevant interfering data, in an embodiment of the data processing device based on big data of the application, referring to Fig. 10, the data cleansing module 10 includes:
a temporary directory establishing unit 11, configured to establish a temporary directory corresponding to the target directory according to the file generation time of the original target files and the file warehousing time at which the original target files entered the target directory;
a data cleansing unit 12, configured to perform data cleansing on the original target files under the target directory to obtain at least one first target file;
a temporary directory storage unit 13, configured to store the at least one first target file obtained after data cleansing into the temporary directory.
In order to effectively reduce the number of target files and increase their file size, small files with a small data volume are merged reasonably, so that the number of finally occupied HDFS data blocks is minimized and subsequent computation is more efficient. In an embodiment of the data processing device based on big data of the application, referring to Figure 11, the file combination module 20 includes:
Reference space creating unit 21, configured to create a storage space whose capacity is the first preset reference numerical value.
First object file storage unit 22, configured to store one first object file into the storage space.
Second file destination storage unit 23, configured to obtain another first object file and store it into an existing storage space; if the storage fails, a new storage space is created and the other first object file is stored into it, and obtaining another first object file is repeated until all first object files have been stored into storage spaces.
File destination combination unit 24, configured to take the at least one first object file in each storage space as one combination.
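A minimal sketch of the first-fit grouping that units 21 to 24 describe is given below in plain Python; the 128 MB capacity mirrors the first preset reference numerical value (the HDFS block size), and the file list is assumed for illustration, whereas a real implementation would read the sizes from HDFS.

```python
def group_files_first_fit(file_sizes_mb, capacity_mb=128):
    """Assign files to 'storage spaces' (bins) using first fit.

    Each bin's total size may not exceed capacity_mb, mirroring the
    first preset reference numerical value (the HDFS block size).
    """
    bins = []  # each bin is a list of file sizes
    for size in file_sizes_mb:
        for b in bins:
            if sum(b) + size <= capacity_mb:  # try the existing storage spaces first
                b.append(size)
                break
        else:  # storage failed everywhere: create a new storage space
            bins.append([size])
    return bins

# Hypothetical example: four cleaned files of 60, 65, 30 and 90 MB.
print(group_files_first_fit([60, 65, 30, 90]))  # -> [[60, 65], [30, 90]]
```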
In order to merge small files in a targeted way — that is, a file whose own data volume is already large does not need to be merged, and the file size after merging should be as close as possible to, but not exceed, the maximum capacity of one HDFS data block — so that the number of finally occupied HDFS data blocks is minimized and subsequent computation is more efficient, in an embodiment of the data processing device based on big data of the application, referring to Figure 12, the file combination module 20 includes:
File size judging unit 25, configured to judge whether the file size of the first object file is greater than a second preset reference numerical value, wherein the second preset reference numerical value is less than the first preset reference numerical value.
File arrangement combination unit 26, configured to, when it is determined that the file size of the first object file is greater than the second preset reference numerical value, take each first object file that is greater than the second preset reference numerical value as one combination by itself; otherwise, divide all the first object files into multiple combinations according to the file size of each first object file.
In order to balance the file size and the number of the first object files obtained after data cleansing, and thereby benefit the combination efficiency when the small files are subsequently merged, in an embodiment of the data processing device based on big data of the application, referring to Figure 13, the data cleansing unit 12 includes:
Data cleansing subunit 121, configured to carry out data cleansing on the original target file under the target directory and obtain the at least one first object file according to the file size of the original target file, the first preset reference numerical value and the data compression ratio of the file format conversion performed during data cleansing.
In order to convert the file format of the original target file into a more compact file format during data cleansing, so that subsequent computation is more efficient, in an embodiment of the data processing device based on big data of the application, referring to Figure 14, the data cleansing subunit 121 includes:
Compression number determining subunit 1211, configured to obtain the compression number of the first object files according to the file size of the original target file, the first preset reference numerical value and the data compression ratio of the file format conversion performed during data cleansing.
Data format transform subunit 1212, configured to convert the file format of the original target file after data cleansing into a columnar storage format to obtain the at least one first object file, wherein the number of the first object files is less than or equal to the compression number.
In order to intuitively monitor the warehousing time and generation time of a file and the corresponding storage delay, in an embodiment of the data processing device based on big data of the application, referring to Figure 15, the device further includes:
First file naming module 40, configured to include, in the file name of the first object file, a time identifier corresponding to the file warehousing time at which the original target file was stored into the target directory and to the file generation time of the original target file corresponding to the first object file.
Delay monitoring module 50, configured to obtain data inputting delay information of the original target file according to the file warehousing time and the time identifier.
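A minimal sketch of how such delay information might be derived from a file name is shown below; the file-name layout (warehousing timestamp and generation timestamp joined by underscores) is an assumption made for illustration and is not prescribed by the original description.

```python
from datetime import datetime

def inputting_delay_minutes(file_name):
    """Parse a hypothetical name like 'accesslog_201811071429_201811071424.orc'
    and return how many minutes the data waited before being warehoused."""
    parts = file_name.rsplit(".", 1)[0].split("_")
    warehousing = datetime.strptime(parts[-2], "%Y%m%d%H%M")  # time it entered HDFS
    generated = datetime.strptime(parts[-1], "%Y%m%d%H%M")    # time the log was produced
    return (warehousing - generated).total_seconds() / 60

print(inputting_delay_minutes("accesslog_201811071429_201811071424.orc"))  # -> 5.0
```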
In order to perform file duplicate checking in advance when migrating the files in the temporary directory to the target directory after data cleansing, in an embodiment of the data processing device based on big data of the application, referring to Figure 16, the data cleansing unit 12 includes:
Second file naming subunit 122, configured to include, in the file name of the second file destination, the file warehousing time corresponding to the time at which the original target file was stored into the target directory.
Duplicate checking judgment subunit 123, configured to judge whether the formal directory storing the data blocks contains a data block whose name includes the file warehousing time.
Duplicate checking handling subunit 124, configured to, when it is determined that the formal directory storing the data blocks contains a data block whose name includes the file warehousing time, delete all second file destinations in that data block and migrate the corresponding second file destinations formed after data cleansing in the temporary directory to the formal directory.
In order to improve the processing efficiency of the cleansing task, in an embodiment of the data processing device based on big data of the application, referring to Figure 17, the data cleansing module 10 includes:
Cleansing task submitting unit 14, configured to create a data cleansing task and submit the data cleansing task to the dynamic resource queue of the cluster.
Cleansing task execution unit 15, configured to obtain the data cleansing task from the dynamic resource queue and carry out data cleansing on the original target file under the target directory according to the data cleansing task.
In order to further explain this scheme, the application also provides a specific application example that uses the above data processing device based on big data to realize the data processing method based on big data, which specifically includes the following content:
With the continuous improvement of the requirements of governments and the public on network information security, higher requirements are also put forward for Internet access providers in terms of access log retention and access log query. Relevant network security regulations stipulate that a network operator shall take technical measures to monitor and record the network operation state and network security events, and shall retain the relevant network logs for no less than six months. Under the basic requirement of retaining Internet logs for six months, a solution with higher storage efficiency, more reasonable resource allocation and better computing performance is therefore needed.
1. Collecting data into HDFS:
The HDFS directory is built according to the system timestamp, with one directory per minute. The subsequent cleansing task is scheduled once per minute and reads the directory of that minute. By creating a new directory every minute, it is ensured that data can be cleansed one minute after arrival; if directories were created every five minutes, the data would have to wait at least five minutes before being processed, while creating directories more often than once per minute would only increase complexity. Specifically, the directory structure can be "/hadoop/accesslog/{one-minute directory}/{machine room id}", for example: /hadoop/accesslog/201806280955/1034.
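As a small illustrative sketch (the base path and machine room id follow the example above; the helper name is an assumption), the per-minute directory can be derived from the system timestamp like this:

```python
from datetime import datetime

def minute_directory(room_id, base="/hadoop/accesslog", now=None):
    """Build the per-minute HDFS directory, e.g. /hadoop/accesslog/201806280955/1034."""
    now = now or datetime.now()
    return f"{base}/{now.strftime('%Y%m%d%H%M')}/{room_id}"

print(minute_directory(1034, now=datetime(2018, 6, 28, 9, 55)))
# -> /hadoop/accesslog/201806280955/1034
```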
2. Data cleansing:
(1) The HDFS data directory is read as src/{minute string}, for example the src/201811071429 directory, where 201811071429 is the directory established according to the current system timestamp (accurate to the minute) when the data entered HDFS.
(2) The data is cleansed and ORC-format files are generated in a temporary directory; the coalesce parameter is determined so as to preliminarily solve the small-file problem. The small-file target is to control the file size after cleansing at around 120M. Specifically, the coalesce parameter is determined as follows (a PySpark sketch follows step b):
a. Obtain the total size of all files of the current batch (before cleansing).
b. Determine the coalesce parameter according to the compression ratio. The coalesce parameter formula is: total file size / (120M / 0.25). For example, if there are 1000 small files before cleansing with a total size of 2400M, the coalesce parameter is 2400 / (120 / 0.25) = 5, and the average file size after cleansing is 2400M / 5 = 120M. Since the original files may span multiple time partitions (for example, partitions of 5 minutes), a time partition that receives only a few log lines will produce a small file containing only those lines; in particular, data that is delayed across many time partitions will cause more small files.
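A minimal PySpark sketch of this step, assuming the input is plain text logs and using the 120M target and 0.25 compression ratio from the formula above; the paths and the parsing step are hypothetical.

```python
import math
import subprocess
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clean-one-minute-batch").getOrCreate()

src = "/hadoop/accesslog/201811071429/1034"      # one-minute source directory (example)
tmp = "/hadoop/accesslog_tmp/201811071429/1034"  # temporary directory for cleaned output

# a. total size of the current batch (before cleansing), in MB, via 'hdfs dfs -du -s'
du = subprocess.run(["hdfs", "dfs", "-du", "-s", src], capture_output=True, text=True)
total_mb = int(du.stdout.split()[0]) / 1024 / 1024

# b. coalesce parameter: total file size / (120M / compression ratio of 0.25)
coalesce_n = max(1, math.ceil(total_mb / (120 / 0.25)))

# clean the raw logs (the real parsing/filtering logic is omitted) and
# write ORC into the temporary directory with the computed partition count
df = spark.read.text(src)
df.coalesce(coalesce_n).write.mode("overwrite").orc(tmp)
```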
(3) Each file generated in the temporary directory is given a timestamp as part of its name, i.e. the minute stamp of the directory before cleansing.
(4) The files in the temporary directory are moved to the formal data directory. Before moving the data, it is judged whether any file name in the formal data directory begins with the timestamp of this cleansing (this can happen when a previous run of the task failed, for example when it aborted abnormally after cleansing only part of the data, and the task is now rescheduled); if so, the files under the formal data directory whose names begin with the timestamp of this cleansing are deleted first, and then the files in the temporary directory are moved to the formal data directory. In this way it is guaranteed that data will not be duplicated even if the task is rescheduled. A sketch of this check-and-move step follows.
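The following sketch illustrates that check-and-move with the `hdfs dfs` command line driven from Python; the directory names are the hypothetical ones used above, and error handling is kept minimal.

```python
import subprocess

def run(cmd):
    return subprocess.run(cmd, capture_output=True, text=True)

def move_batch(tmp_dir, formal_dir, batch_ts):
    """Delete any leftover files of this batch in the formal directory, then move."""
    listing = run(["hdfs", "dfs", "-ls", formal_dir])
    for line in listing.stdout.splitlines():
        name = line.split("/")[-1]
        if name.startswith(batch_ts):          # leftover from a failed earlier run
            run(["hdfs", "dfs", "-rm", "-r", f"{formal_dir}/{name}"])
    # move this batch's cleaned files from the temporary directory to the formal one
    run(["hdfs", "dfs", "-mv", f"{tmp_dir}/*", formal_dir])

move_batch("/hadoop/accesslog_tmp/201811071429/1034",
           "/hadoop/accesslog_clean/1034",
           "201811071429")
```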
3. Merging small files:
(1) After cleansing, the files are converted to ORC format, which reduces the storage space by about 75%.
(2) The coalesce parameter is controlled according to the size of the different data sources of each machine room, which preliminarily solves the small-file problem; for details, see the cleansing process above.
(3) After cleansing, the remaining small files are merged using a bin packing algorithm. Small-file merging based on the bin packing algorithm calculates the optimal merge scheme according to the file merging rules, so that the accurate merging of small files is realized while consuming the least cluster resources.
Judgment basis: a file smaller than 80M is defined as a small file and needs to be merged using the bin packing algorithm. The HDFS block size is 128M; a file smaller than 128M occupies 1 block, and a file larger than 128M occupies file size / 128M + 1 blocks (integer division). The more blocks the cluster holds, the greater the resource consumption and the greater the impact on performance, so the purpose of merging small files is precisely to reduce the number of blocks in the cluster as much as possible.
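A small helper capturing this block-counting rule (a sketch only; the 128M block size follows the text above):

```python
def hdfs_blocks(file_mb, block_mb=128):
    """Number of HDFS blocks a file occupies under the rule described above:
    at or below the block size it takes 1 block, otherwise size // block_size + 1."""
    return 1 if file_mb <= block_mb else int(file_mb // block_mb) + 1

# e.g. 95M -> 1 block, 150M -> 2 blocks, 300M -> 3 blocks
print(hdfs_blocks(95), hdfs_blocks(150), hdfs_blocks(300))
```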
There are several schemes for merging small files:
Scheme one: merge all files under the directory
Read all the files under the directory; the number of partitions = the sum of the file sizes under the directory / 128M; pass the number of partitions to coalesce or repartition and merge the small files into large files through Spark's external data source API.
Problems:
a. Merging all the files under the directory involves a large data volume and a large number of files, and requires a great deal of resources.
b. If coalesce is used there is no shuffle and the performance is good, but the data distribution after merging is unbalanced; the merged output may contain files of 10M and 129M, occupying 3 blocks, whereas a better merging result would be 11M and 128M, occupying 2 blocks.
c. If repartition is used the data distribution is uniform, but a large amount of network I/O, disk I/O and CPU resources is consumed.
Scheme two: use coalesce and merge only the small files under the directory
Read the small files under the directory and move them to a temporary directory; the number of partitions = the sum of the file sizes under the temporary directory / 128M; pass the number of partitions to coalesce and merge the small files into large files through Spark's external data source API.
Problem: the resulting file sizes are unreasonable.
For example, with 4 files of 30M, 60M, 65M and 90M, the total file size is 30 + 60 + 65 + 90 = 245M and the number of partitions is 245 / 128 = 2. Using coalesce, the files after merging may be:
File 1 = (60M + 90M) = 150M, occupying 2 blocks
File 2 = (30M + 65M) = 95M, occupying 1 block
Observation shows, however, that the optimized merge scheme is:
File 1 = (30M + 90M) = 120M, occupying 1 block
File 2 = (60M + 65M) = 125M, occupying 1 block
Scheme three: on the basis of scheme two, use a bin packing algorithm
The realization process is as follows (a combined code sketch is given after step g):
a. Initialization: the bin capacity is set to 120M (close to the 128M block size; the merged file may become slightly larger), and the maximum size of a single file to be merged is set to 80M.
b. Take and sort the files: the files that meet the condition are taken out and sorted by file size in descending order.
c. Traverse the files and allocate bins.
Take out the first file, apply for a bin with the capacity set in step a, and put the file into it.
Take out the second file and try to put it into the first bin; if the bin capacity is insufficient, apply for a new bin.
Take out the third file and, in the order in which the bins were applied for, try to put it into a bin; once a suitable bin is found, return; if there is no suitable bin, apply for a new bin.
Traverse all the other files in turn, packing each one in the order in which the bins were applied for, and apply for a new bin whenever no suitable bin exists, until all files have been packed.
Bins containing only one file are filtered out; these do not need to be merged.
d. The files to be merged are moved to the merging middle directory, with the directory structure: /middle directory/merge batch number/tmp/bin id/file.
e. Each bin is loaded using Spark's data source API, one DataFrame per bin, with the coalesce size set to 1; the data is written into the middle directory with the structure /middle directory/merge batch number/data/file.
f. The merged data is moved back to the original directory.
g. The merging middle directory is deleted.
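The sketch below puts steps a to g together in PySpark: first-fit decreasing packing with the 120M capacity and 80M small-file threshold from step a, followed by one coalesce(1) write per bin. The paths, the helper that lists file sizes and the ORC format are assumptions made for the example; the moves of steps f and g are omitted.

```python
import subprocess
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge-small-files").getOrCreate()

BIN_CAPACITY_MB = 120   # step a: bin capacity, slightly under the 128M block size
SMALL_FILE_MB = 80      # step a: only files below this size are merged

def list_files_with_sizes(directory):
    """List (path, size_mb) pairs under an HDFS directory via 'hdfs dfs -ls' (a sketch)."""
    out = subprocess.run(["hdfs", "dfs", "-ls", directory],
                         capture_output=True, text=True).stdout
    files = []
    for line in out.splitlines():
        cols = line.split()
        if len(cols) >= 8 and cols[4].isdigit():
            files.append((cols[7], int(cols[4]) / 1024 / 1024))
    return files

def pack_bins(files):
    """Steps b-c: sort the small files by size descending, then first-fit into bins."""
    small = [f for f in files if f[1] < SMALL_FILE_MB]
    bins, totals = [], []
    for path, size in sorted(small, key=lambda f: f[1], reverse=True):
        for i, total in enumerate(totals):
            if total + size <= BIN_CAPACITY_MB:
                bins[i].append(path)
                totals[i] = total + size
                break
        else:
            bins.append([path])
            totals.append(size)
    return [b for b in bins if len(b) > 1]   # a single-file bin needs no merging

def merge_directory(src_dir, middle_dir):
    for bin_id, paths in enumerate(pack_bins(list_files_with_sizes(src_dir))):
        df = spark.read.format("orc").load(paths)           # step e: one DataFrame per bin
        out = f"{middle_dir}/data/bin_{bin_id}"
        df.coalesce(1).write.mode("overwrite").orc(out)     # write each bin as one file
        # steps f-g (moving merged data back and deleting the middle directory) omitted
```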
4. Submitting the task to a resource queue
Memory and CPU resources are mainly allocated to the queue, and the queue name is specified when the task is submitted.
(1) When no resource queue is used, tasks interfere with each other's resources; a task that consumes a large amount of resources can preempt the resources of other tasks, causing other tasks to run very slowly or fail to obtain resources.
(2) When a resource queue is used, the CPU and memory required by this task are guaranteed to be exclusive, the resources of other tasks are not preempted, and the total resources required by the cluster can be planned according to the queues.
5. Data delay monitoring
The file name carries the system timestamp of the moment at which the data was loaded into HDFS, so that the data delay can be monitored.
The embodiments of the application also provide a specific embodiment of an electronic device capable of realizing all the steps of the data processing method based on big data in the above embodiments. Referring to Figure 18, the electronic device specifically includes the following content:
a processor 601, a memory 602, a communication interface (Communications Interface) 603 and a bus 604;
wherein the processor 601, the memory 602 and the communication interface 603 complete mutual communication through the bus 604; the communication interface 603 is used to realize information transmission between the data processing device based on big data, the online operation system, the client device and other participating mechanisms;
the processor 601 is used to call the computer program in the memory 602, and when the processor executes the computer program, all the steps of the data processing method based on big data in the above embodiments are realized; for example, when executing the computer program the processor realizes the following steps:
Step S101: carrying out data cleansing on the original target file under the target directory to obtain at least one first object file.
Step S102: dividing all the first object files into multiple combinations according to the file size of each first object file, wherein the sum of the file sizes of all the first object files in each combination does not exceed the first preset reference numerical value and the number of combinations is the minimum value among all possible numbers.
Step S103: merging all the first object files in each combination to obtain a second file destination, wherein the second file destination is correspondingly provided with one data block of a distributed file system.
As can be seen from the above description, the electronic device provided by the embodiments of the present application can first carry out data cleansing on the original target files under the target directory in the HDFS file system to obtain the first object files, where the file size of each first object file after data cleansing is larger and the number of files is smaller; it then divides all the first object files into multiple combinations according to the file size of each first object file, so that the sum of the file sizes of all the first object files in each combination is as close as possible to the first preset reference numerical value, i.e. the maximum capacity of each data block in HDFS; and it finally merges all the first object files in each combination to obtain a second file destination that is correspondingly provided with one HDFS data block, so that the number of HDFS data blocks used to store the original target files is minimized, better storage efficiency is achieved, resource allocation is more reasonable, and the operation efficiency of subsequent data calculation is improved.
The embodiments of the application also provide a computer-readable storage medium capable of realizing all the steps of the data processing method based on big data in the above embodiments. A computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, all the steps of the data processing method based on big data in the above embodiments are realized; for example, when the processor executes the computer program, the following steps are realized:
Step S101: carrying out data cleansing on the original target file under the target directory to obtain at least one first object file.
Step S102: dividing all the first object files into multiple combinations according to the file size of each first object file, wherein the sum of the file sizes of all the first object files in each combination does not exceed the first preset reference numerical value and the number of combinations is the minimum value among all possible numbers.
Step S103: merging all the first object files in each combination to obtain a second file destination, wherein the second file destination is correspondingly provided with one data block of a distributed file system.
As can be seen from the above description, the computer-readable storage medium provided by the embodiments of the present application can first carry out data cleansing on the original target files under the target directory in the HDFS file system to obtain the first object files, where the file size of each first object file after data cleansing is larger and the number of files is smaller; it then divides all the first object files into multiple combinations according to the file size of each first object file, so that the sum of the file sizes of all the first object files in each combination is as close as possible to the first preset reference numerical value, i.e. the maximum capacity of each data block in HDFS; and it finally merges all the first object files in each combination to obtain a second file destination that is correspondingly provided with one HDFS data block, so that the number of HDFS data blocks used to store the original target files is minimized, better storage efficiency is achieved, resource allocation is more reasonable, and the operation efficiency of subsequent data calculation is improved.
All the embodiments in this specification are described in a progressive manner; for the same or similar parts between the embodiments, reference may be made to each other, and each embodiment focuses on its differences from the other embodiments. In particular, for the hardware-plus-program embodiments, since they are basically similar to the method embodiments, the description is relatively simple, and for the relevant parts reference may be made to the description of the method embodiments.
Specific embodiments of this specification have been described above. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Although the application provides the method operation steps described in the embodiments or flowcharts, more or fewer operation steps may be included on the basis of conventional or non-inventive labour. The order of steps enumerated in the embodiments is only one of many possible execution orders and does not represent the only execution order. When an actual device or client product is executed, the steps may be executed in the order shown in the embodiments or drawings, or executed in parallel (for example, in a parallel processor or multithreaded environment).
The system, device, module or unit illustrated in the above embodiments can be realized by a computer chip or an entity, or by a product having a certain function. A typical realization device is a computer. Specifically, the computer may be, for example, a personal computer, a laptop computer, a vehicle-mounted human-computer interaction device, a cellular phone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an electronic mail device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
These computer program instructions may also be stored in a computer-readable memory capable of guiding a computer or other programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, the instruction device realizing the functions specified in one or more flows of the flowchart and/or one or more boxes of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thus provide steps for realizing the functions specified in one or more flows of the flowchart and/or one or more boxes of the block diagram.
In a typical configuration, a computing device includes one or more processors (CPU), input/output interfaces, network interfaces and memory.
The memory may include a non-persistent memory in a computer-readable medium, a random access memory (RAM) and/or a non-volatile memory, such as a read-only memory (ROM) or a flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassette, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
Those skilled in the art should understand that the embodiments of this specification may be provided as a method, a system or a computer program product. Therefore, the embodiments of this specification may take the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware aspects.
The embodiments of this specification may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures and the like that perform specific tasks or implement specific abstract data types. The embodiments of this specification may also be practiced in distributed computing environments in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in local and remote computer storage media including storage devices.
All the embodiments in this specification are described in a progressive manner; for the same or similar parts between the embodiments, reference may be made to each other, and each embodiment focuses on its differences from the other embodiments. In particular, for the system embodiments, since they are basically similar to the method embodiments, the description is relatively simple, and for the relevant parts reference may be made to the description of the method embodiments. In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" means that the specific features, structures, materials or characteristics described in connection with the embodiment or example are included in at least one embodiment or example of the embodiments of this specification. In this specification, the schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Moreover, the described specific features, structures, materials or characteristics may be combined in any suitable manner in any one or more embodiments or examples. In addition, where there is no contradiction, those skilled in the art may combine the different embodiments or examples described in this specification and the features of the different embodiments or examples.
The above are only the embodiments of this specification and are not intended to limit the embodiments of this specification. For those skilled in the art, the embodiments of this specification may have various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principle of the embodiments of this specification shall be included within the scope of the claims of the embodiments of this specification.

Claims (20)

1. A data processing method based on big data, characterized by comprising:
carrying out data cleansing on the original target file under a target directory to obtain at least one first object file;
dividing all the first object files into multiple combinations according to the file size of each first object file, wherein the sum of the file sizes of all the first object files in each combination does not exceed a first preset reference numerical value and the number of combinations is the minimum value among all possible numbers;
merging all the first object files in each combination to obtain a second file destination, wherein the second file destination is correspondingly provided with one data block of a distributed file system.
2. The data processing method according to claim 1, characterized in that the carrying out data cleansing on the original target file under the target directory to obtain at least one first object file comprises:
establishing a temporary directory corresponding to the target directory according to the file generation time of the original target file and the file warehousing time at which the original target file entered the target directory;
carrying out data cleansing on the original target file under the target directory to obtain at least one first object file;
storing the at least one first object file obtained after data cleansing into the temporary directory.
3. The data processing method according to claim 1, characterized in that the dividing all the first object files into multiple combinations according to the file size of each first object file comprises:
creating a storage space whose capacity is the first preset reference numerical value;
storing one first object file into the storage space;
obtaining another first object file and storing the other first object file into an existing storage space; if the storage fails, creating a new storage space and storing the other first object file into it, and repeating the obtaining of another first object file until all first object files have been stored into storage spaces;
taking the at least one first object file in each storage space as one combination.
4. The data processing method according to claim 1, characterized in that the dividing all the first object files into multiple combinations according to the file size of each first object file comprises:
judging whether the file size of the first object file is greater than a second preset reference numerical value, wherein the second preset reference numerical value is less than the first preset reference numerical value;
if so, taking each first object file that is greater than the second preset reference numerical value as one combination;
if not, creating a storage space whose capacity is the first preset reference numerical value;
storing one first object file that is less than or equal to the second preset reference numerical value into the storage space;
obtaining another first object file that is less than or equal to the second preset reference numerical value and storing it into an existing storage space; if the storage fails, creating a new storage space and storing the other first object file that is less than or equal to the second preset reference numerical value into it, and repeating the obtaining of another first object file that is less than or equal to the second preset reference numerical value until all first object files less than or equal to the second preset reference numerical value have been stored into storage spaces;
taking the at least one first object file in each storage space as one combination.
5. The data processing method according to claim 2, characterized in that the carrying out data cleansing on the original target file under the target directory to obtain at least one first object file comprises:
carrying out data cleansing on the original target file under the target directory and obtaining the at least one first object file according to the file size of the original target file, the first preset reference numerical value and the data compression ratio of the file format conversion performed during data cleansing.
6. The data processing method according to claim 5, characterized in that the obtaining the at least one first object file according to the file size of the original target file, the first preset reference numerical value and the data compression ratio of the file format conversion performed during data cleansing comprises:
obtaining the compression number of the first object files according to the file size of the original target file, the first preset reference numerical value and the data compression ratio of the file format conversion performed during data cleansing;
converting the file format of the original target file after data cleansing into a columnar storage format to obtain the at least one first object file, wherein the number of the first object files is less than or equal to the compression number.
7. The data processing method according to claim 1, characterized by further comprising:
including, in the file name of the first object file, a time identifier corresponding to the file warehousing time at which the original target file was stored into the target directory and to the file generation time of the original target file corresponding to the first object file;
obtaining data inputting delay information of the original target file according to the file warehousing time and the time identifier.
8. The data processing method according to claim 2, characterized in that, after the carrying out data cleansing on the original target file under the target directory to obtain at least one first object file, the method comprises:
including, in the file name of the second file destination, the file warehousing time corresponding to the time at which the original target file was stored into the target directory;
judging whether the formal directory storing the data blocks contains a data block whose name includes the file warehousing time;
if so, deleting all second file destinations in that data block, and migrating the second file destinations formed after data cleansing in the corresponding temporary directory into the corresponding data block of the formal directory.
9. The data processing method according to claim 1, characterized in that the carrying out data cleansing on the original target file under the target directory comprises:
creating a data cleansing task and submitting the data cleansing task to a dynamic resource queue of a cluster;
obtaining the data cleansing task from the dynamic resource queue and carrying out data cleansing on the original target file under the target directory according to the data cleansing task.
10. A data processing device based on big data, characterized by comprising:
a data cleansing module, configured to carry out data cleansing on the original target file under a target directory to obtain at least one first object file;
a file combination module, configured to divide all the first object files into multiple combinations according to the file size of each first object file, wherein the sum of the file sizes of all the first object files in each combination does not exceed a first preset reference numerical value and the number of combinations is the minimum value among all possible numbers;
a file merging module, configured to merge all the first object files in each combination to obtain a second file destination, wherein the second file destination is correspondingly provided with one data block of a distributed file system.
11. The data processing device according to claim 10, characterized in that the data cleansing module comprises:
a temp directory establishing unit, configured to establish a temporary directory corresponding to the target directory according to the file generation time of the original target file and the file warehousing time at which the original target file entered the target directory;
a data cleansing unit, configured to carry out data cleansing on the original target file under the target directory to obtain at least one first object file;
a temp directory storage unit, configured to store the at least one first object file obtained after data cleansing into the temporary directory.
12. The data processing device according to claim 10, characterized in that the file combination module comprises:
a reference space creating unit, configured to create a storage space whose capacity is the first preset reference numerical value;
a first object file storage unit, configured to store one first object file into the storage space;
a second file destination storage unit, configured to obtain another first object file and store the other first object file into an existing storage space; if the storage fails, a new storage space is created and the other first object file is stored into it, and obtaining another first object file is repeated until all first object files have been stored into storage spaces;
a file destination combination unit, configured to take the at least one first object file in each storage space as one combination.
13. The data processing device according to claim 10, characterized in that the file combination module comprises:
a file size judging unit, configured to judge whether the file size of the first object file is greater than a second preset reference numerical value, wherein the second preset reference numerical value is less than the first preset reference numerical value;
a file arrangement combination unit, configured to, when it is determined that the file size of the first object file is greater than the second preset reference numerical value, take each first object file that is greater than the second preset reference numerical value as one combination; otherwise, divide all the first object files into multiple combinations according to the file size of each first object file.
14. The data processing device according to claim 11, characterized in that the data cleansing unit comprises:
a data cleansing subunit, configured to carry out data cleansing on the original target file under the target directory and obtain the at least one first object file according to the file size of the original target file, the first preset reference numerical value and the data compression ratio of the file format conversion performed during data cleansing.
15. The data processing device according to claim 14, characterized in that the data cleansing subunit comprises:
a compression number determining subunit, configured to obtain the compression number of the first object files according to the file size of the original target file, the first preset reference numerical value and the data compression ratio of the file format conversion performed during data cleansing;
a data format transform subunit, configured to convert the file format of the original target file after data cleansing into a columnar storage format to obtain the at least one first object file, wherein the number of the first object files is less than or equal to the compression number.
16. The data processing device according to claim 10, characterized by further comprising:
a first file naming module, configured to include, in the file name of the first object file, a time identifier corresponding to the file warehousing time at which the original target file was stored into the target directory and to the file generation time of the original target file corresponding to the first object file;
a delay monitoring module, configured to obtain data inputting delay information of the original target file according to the file warehousing time and the time identifier.
17. The data processing device according to claim 11, characterized in that the data cleansing unit comprises:
a second file naming subunit, configured to include, in the file name of the second file destination, the file warehousing time corresponding to the time at which the original target file was stored into the target directory;
a duplicate checking judgment subunit, configured to judge whether the formal directory storing the data blocks contains a data block whose name includes the file warehousing time;
a duplicate checking handling subunit, configured to, when it is determined that the formal directory storing the data blocks contains a data block whose name includes the file warehousing time, delete all second file destinations in that data block and migrate the second file destinations formed after data cleansing in the corresponding temporary directory into the corresponding data block of the formal directory.
18. The data processing device according to claim 10, characterized in that the data cleansing module comprises:
a cleansing task submitting unit, configured to create a data cleansing task and submit the data cleansing task to a dynamic resource queue of a cluster;
a cleansing task execution unit, configured to obtain the data cleansing task from the dynamic resource queue and carry out data cleansing on the original target file under the target directory according to the data cleansing task.
19. An electronic device, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that, when executing the program, the processor realizes the steps of the data processing method based on big data according to any one of claims 1 to 9.
20. A computer-readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the steps of the data processing method based on big data according to any one of claims 1 to 9 are realized.
CN201910525331.XA 2019-06-18 2019-06-18 Data processing method and device based on big data Pending CN110321329A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910525331.XA CN110321329A (en) 2019-06-18 2019-06-18 Data processing method and device based on big data

Publications (1)

Publication Number Publication Date
CN110321329A true CN110321329A (en) 2019-10-11

Family

ID=68119686

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910525331.XA Pending CN110321329A (en) 2019-06-18 2019-06-18 Data processing method and device based on big data

Country Status (1)

Country Link
CN (1) CN110321329A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120284317A1 (en) * 2011-04-26 2012-11-08 Dalton Michael W Scalable Distributed Metadata File System using Key-Value Stores
CN103500089A (en) * 2013-09-18 2014-01-08 北京航空航天大学 Small file storage system suitable for Mapreduce calculation model
CN104133882A (en) * 2014-07-28 2014-11-05 四川大学 HDFS (Hadoop Distributed File System)-based old file processing method
CN105677842A (en) * 2016-01-05 2016-06-15 北京汇商融通信息技术有限公司 Log analysis system based on Hadoop big data processing technique
CN109446165A (en) * 2018-10-11 2019-03-08 中盈优创资讯科技有限公司 The file mergences method and device of big data platform

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110737683A (en) * 2019-10-18 2020-01-31 成都四方伟业软件股份有限公司 Automatic partitioning method and device for extraction-based business intelligent analysis platforms
CN111338780A (en) * 2020-02-28 2020-06-26 中国工商银行股份有限公司 Transmission method and device for concurrent files
CN111338780B (en) * 2020-02-28 2023-03-31 中国工商银行股份有限公司 Transmission method and device for concurrent files
CN111488323A (en) * 2020-04-14 2020-08-04 中国农业银行股份有限公司 Data processing method and device and electronic equipment
CN111488323B (en) * 2020-04-14 2023-06-13 中国农业银行股份有限公司 Data processing method and device and electronic equipment
CN113568877A (en) * 2020-04-28 2021-10-29 杭州海康威视数字技术股份有限公司 File merging method and device, electronic equipment and storage medium
CN111897493A (en) * 2020-07-15 2020-11-06 杭州海康威视系统技术有限公司 Storage space management method and device, electronic equipment and storage medium
CN111897493B (en) * 2020-07-15 2023-03-10 杭州海康威视系统技术有限公司 Storage space management method and device, electronic equipment and storage medium
CN112035057A (en) * 2020-07-24 2020-12-04 武汉达梦数据库有限公司 Hive file merging method and device
CN112035057B (en) * 2020-07-24 2022-06-21 武汉达梦数据库股份有限公司 Hive file merging method and device
CN112231293A (en) * 2020-09-14 2021-01-15 杭州数梦工场科技有限公司 File reading method and device, electronic equipment and storage medium
CN112181920A (en) * 2020-09-24 2021-01-05 陕西天行健车联网信息技术有限公司 Internet of vehicles big data high-performance compression storage method and system
CN112395252A (en) * 2020-10-10 2021-02-23 广州三七互娱科技有限公司 File merging method and device and electronic equipment
CN112965939A (en) * 2021-02-07 2021-06-15 中国工商银行股份有限公司 File merging method, device and equipment
CN112948330A (en) * 2021-02-26 2021-06-11 拉卡拉支付股份有限公司 Data merging method, device, electronic equipment, storage medium and program product
CN113111038A (en) * 2021-03-31 2021-07-13 北京达佳互联信息技术有限公司 File storage method, device, server and storage medium
CN113111038B (en) * 2021-03-31 2024-01-19 北京达佳互联信息技术有限公司 File storage method, device, server and storage medium
CN113704811A (en) * 2021-07-16 2021-11-26 杭州医康慧联科技股份有限公司 Data value management method
CN113946289A (en) * 2021-09-23 2022-01-18 南京医基云医疗数据研究院有限公司 File merging method and device based on Spark calculation engine, storage medium and equipment
CN114706861A (en) * 2022-06-08 2022-07-05 天津南大通用数据技术股份有限公司 Method for dynamically grouping and storing in column-based storage engine
CN114706861B (en) * 2022-06-08 2022-09-16 天津南大通用数据技术股份有限公司 Method for dynamically grouping and storing in column-based storage engine
CN115632877A (en) * 2022-12-01 2023-01-20 成都九洲电子信息系统股份有限公司 Large-scale PCAP data correctness verification method, system and storage medium
CN115632877B (en) * 2022-12-01 2023-04-07 成都九洲电子信息系统股份有限公司 Large-scale PCAP data correctness verification method, system and storage medium
CN116069741A (en) * 2023-02-20 2023-05-05 北京集度科技有限公司 File processing method, apparatus and computer program product

Similar Documents

Publication Publication Date Title
CN110321329A (en) Data processing method and device based on big data
US9740706B2 (en) Management of intermediate data spills during the shuffle phase of a map-reduce job
Gai et al. Cost-aware multimedia data allocation for heterogeneous memory using genetic algorithm in cloud computing
US8224825B2 (en) Graph-processing techniques for a MapReduce engine
JP5950285B2 (en) A method for searching a tree using an instruction that operates on data having a plurality of predetermined bit widths, a computer for searching a tree using the instruction, and a computer thereof program
Tao et al. Dynamic resource allocation algorithm for container-based service computing
CN109144699A (en) Distributed task dispatching method, apparatus and system
Halim et al. A MapReduce-based maximum-flow algorithm for large small-world network graphs
Vo et al. A multi-core approach to efficiently mining high-utility itemsets in dynamic profit databases
Jiang et al. Parallel K-Medoids clustering algorithm based on Hadoop
CN106202092A (en) The method and system that data process
Bala et al. P-ETL: Parallel-ETL based on the MapReduce paradigm
CN108415912A (en) Data processing method based on MapReduce model and equipment
CN108062384A (en) The method and apparatus of data retrieval
Sontakke et al. Optimization of hadoop mapreduce model in cloud computing environment
Kim et al. Towards the design of a system and a workflow model for medical big data processing in the hybrid cloud
Choi et al. Intelligent reconfigurable method of cloud computing resources for multimedia data delivery
CN103324577A (en) Large-scale itemizing file distributing system based on minimum IO access conflict and file itemizing
Vasiliu et al. A hybrid scheduler for many task computing in big data systems
Belcastro et al. A parallel library for social media analytics
Grossman What is analytic infrastructure and why should you care?
CN103064862A (en) Method and device of multi-index-sorting data processing
Ayall et al. Offstreamng: Partial stream hybrid graph edge partitioning based on neighborhood expansion and greedy heuristic
Hussain et al. A novel approach of fair scheduling to enhance performance of hadoop distributed file system
Presser et al. Partitioning-Aware Performance Modeling of Distributed Graph Processing Tasks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20191011)