CN107346312A

CN107346312A - Big data processing method and system

Info

Publication number: CN107346312A
Application number: CN201610294824.3A
Authority: CN
Inventors: 岑春祥; 王升元; 苏文平; 郄威; 孟利青
Original assignee: China Mobile Group Inner Mongolia Co Ltd
Current assignee: China Mobile Group Inner Mongolia Co Ltd
Priority date: 2016-05-05
Filing date: 2016-05-05
Publication date: 2017-11-14

Abstract

The invention discloses a big data processing method, comprising: obtaining an original file that contains big data files of different types; splitting the original file into multiple sub-data files of different categories according to the type of the big data files; and assigning the sub-data files to corresponding servers according to their categories and processing them simultaneously on the different servers. The invention also discloses a big data processing system.

Description

Big data processing method and system

Technical field

The present invention relates to the technical field of information processing, and in particular to a big data processing method and system.

Background

In many application scenarios, the following data processing flow is common: a sender stores data files of several different types in a folder in a certain format, compresses the folder, and sends it to a recipient. After receiving the compressed file, the recipient parses its contents and performs logical processing.

In this flow, if the data files are not very large and the recipient has no strict requirement on processing time, a single server or a single thread can handle the work. The system still runs normally in this case; the recipient may simply take longer to process the file data. In practice, however, large-volume data processing is common, for example: school education staff reporting student data level by level to the education bureau, log processing for large websites, and data synchronization between two large systems. In these cases the transmitted file data is very large or the number of files is very high, and the recipient also has strict requirements on processing time, for example: the recipient requires that the file data sent by the sender be processed within 1 minute (or less). A processing system that relies on a single server or a single thread cannot meet this demand.

In addition, in many cases the sender transmits file data to the recipient on a schedule, for example once every 5 minutes, and the maximum data transfer delay the recipient can tolerate is limited. If the recipient cannot finish processing the data within the scheduled interval, a vicious circle forms: the data from the previous cycle has not yet been processed when new data arrives, so the recipient's data delay keeps growing until the system eventually crashes.

To address these problems, the prior art uses the K-means (K-MEANS) algorithm to cluster big data. However, that processing usually assumes the number of data items n is a fixed value. When n varies, each change in n during processing, for example n increasing by 1 because a new data record has been added to the data to be processed, requires the whole algorithm to be re-executed from start to finish. This greatly increases the work the system must perform, and the data may well not be processed within the specified time, causing considerable delay for the recipient.

In summary, the prior art offers no effective solution for minimizing the system's operations, processing large-volume data within the specified time, and reducing delay in data processing.

Summary of the invention

In view of this, embodiments of the present invention aim to provide a big data processing method and system that can process large-volume data quickly and effectively, so as to solve the processing delay and system crashes caused by failing to process large-volume data within the specified time.

To achieve the above purpose, the technical solution of the embodiments of the present invention is realized as follows:

An embodiment of the present invention provides a big data processing method, the method comprising:

Obtaining an original file that contains big data files of different types;

Splitting the original file into multiple sub-data files of different categories according to the type of the big data files;

Assigning the sub-data files to corresponding servers according to their categories, and processing the sub-data files simultaneously on the different servers.

In the above solution, obtaining the original file that contains big data files of different types comprises:

Creating a table that implements layout functions through secondary development;

Establishing an association between the display logic of the table and an in-memory database;

Identifying an operation instruction on the table and, according to the operation instruction and the association, obtaining the original file that contains big data files of different types from the in-memory database and presenting it in table form;

Wherein the in-memory database is used to store big data files of different types.

In the above solution, after establishing the association between the display logic of the table and the in-memory database, the method further comprises:

Building an index on the data in the in-memory database according to the row numbers of the table, and reading the corresponding data in the in-memory database according to the index.

In the above solution, when splitting the original file into multiple sub-data files of different categories, the method further comprises:

Collecting the SQL statement corresponding to the split operation on the original file;

Parsing the data table in the SQL statement and the fields and field values in the data table;

Automatically generating code according to the data table and the fields and field values in the data table, compiling the generated code into a dynamic link library file or an executable program file, and executing it to split the original file that contains the big data files.
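The collect, parse, and generate steps above can be sketched as follows. This is only an illustration under stated assumptions: the SQL grammar is restricted to a single hypothetical form (`SELECT * FROM <table> WHERE <field> = '<value>'`), the names `parse_split_sql`, `generate_split_function`, and `keep` are invented for the example, and Python's built-in `compile` and `exec` stand in for compiling to a dynamic link library or executable file.

```python
import re

# Hypothetical, deliberately narrow grammar:
#   SELECT * FROM <table> WHERE <field> = '<value>'
_SQL_PATTERN = re.compile(
    r"^\s*SELECT\s+\*\s+FROM\s+(?P<table>\w+)\s+"
    r"WHERE\s+(?P<field>\w+)\s*=\s*'(?P<value>[^']*)'\s*;?\s*$",
    re.IGNORECASE,
)


def parse_split_sql(sql):
    """Extract the data table, field, and field value from one SQL statement."""
    match = _SQL_PATTERN.match(sql)
    if match is None:
        raise ValueError(f"unsupported SQL statement: {sql!r}")
    return match.group("table"), match.group("field"), match.group("value")


def generate_split_function(field, value):
    """Generate source code for a row filter, compile it, and load it."""
    # repr() quotes the parsed strings before they are embedded in the source.
    source = (
        "def keep(row):\n"
        f"    return row.get({field!r}) == {value!r}\n"
    )
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["keep"]


table, field, value = parse_split_sql("SELECT * FROM files WHERE file_type = 'doc'")
keep = generate_split_function(field, value)
rows = [
    {"name": "a.doc", "file_type": "doc"},
    {"name": "b.jpg", "file_type": "jpg"},
]
doc_rows = [row for row in rows if keep(row)]
```

Generating a dedicated filter per statement mirrors the idea of compiling the split logic once and then running it over the whole file, rather than re-interpreting the SQL for every record.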

In the above solution, when performing the split operation, if the quantity requested to be split exceeds a predetermined limit, the user's historical data is queried;

Based on the historical data, the corresponding category is looked up among the categories obtained by clustering.

An embodiment of the present invention also provides a big data processing system, the system comprising: an acquiring unit, a splitting unit, and a processing unit; wherein,

The acquiring unit is configured to obtain an original file that contains big data files of different types;

The splitting unit is configured to split the original file into multiple sub-data files of different categories according to the type of the big data files;

The processing unit is configured to assign the sub-data files to corresponding servers according to their categories, and to process the sub-data files simultaneously on the different servers.

In the above solution, the acquiring unit comprises:

A table creating unit, configured to create a table that implements layout functions through secondary development;

An association establishing unit, configured to establish an association between the display logic of the table and an in-memory database;

A first processing unit, configured to identify an operation instruction on the table and, according to the operation instruction and the association, obtain the original file that contains big data files of different types from the in-memory database and present it in table form;

In the above solution, the acquiring unit further comprises an index establishing unit, configured to, after the association establishing unit establishes the association between the display logic of the table and the in-memory database, build an index on the data in the in-memory database according to the row numbers of the table and read the corresponding data in the in-memory database according to the index.

In the above solution, when the splitting unit splits the original file into multiple sub-data files of different categories, the system further comprises:

A collecting unit, configured to collect the SQL statement corresponding to the split operation on the original file;

A parsing unit, configured to parse the data table in the SQL statement and the fields and field values in the data table;

A second processing unit, configured to automatically generate code according to the data table and the fields and field values in the data table, compile the generated code into a dynamic link library file or an executable program file, and execute it to split the original file that contains the big data files.

The big data processing method and system provided by the embodiments of the present invention obtain an original file that contains big data files of different types; split the original file into multiple sub-data files of different categories according to the type of the big data files; and assign the sub-data files to corresponding servers according to their categories and process them simultaneously on the different servers. In this way, large-volume data can be processed quickly and efficiently within the specified time, reducing delay in data processing.

In addition, the embodiments of the present invention allow large-volume data to be viewed, analyzed, and otherwise operated on through a table, while also supporting functions such as global real-time sorting and presentation of large-volume data. Furthermore, the embodiments split big data automatically by analyzing the fields of SQL statements, which ensures both the efficiency of splitting and the validity of the data. Finally, the corresponding category is looked up among the clustered categories based on the user's historical data, so that a dynamic quota is obtained according to a predetermined mapping rule, saving hardware resources.

Brief description of the drawings

Fig. 1 is a schematic flowchart of the big data processing method according to an embodiment of the present invention;

Fig. 2 is a schematic flowchart of a specific implementation of the big data processing method according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of the big data processing system according to an embodiment of the present invention.

Detailed description

In order to understand the features and technical content of the embodiments of the present invention more fully, their implementation is described in detail below with reference to the accompanying drawings. The drawings are for reference and illustration only and are not intended to limit the present invention.

As shown in Fig. 1, the big data processing method in an embodiment of the present invention comprises the following steps:

Step 101: Obtain an original file that contains big data files of different types;

Step 101 specifically comprises:

S1011: Create a table that implements layout functions through secondary development;

Here, the layout functions include but are not limited to: global real-time sorting, row number display, column freezing, automatic line wrapping of column headings, and blank column filtering.

The column header sorting function of the secondarily developed table is bound to the ORDER BY operation that sorts the result set of the in-memory database, so that clicking a table header performs a global sort.

How to carry out secondary development on a table belongs to the prior art and is not repeated here.
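As a rough illustration of binding a header click to an ORDER BY over an in-memory database, the sketch below uses Python's standard `sqlite3` module with a `:memory:` database. The table schema, the `SORTABLE_COLUMNS` whitelist, and the `on_header_click` handler are hypothetical names chosen for the example; the source does not specify which database or table control is used.

```python
import sqlite3

# Hypothetical schema standing in for the in-memory database of big data files.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (name TEXT, file_type TEXT, size_kb INTEGER)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?)",
    [("b.jpg", "jpg", 300), ("a.doc", "doc", 120), ("c.txt", "txt", 5)],
)

SORTABLE_COLUMNS = {"name", "file_type", "size_kb"}


def on_header_click(column, descending=False):
    """Return every row, globally sorted by the clicked column."""
    # Column names cannot be bound as SQL parameters, so check a whitelist first.
    if column not in SORTABLE_COLUMNS:
        raise ValueError(f"unknown column: {column}")
    direction = "DESC" if descending else "ASC"
    query = f"SELECT name, file_type, size_kb FROM files ORDER BY {column} {direction}"
    return conn.execute(query).fetchall()


rows_by_size = on_header_click("size_kb")
```

Because the sort runs in the database rather than on the rows currently displayed, the ordering is global over the whole result set, which is the behavior the text describes.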

S1012: Establish an association between the display logic of the table and the in-memory database;

Here, after the association between the display logic of the table and the in-memory database is established, the method further comprises:

S1013: Identify an operation instruction on the table and, according to the operation instruction and the association, obtain the original file that contains big data files of different types from the in-memory database and present it in table form.

Here, the identified operation instructions are the various table layout instructions entered by the user.

After the large-volume data is obtained from the in-memory database according to the operation instruction on the table and the association, it is first cached to an intermediate file and then presented in the table from that intermediate file. In this way, presenting and laying out large-volume data through the table reduces the system's memory footprint while still allowing the data to be presented and operated on.

The embodiments of the present invention implement layout functions through secondary development of a table and bind the secondarily developed table to the in-memory database. When a user needs to view, analyze, or otherwise operate on large-volume data through the table, the table also supports functions such as global real-time sorting and presentation of large-volume data.

Step 102: Split the original file into multiple sub-data files of different categories according to the type of the big data files;

Specifically, splitting the original file according to the type of the big data files may be: splitting according to the naming rules of the different types of big data files;

Here, when splitting the original file into multiple sub-data files of different categories according to the type of the big data files, the method further comprises:

Automatically generating code according to the data table and the fields and field values in the data table, compiling the generated code into a dynamic link library (dll) file or an executable program (exe) file, and executing it to split the original file that contains the big data files.

Here, when performing the split operation, if the quantity requested to be split exceeds the predetermined limit, the user's historical data is queried;

Here, the clustered categories may be the categories obtained after the system clusters a predetermined quantity of big data.

The clustering operation may be performed on a predetermined number of selected data objects. For big data, a representative number of data objects may be selected depending on the data scale.

Each user has a group of data objects, and the data object corresponding to a user may include one or more data items; clustering can therefore be performed on data objects that include one or more data items.

Here, the predetermined limit may be set manually or by the system itself.

In the latter case, the predetermined limit may be calculated according to certain rules. For example, it may be determined by a density-based outlier removal model.

How the predetermined limit is set is described further below, comprising the following steps:

S2321: Sort the split quantities in the user's history information;

For example, they may be sorted in descending or ascending order, giving the sorted orders d_1, d_2, ..., d_n, where n is an integer greater than 1.

S2322: Judge each point d_i that satisfies the following formula to be an outlier;

|d_i - d_{i-k}| > C,  i = k+1, ..., n    (1)

In formula (1), i denotes the i-th order, d_1, d_2, ..., d_n are the split quantities sorted in descending or ascending order, C is a preset threshold, and k is a preset distance. By this formula, if the amount of the i-th order differs from that of the order k positions before it by more than the preset threshold, the point d_i is considered an outlier.

S2323: Remove the outliers;

S2324: Set the maximum value in the group remaining after outlier removal as the predetermined limit.

For example: the order amounts sorted in ascending order are d_1 = 100, d_2 = 110, d_3 = 123, d_4 = 195, d_5 = 229, d_6 = 1410, d_7 = 2100. C is set to 300 and k is set to 3.

Then formula (1) gives:

|d_4 - d_1| = 195 - 100 = 95 < 300;

|d_5 - d_2| = 229 - 110 = 119 < 300;

|d_6 - d_3| = 1410 - 123 = 1287 > 300;

|d_7 - d_4| = 2100 - 195 = 1905 > 300;

Therefore, d_6 and d_7 are judged to be outliers. After d_6 and d_7 are removed from the group, the remaining points are d_1 to d_5; since the maximum of d_1 to d_5 is 229, this maximum, 229, is set as the predetermined limit.
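Steps S2321 to S2324 can be sketched in a few lines of Python. The function name `predetermined_limit` and its parameter names are assumptions made for this illustration; `threshold` corresponds to C and `distance` to k in formula (1), and ascending order is chosen for the sort.

```python
def predetermined_limit(quantities, threshold, distance):
    """Apply steps S2321 to S2324 and return the resulting limit."""
    d = sorted(quantities)  # S2321: sort in ascending order
    # S2322: with 0-based indexing, i runs from `distance` to n - 1, and d[i]
    # is an outlier when it differs from d[i - distance] by more than threshold.
    outliers = {
        i
        for i in range(distance, len(d))
        if abs(d[i] - d[i - distance]) > threshold
    }
    kept = [v for i, v in enumerate(d) if i not in outliers]  # S2323: remove outliers
    return max(kept)  # S2324: the largest remaining value becomes the limit


limit = predetermined_limit(
    [100, 110, 123, 195, 229, 1410, 2100], threshold=300, distance=3
)
```

With the example values from the text, `limit` is 229. The first `distance` values are never tested by formula (1), so the remaining group is non-empty whenever the input is non-empty and `distance` is at least 1.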

Here, the split quantity may refer to a split type within the split information; that is, the split quantity may be the quantity of a certain class of goods within that information.

The split information may include split information under different scenarios. Where the split information is order information under different scenarios, the predetermined limit may correspondingly include predetermined limits under the different scenarios. The predetermined limit for each scenario can be calculated using S2321 to S2324 above.

The clustering in the embodiments of the present invention can be completed in advance on a predetermined quantity of big data. When a new user's data request is received, the big data including the newly received data does not need to be re-clustered; instead, the corresponding category is simply looked up among the clustered categories based on the user's historical data, and a dynamic quota is obtained according to a predetermined mapping rule. This saves hardware resources.
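The lookup described here, reusing clusters computed in advance instead of re-clustering for every request, might look like the sketch below. Everything concrete in it is an assumption for illustration: the cluster centres, the quota table, the use of the mean of the user's history as the lookup key, and the nearest-centre rule are not specified by the source.

```python
# Hypothetical cluster centres and quota mapping, assumed to be computed offline.
CLUSTER_CENTRES = {"small": 120.0, "medium": 900.0, "large": 5000.0}
QUOTA_BY_CATEGORY = {"small": 300, "medium": 2000, "large": 10000}


def lookup_category(history):
    """Return the precomputed category whose centre is nearest the user's mean."""
    if not history:
        raise ValueError("history must not be empty")
    mean = sum(history) / len(history)
    return min(CLUSTER_CENTRES, key=lambda name: abs(CLUSTER_CENTRES[name] - mean))


def dynamic_quota(history):
    """Map the user's category to a quota without re-running the clustering."""
    return QUOTA_BY_CATEGORY[lookup_category(history)]


category = lookup_category([100, 110, 123, 195, 229])
quota = dynamic_quota([100, 110, 123, 195, 229])
```

The cost of a request is then a scan over a handful of centres rather than a full clustering pass, which is the saving the text attributes to this approach.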

Step 103: Assign the sub-data files to corresponding servers according to their categories, and process the sub-data files simultaneously on the different servers.
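Steps 102 and 103 together can be approximated on a single machine as follows. This is a sketch under assumptions: the file extension is taken as the naming rule, a thread pool with one worker per category stands in for one server per category, and `process_group` is a placeholder for whatever processing a real server would perform.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from pathlib import PurePath


def split_by_type(file_names):
    """Group file names by extension (the naming rule assumed here)."""
    groups = defaultdict(list)
    for name in file_names:
        groups[PurePath(name).suffix.lower()].append(name)
    return dict(groups)


def process_group(category, names):
    """Placeholder for the work one server would do on its category."""
    return category, len(names)


def process_all(file_names):
    """Give each category its own worker and process all categories concurrently."""
    groups = split_by_type(file_names)
    # One worker per category stands in for one server per category.
    with ThreadPoolExecutor(max_workers=max(1, len(groups))) as pool:
        futures = [pool.submit(process_group, c, n) for c, n in groups.items()]
        return dict(f.result() for f in futures)


counts = process_all(["a.doc", "b.jpg", "c.JPG", "d.txt", "e.jpg"])
```

Because each category is handed to exactly one worker, no two workers touch the same sub-data file, which matches the resource-contention argument made for the concurrency policy.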

The specific implementation of the big data processing method in an embodiment of the present invention is described in detail below.

As shown in Fig. 2, the specific implementation flow of the big data processing method in an embodiment of the present invention comprises the following steps:

Step 201: Obtain an original file that contains big data files of different types;

Step 202: Determine whether the big data files in the original file fall into more than five categories; if there are five or fewer categories, perform step 203; if there are more than five, jump to step 207 and end this processing flow;

Step 203: Split the original file into multiple sub-data files according to the different types of the big data files;

Step 204: Count the number of first-category data items as A, second-category as B, third-category as C, fourth-category as D, and fifth-category as E, and calculate the encoded value N according to the following formula;

N = A*10¹ + B*10² + C*10³ + D*10⁴ + E*10⁵

A: count of first-category data (e.g. .doc files), weighted by 10¹;

B: count of second-category data (e.g. .jpg files), weighted by 10²;

C: count of third-category data (e.g. .txt files), weighted by 10³;

D: count of fourth-category data (e.g. .pdf files), weighted by 10⁴;

E: count of fifth-category data (e.g. .exe files), weighted by 10⁵;

If there are fewer than five data categories, the count for each missing category is taken as 0.
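The encoding in step 204 can be sketched as below. The function name `encode_counts` is an assumption. The single-digit check is also an assumption added for the example: the source does not state it, but a count of 10 or more would spill into the next category's decimal position and make N ambiguous.

```python
def encode_counts(counts):
    """Pack up to five per-category counts (A..E) into the single value N."""
    if len(counts) > 5:
        raise ValueError("at most five categories are supported")
    # Missing categories count as 0, as the text specifies.
    padded = list(counts) + [0] * (5 - len(counts))
    # Assumption: each count is a single digit (0-9); larger counts would
    # overlap with the next category's decimal position.
    if any(not 0 <= c <= 9 for c in padded):
        raise ValueError("each count must be between 0 and 9")
    return sum(c * 10 ** (i + 1) for i, c in enumerate(padded))


n = encode_counts([1, 6, 8, 4, 5])
```

For A = 1, B = 6, C = 8, D = 4, E = 5 this gives N = 548610, the value used in the document's worked example.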

Step 205: Send the encoded value N to the server;

Step 206: The server performs data recombination on N;

That is, in step 103, the server's processing of the multiple sub-data files comprises the following step: the server performs data recombination according to the following formulas:

N/10¹ = A (round down to an integer and keep only the ones digit);

N/10² = B (round down to an integer and keep only the ones digit);

N/10³ = C (round down to an integer and keep only the ones digit);

N/10⁴ = D (round down to an integer and keep only the ones digit);

N/10⁵ = E (round down to an integer and keep only the ones digit);

Step 207: End.

The above solution uses a single encoded value N to convey the large-volume data to the server, avoiding the congestion and confusion on the server channel that would result from sending multiple data items to the server together. Through concurrency policy control, the embodiments of the present invention can deploy multiple servers to split and process big data files at the same time, greatly improving the processing capacity of the system and ensuring that the system processes large-volume data quickly and efficiently within the specified time. Moreover, this concurrency policy of assigning different servers to split and process files according to file naming rules ensures that only one server splits the original file and that each sub-data file produced by the split is likewise handled by one server, thereby avoiding resource contention.

For example, suppose the count A of first-category data (.doc files) is 1; the count B of second-category data (.jpg files) is 6; the count C of third-category data (.txt files) is 8; the count D of fourth-category data (.pdf files) is 4; and the count E of fifth-category data (.exe files) is 5;

Then, using the formula N = A*10¹ + B*10² + C*10³ + D*10⁴ + E*10⁵, the encoded value is N = 1*10¹ + 6*10² + 8*10³ + 4*10⁴ + 5*10⁵ = 548610. The encoded value N = 548610 is sent to the server, which then processes it as follows:

548610/10¹ = 54861 (round down to an integer and keep only the ones digit), i.e. A = 1;

548610/10² = 5486.1 (round down to an integer and keep only the ones digit), i.e. B = 6;

548610/10³ = 548.61 (round down to an integer and keep only the ones digit), i.e. C = 8;

548610/10⁴ = 54.861 (round down to an integer and keep only the ones digit), i.e. D = 4;

548610/10⁵ = 5.4861 (round down to an integer and keep only the ones digit), i.e. E = 5.
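The server-side recombination can be sketched as follows; `decode_counts` is a hypothetical name. Floor division is used because it reproduces the values in the worked example (548.61 yields C = 8 and 54.861 yields D = 4), whereas rounding to the nearest integer would give 9 and 5 for those two digits.

```python
def decode_counts(n):
    """Recover the five counts (A..E) from the encoded value N."""
    # Floor division drops the fractional part; % 10 keeps only the ones digit.
    return [(n // 10 ** (i + 1)) % 10 for i in range(5)]


a, b, c, d, e = decode_counts(548610)
```

This recovers A = 1, B = 6, C = 8, D = 4, E = 5 from N = 548610, and it only round-trips correctly when every count is a single digit.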

To implement the above method, an embodiment of the present invention also provides a big data processing system. As shown in Fig. 3, the system comprises an acquiring unit 31, a splitting unit 32, and a processing unit 33; wherein,

The acquiring unit 31 is configured to obtain an original file that contains big data files of different types;

The splitting unit 32 is configured to split the original file into multiple sub-data files of different categories according to the type of the big data files;

The processing unit 33 is configured to assign the sub-data files to corresponding servers according to their categories, and to process the sub-data files simultaneously on the different servers.

Here, the acquiring unit 31 comprises:

A table creating unit 311, configured to create a table that implements layout functions through secondary development;

An association establishing unit 312, configured to establish an association between the display logic of the table and the in-memory database;

A first processing unit 313, configured to identify an operation instruction on the table and, according to the operation instruction and the association, obtain the original file that contains big data files of different types from the in-memory database and present it in table form;

The acquiring unit 31 further comprises an index establishing unit 314, configured to, after the association establishing unit 312 establishes the association between the display logic of the table and the in-memory database, build an index on the data in the in-memory database according to the row numbers of the table and read the corresponding data in the in-memory database according to the index.

When the splitting unit 32 splits the original file into multiple sub-data files of different categories, the system further comprises:

A collecting unit 321, configured to collect the SQL statement corresponding to the split operation on the original file;

A parsing unit 322, configured to parse the data table in the SQL statement and the fields and field values in the data table;

A second processing unit 323, configured to automatically generate code according to the data table and the fields and field values in the data table, compile the generated code into a dll file or an exe file, and execute it to split the original file that contains the big data files.

When performing the split operation, if the quantity requested to be split exceeds the predetermined limit, the user's historical data is queried;

In practical applications, the acquiring unit 31, the splitting unit 32, and the processing unit 33 may each be implemented by a central processing unit (CPU, Central Processing Unit), a microprocessor (MPU, Micro Processor Unit), a digital signal processor (DSP, Digital Signal Processor), or a field programmable gate array (FPGA, Field Programmable Gate Array) located on a terminal server.

The embodiments of the present invention obtain an original file that contains big data files of different types; split the original file into multiple sub-data files of different categories according to the type of the big data files; and assign the sub-data files to corresponding servers according to their categories and process them simultaneously on the different servers. In this way, large-volume data can be processed quickly and efficiently within the specified time, reducing delay in data processing.

The above are only preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A big data processing method, characterized in that the method comprises:

Obtaining an original file that contains big data files of different types;

2. The method according to claim 1, characterized in that obtaining the original file that contains big data files of different types comprises:

3. The method according to claim 2, characterized in that, after establishing the association between the display logic of the table and the in-memory database, the method further comprises:

4. The method according to claim 1, 2 or 3, characterized in that, when splitting the original file into multiple sub-data files of different categories, the method further comprises:

5. The method according to claim 1, characterized in that, when performing the split operation, if the quantity requested to be split exceeds a predetermined limit, the user's historical data is queried;

6. A big data processing system, characterized in that the system comprises: an acquiring unit, a splitting unit, and a processing unit; wherein,

7. The system according to claim 6, characterized in that the acquiring unit comprises:

8. The system according to claim 7, characterized in that the acquiring unit further comprises an index establishing unit, configured to, after the association establishing unit establishes the association between the display logic of the table and the in-memory database, build an index on the data in the in-memory database according to the row numbers of the table and read the corresponding data in the in-memory database according to the index.

9. The system according to claim 6, 7 or 8, characterized in that, when the splitting unit splits the original file into multiple sub-data files of different categories, the system further comprises:

10. The system according to claim 6, characterized in that, when performing the split operation, if the quantity requested to be split exceeds a predetermined limit, the user's historical data is queried;