CN108920410A

CN108920410A - Big data processing device and method

Info

Publication number: CN108920410A
Application number: CN201810648667.0A
Authority: CN
Inventors: 王旭生; 梁娜; 王健; 邱志祺; 安逸
Original assignee: North China University of Science and Technology
Current assignee: North China University of Science and Technology
Priority date: 2018-06-22
Filing date: 2018-06-22
Publication date: 2018-11-30

Abstract

The invention discloses a big data processing device and method. The device comprises a primary processor, preliminary processors and secondary processors. A master data processing module is provided in the primary processor, and the primary processor is connected to multiple preliminary processors. A preliminary data processing module is provided in each preliminary processor, and each preliminary processor is connected to multiple secondary processors. In the method, the target data to be processed is obtained by a preliminary processor; pre-established data shape key indicators, which include attribute information of the data itself, are obtained by a secondary processor; and the data shape of the target data to be processed is determined according to the pre-established data shape key indicators. Data exchange is completed promptly, which eases the burden on enterprise systems of handling large amounts of data at the same time. Data processing is accurate and timely, and work efficiency is improved.

Description

Big data processing device and method

Technical field

The present invention relates to the field of data processing, and specifically to a big data processing device and method.

Background technique

With the arrival of the cloud era, big data has attracted more and more attention. Analyst teams consider that big data is usually used to describe the large amounts of unstructured and semi-structured data that a company creates, and that loading such data into a relational database for analysis would cost too much time and money. Big data analysis is often associated with cloud computing, because real-time analysis of large data sets needs a framework such as MapReduce to distribute work across tens, hundreds or even thousands of computers.

Big data requires special techniques to process large amounts of data effectively within a tolerable elapsed time. Technologies applicable to big data include massively parallel processing (MPP) databases, data mining grids, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems.

Summary of the invention

The purpose of the present invention is to provide a big data processing device and method, so as to solve the problems raised in the background above.

To achieve the above object, the present invention provides the following technical solutions:

A big data processing device comprises a primary processor, preliminary processors and secondary processors. A master data processing module is provided in the primary processor, and the primary processor is connected to multiple preliminary processors. A preliminary data processing module is provided in each preliminary processor, and each preliminary processor is connected to multiple secondary processors. A secondary data processing module is provided in each secondary processor. A data detection module is provided in each of the primary processor, the preliminary processors and the secondary processors; the data detection module is used to detect abnormal input data.
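The three-level topology described above can be pictured with a small sketch. This is only an illustration under stated assumptions: the class name `Processor`, the helper `build_device`, and the anomaly rule in `default_detector` (a record must be a non-empty dict) are hypothetical, since the text does not specify how the data detection module decides that input data is abnormal.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


def default_detector(record: object) -> bool:
    """Hypothetical anomaly rule: a valid record is a non-empty dict."""
    return isinstance(record, dict) and len(record) > 0


@dataclass(eq=False)
class Processor:
    name: str
    level: str  # "primary", "preliminary" or "secondary"
    detector: Callable[[object], bool] = default_detector
    parent: Optional["Processor"] = None
    children: List["Processor"] = field(default_factory=list)

    def connect(self, child: "Processor") -> "Processor":
        """Link both directions: the parent knows the child and vice versa."""
        child.parent = self
        self.children.append(child)
        return child

    def accept(self, records):
        """Split input into (valid, anomalous) using the detection module."""
        valid, anomalous = [], []
        for record in records:
            (valid if self.detector(record) else anomalous).append(record)
        return valid, anomalous


def build_device(n_preliminary: int, n_secondary: int) -> Processor:
    """One primary processor, several preliminary ones, several secondary each."""
    primary = Processor("primary-1", "primary")
    for i in range(n_preliminary):
        pre = primary.connect(Processor(f"preliminary-{i + 1}", "preliminary"))
        for j in range(n_secondary):
            pre.connect(Processor(f"secondary-{i + 1}.{j + 1}", "secondary"))
    return primary
```

Every node carries its own detector, mirroring the statement that a data detection module is present at all three levels; a real device would presumably use a richer rule than this placeholder.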

A big data processing method comprises the following steps:

Step S1: obtain the target data to be processed through the preliminary processor.

Step S2: obtain the pre-established data shape key indicators through the secondary processor; the data shape key indicators include attribute information of the data itself.

In one embodiment, the pre-established data shape key indicators may include, but are not limited to, one or more of the following indicators:

The data volume of a predetermined data table; the data volume of one or more partitions of the predetermined data table; the data volume for a specified composite primary key; the number of values after field deduplication; the number of field values that are NULL; the maximum field value; the minimum field value; the maximum length among the field values; the minimum length among the field values; the calculation result of a specific field; the number of field values equal to 0; the percentage of field values equal to 0 relative to the data volume of the whole table; and the percentage of field values that are NULL relative to the data volume of the whole table.

Before the data is processed, the data shape key indicators to be calculated need to be defined. The main key indicators are as follows.

At the table level: the data volume of the whole table, or the data volume of one partition.

The data volume for the specified composite primary key.

Indicators calculated for each field:

Total after deduplication: the number of values in this field after deduplication;

NULL total: the number of values in this field that are null;

Maximum value: the maximum value of this field; for a non-numeric field (string type, etc.), it is calculated according to the default logic of the max function.

Minimum value: the minimum value of this field; for a non-numeric field (string type, etc.), it is calculated according to the default logic of the min function.

Maximum length: the maximum length among this field's values;

Maximum length example: one of the field values that has the maximum length;

Minimum length: the minimum length among this field's values;

Minimum length example: one of the field values that has the minimum length.

Field enumeration value calculation: for a specified field, for example the field gender, the calculated result may be: male: 1000; female: 20000.

0-value count: the number of values in this field that are 0.

0-value ratio: 0-value count / data volume of the whole table.

Null-value ratio: null-value count / data volume of the whole table.
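The per-field indicators listed above can be sketched in a few lines. The function names `profile_field`, `enum_counts` and `composite_key_count` are hypothetical, and several details are assumptions not stated in the text: NULL is modelled as Python `None`, the deduplicated total excludes NULLs (as SQL `COUNT(DISTINCT ...)` does), length is measured on the string form of a value, and the composite-key indicator is interpreted as the number of distinct key combinations.

```python
from collections import Counter


def profile_field(values):
    """Compute the per-field indicators for one column given as a list."""
    total = len(values)
    non_null = [v for v in values if v is not None]
    # Pair each value with the length of its string form.
    lengths = [(len(str(v)), v) for v in non_null]
    max_len, max_len_example = max(lengths, key=lambda p: p[0], default=(None, None))
    min_len, min_len_example = min(lengths, key=lambda p: p[0], default=(None, None))
    zero_count = sum(
        1 for v in non_null
        if isinstance(v, (int, float)) and not isinstance(v, bool) and v == 0
    )
    null_count = total - len(non_null)
    return {
        "distinct_count": len(set(non_null)),
        "null_count": null_count,
        # Assumes the column is homogeneous, so max/min use Python's default ordering.
        "max": max(non_null, default=None),
        "min": min(non_null, default=None),
        "max_len": max_len,
        "max_len_example": max_len_example,
        "min_len": min_len,
        "min_len_example": min_len_example,
        "zero_count": zero_count,
        "zero_ratio": zero_count / total if total else 0.0,
        "null_ratio": null_count / total if total else 0.0,
    }


def enum_counts(values):
    """Enumeration value calculation, e.g. {'male': 1000, 'female': 20000}."""
    return dict(Counter(v for v in values if v is not None))


def composite_key_count(rows, key_fields):
    """Number of distinct combinations of the specified composite key."""
    return len({tuple(row[k] for k in key_fields) for row in rows})
```

For a column such as `[5, 0, None, 12, 0, 5]` the sketch reports three distinct values, one NULL and two zeros; the ratios are taken against the full row count, matching the "count / data volume of the whole table" definition.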

Step S3: the primary processor 1 determines the data shape of the target data to be processed according to the pre-established data shape key indicators.

Data shape refers to what the data inside a table looks like once the table has been obtained. A data shape report is in fact an aggregation of the basic checkpoints for the table's data; through these data shape indicators, data problems in the table can be seen at a glance. Based on the above idea, a platform can be implemented quickly in Java: the user only needs to fill in some configuration to obtain the data shape report of the target data. The implementation cost is low; once the basic idea is grasped, it can be implemented quickly in Java or SQL.
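As an illustration of how such a report could be produced with plain SQL aggregates, the following sketch issues the queries from Python against SQLite. The choice of SQLite, the function name `shape_report` and the report layout are assumptions made for brevity; the text only says that Java or SQL can be used.

```python
import sqlite3


def shape_report(conn, table, columns):
    """Aggregate basic shape indicators for the given columns of one table."""
    # Identifiers cannot be bound as SQL parameters, so validate them first.
    for name in (table, *columns):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    report = {"row_count": row_count, "fields": {}}
    labels = ("distinct", "nulls", "min", "max", "min_len", "max_len")
    for col in columns:
        row = conn.execute(
            f'SELECT COUNT(DISTINCT "{col}"), '
            f'SUM(CASE WHEN "{col}" IS NULL THEN 1 ELSE 0 END), '
            f'MIN("{col}"), MAX("{col}"), '
            f'MIN(LENGTH("{col}")), MAX(LENGTH("{col}")) '
            f'FROM "{table}"'
        ).fetchone()
        report["fields"][col] = dict(zip(labels, row))
    return report
```

A caller would open a connection, for example `sqlite3.connect(":memory:")`, load or attach the table, and pass the table name and the columns of interest; the returned dictionary is one possible shape for the "report" and can be rendered however the platform prefers.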

The above method of the embodiment of the present invention obtains the target data to be processed, obtains the pre-established data shape key indicators, which include attribute information of the data itself, and determines the data shape of the target data to be processed according to those indicators. This technical solution uses key indicators to reflect the data shape of a given portion of data, so that the data shape can be reflected simply, quickly and completely.

In one embodiment, the method may further include step S4:

Step S4: write the target data to be processed into one predetermined data table.

In this embodiment, the target data to be processed is handled in units of tables. If the data spans several tables, or only part of the data is needed, the data to be processed is written into a specified table. In short, the target data to be processed is placed in one data table so that it can be analyzed and processed conveniently.
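A minimal sketch of step S4, assuming tables are held in memory as lists of dict rows. The function name `write_to_target_table`, the `predicate` and `columns` parameters and the `_source` bookkeeping column are all hypothetical; they only illustrate gathering cross-table or partial data into one table.

```python
def write_to_target_table(sources, predicate=lambda row: True, columns=None):
    """Collect rows from several source tables into a single target table.

    sources:   mapping of table name -> list of dict rows
    predicate: keeps only the part of the data that is needed
    columns:   optional subset of columns to copy
    """
    target = []
    for table_name, rows in sources.items():
        for row in rows:
            if not predicate(row):
                continue
            picked = dict(row) if columns is None else {c: row.get(c) for c in columns}
            # Record where each row came from (an illustrative extra column).
            target.append({"_source": table_name, **picked})
    return target
```

Once the target data sits in one table, the indicator calculations only have to deal with a single table, which is the point the embodiment makes.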

As a further solution of the present invention: an input device is connected to the secondary processor, and a data input module is provided in the input device.

As a further solution of the present invention: a secondary data processing module is provided in the secondary processor.

As a further solution of the present invention: transmission between the primary processor and the preliminary processors, and between the preliminary processors and the secondary processors, is bidirectional.

As a further solution of the present invention: step S3 may include the following steps S31-S33:

Step S31: receive the input table name of the predetermined data table;

Step S32: clean the predetermined data table according to the pre-established data shape key indicators to obtain the data shape of the target data in the predetermined data table.

Step S33: receive the input partition to be processed of the predetermined data table and/or receive one or more input fields whose enumeration values are to be calculated.
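The inputs of steps S31 and S33 and the computation of step S32 can be sketched together as follows. The names `ShapeRequest` and `run_step_s3` are hypothetical, rows are assumed to be dicts, and the partition is assumed to be stored in a column literally named `partition`; none of this is specified in the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ShapeRequest:
    table_name: str                                        # received in S31
    partition: Optional[str] = None                        # received in S33
    enum_fields: List[str] = field(default_factory=list)   # received in S33


def run_step_s3(request, tables):
    """Apply the request to in-memory tables and return a small shape summary (S32)."""
    rows = tables[request.table_name]
    if request.partition is not None:
        rows = [r for r in rows if r.get("partition") == request.partition]
    shape = {"row_count": len(rows), "enums": {}}
    for name in request.enum_fields:
        counts = {}
        for r in rows:
            value = r.get(name)
            counts[value] = counts.get(value, 0) + 1
        shape["enums"][name] = counts
    return shape
```

Although the steps are numbered S31, S32, S33, the sketch reads both inputs before computing anything, since the partition and the enumeration fields have to be known before the table can be processed.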

As a further solution of the present invention: the operating process of step S3 is as follows. Create a new task and fill in the corresponding information according to the prompts, for example the table name of the data table to be analyzed, the partition to be processed of the predetermined data table, and one or more fields whose enumeration values are to be calculated. After filling in the information, click the execute button to start analyzing the data in the table. When the task has finished, a notification can be sent to the user; the user can also click the log button to view the task's execution progress in real time.
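The task life cycle described above (create, execute, log, notify) can be sketched as a small class. `ShapeTask`, its status strings and the `analyze` and `notify` callbacks are hypothetical stand-ins for the platform's execute button, log view and notification channel.

```python
class ShapeTask:
    """One analysis task: created from user input, executed, logged, then notified."""

    def __init__(self, table_name, analyze, notify):
        self.table_name = table_name
        self._analyze = analyze   # callable(table_name) -> result
        self._notify = notify     # callable(message), e.g. send a mail or message
        self.log = []             # what the "log" button would display
        self.status = "created"
        self.result = None

    def execute(self):
        """What clicking the execute button would trigger."""
        self.status = "running"
        self.log.append(f"started analysis of {self.table_name}")
        try:
            self.result = self._analyze(self.table_name)
            self.status = "finished"
        except Exception as exc:  # keep the failure visible in the log
            self.status = "failed"
            self.log.append(f"error: {exc}")
        self.log.append(f"task {self.status}")
        self._notify(f"Task for {self.table_name} {self.status}")
        return self.result
```

The sketch runs synchronously; a real platform would more likely run the analysis in the background so that the log can be polled while the task is still executing.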

Compared with the prior art, the beneficial effects of the present invention are as follows. The target data to be processed is obtained by the preliminary processor; the pre-established data shape key indicators, which include attribute information of the data itself, are obtained by the secondary processor; and the data shape of the target data to be processed is determined according to those indicators. Because key indicators are used to reflect the data shape of a given portion of data, data exchange is completed promptly and the burden on enterprise systems of processing large amounts of data at the same time is eased. Data processing is accurate and timely, so the enterprise can understand and handle related problems in time, and work efficiency is improved.

Brief description of the drawings

Fig. 1 is a structural schematic diagram of the big data processing device.

In the figure: 1 - primary processor; 2 - preliminary processor; 3 - secondary processor.

Detailed description of the embodiments

The technical solution of this patent is explained in further detail below with reference to specific embodiments.

Referring to Fig. 1, a big data processing device comprises a primary processor 1, preliminary processors 2 and secondary processors 3. A master data processing module is provided in the primary processor 1, and the primary processor 1 is connected to multiple preliminary processors 2. A preliminary data processing module is provided in each preliminary processor 2, and each preliminary processor 2 is connected to multiple secondary processors 3. A secondary data processing module is provided in each secondary processor 3. An input device is connected to the secondary processor 3, and a data input module is provided in the input device. A data detection module is provided in each of the primary processor 1, the preliminary processors 2 and the secondary processors 3; the data detection module is used to detect abnormal input data. Transmission between the primary processor 1 and the preliminary processors 2, and between the preliminary processors 2 and the secondary processors 3, is bidirectional.

A big data processing method comprises the following steps:

Step S1: obtain the target data to be processed through the preliminary processor 2.

Step S2: obtain the pre-established data shape key indicators through the secondary processor 3; the data shape key indicators include attribute information of the data itself.

The data volume for the specified composite primary key.

Indicators calculated for each field:

NULL total: the number of values in this field that are null;

Maximum value: the maximum value of this field; for a non-numeric field (string type, etc.), it is calculated according to the default logic of the max function;

Maximum length: the maximum length among this field's values;

Minimum length: the minimum length among this field's values;

Minimum length example: one of the field values that has the minimum length;

0-value count: the number of values in this field that are 0;

0-value ratio: 0-value count / data volume of the whole table;

Null-value ratio: null-value count / data volume of the whole table.

Data shape refers to what the data inside a table looks like once the table has been obtained. A data shape report is in fact an aggregation of the basic checkpoints for the table's data; through these data shape indicators, data problems in the table can be seen at a glance. Based on the above idea, a platform can be implemented quickly in Java: the user only needs to fill in some configuration to obtain the data shape report of the target data. The implementation cost is low; once the basic idea is grasped, it can be implemented quickly in Java or SQL.

In one embodiment, the method may further include the following step:

Step S4: write the target data to be processed into one predetermined data table.

In one embodiment, step S3 may include the following steps S31-S33:

Step S31: receive the input table name of the predetermined data table;

Step S3 may further include step S33: receive the input partition to be processed of the predetermined data table and/or receive one or more input fields whose enumeration values are to be calculated.

The operating process of step S3 is as follows. Create a new task and fill in the corresponding information according to the prompts, for example the table name of the data table to be analyzed, the partition to be processed of the predetermined data table, and one or more fields whose enumeration values are to be calculated. After filling in the information, click the execute button to start analyzing the data in the table. When the task has finished, a notification can be sent to the user; the user can also click the log button to view the task's execution progress in real time.

For example: the total count of a field and its total after deduplication can be used to verify the uniqueness of a key field and to find duplicate data, and, combined with the business, to see whether the total meets expectations. The minimum and maximum field lengths can be used to check for dirty data, such as overlong address data or sellerid values that are only a single digit. The ratio distribution of field enumeration values can be used to verify that the enumeration values are reasonable. The maximum and minimum field values can be checked for reasonableness in light of the business, as can the number of null values in a field. In particular, when the cleaned result table is to be supplied to an application for direct display, it must satisfy the application system's constraints on data shape, for example that a certain field cannot be null or that the length of a certain field cannot be too long.
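The checks in this example can be sketched as a comparison between the computed indicators and a set of constraints. The function name `check_field`, the dictionary keys and the problem labels are hypothetical; the actual thresholds would come from the business or from the application system that displays the result table.

```python
def check_field(indicators, constraints):
    """Compare one field's indicators with application constraints.

    indicators:  dict with row_count, distinct_count, null_count, min_len, max_len
    constraints: dict with optional keys unique, not_null, min_len, max_len
    Returns a list of problem labels; an empty list means no problem was found.
    """
    problems = []
    non_null = indicators["row_count"] - indicators["null_count"]
    # Uniqueness: every non-null value must be distinct.
    if constraints.get("unique") and indicators["distinct_count"] != non_null:
        problems.append("duplicates")
    if constraints.get("not_null") and indicators["null_count"] > 0:
        problems.append("nulls")
    low, high = constraints.get("min_len"), constraints.get("max_len")
    # Length bounds catch dirty data such as one-digit ids or overlong addresses.
    if low is not None and indicators["min_len"] is not None and indicators["min_len"] < low:
        problems.append("too_short")
    if high is not None and indicators["max_len"] is not None and indicators["max_len"] > high:
        problems.append("too_long")
    return problems
```

Reasonableness checks that depend on business knowledge, such as whether an enumeration distribution looks plausible, are left out because they cannot be reduced to a fixed rule here.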

The preferred embodiments of this patent have been described in detail above, but this patent is not limited to the above embodiments. Within the scope of knowledge possessed by a person of ordinary skill in the art, various changes can also be made without departing from the purpose of this patent.

Claims

1. A big data processing device, comprising a primary processor (1), preliminary processors (2) and secondary processors (3), wherein a master data processing module is provided in the primary processor (1), the primary processor (1) is connected to multiple preliminary processors (2), each preliminary processor (2) is connected to multiple secondary processors (3), and a data detection module is provided in each of the primary processor (1), the preliminary processors (2) and the secondary processors (3), the data detection module being used to detect abnormal input data.

2. The big data processing device according to claim 1, characterized in that a preliminary data processing module is provided in the preliminary processor (2).

3. The big data processing device according to claim 1, characterized in that an input device is connected to the secondary processor (3), and a data input module is provided in the input device.

4. The big data processing device according to claim 1, characterized in that a secondary data processing module is provided in the secondary processor (3).

5. The big data processing device according to claim 1, characterized in that transmission between the primary processor (1) and the preliminary processors (2), and between the preliminary processors (2) and the secondary processors (3), is bidirectional.

6. A big data processing method, characterized by comprising the following steps:

Step S1: obtain the target data to be processed through the preliminary processor (2);

Step S2: obtain the pre-established data shape key indicators through the secondary processor (3), the data shape key indicators including attribute information of the data itself;

Step S3: the primary processor (1) determines the data shape of the target data to be processed according to the pre-established data shape key indicators.

7. The big data processing method according to claim 6, characterized in that step S3 may include the following steps S31-S33:

Step S31: receive the input table name of the predetermined data table;

Step S32: clean the predetermined data table according to the pre-established data shape key indicators to obtain the data shape of the target data in the predetermined data table;

Step S33: receive the input partition to be processed of the predetermined data table and/or receive one or more input fields whose enumeration values are to be calculated.

8. The big data processing method according to claim 6, characterized in that the operating process of step S3 is: create a new task and fill in the corresponding information according to the prompts, for example the table name of the data table to be analyzed, the partition to be processed of the predetermined data table, and one or more fields whose enumeration values are to be calculated; after filling in the information, click the execute button to start analyzing the data in the table; when the task has finished, a notification can be sent to the user, and the user can click the log button to view the task's execution progress in real time.