CN109241107A

CN109241107A - Big data governance device based on Hadoop

Info

Publication number: CN109241107A
Application number: CN201810879556.0A
Authority: CN
Inventors: 鄂海红; 宋美娜; 白杨
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2018-08-03
Filing date: 2018-08-03
Publication date: 2019-01-18

Abstract

The invention discloses a Hadoop-based big data governance device, comprising: a data governance information management module for maintaining the data governance operation information of each data source; a data source selection module for performing governance operations on data imported into the big data platform; a data preview module for displaying the basic information of each data table from the perspective of a structured database; a metadata management module for presenting the metadata information of the data tables to the user in multiple dimensions; a data quality management module for checking the specific missing information of each field in a data table and setting corresponding fill rules to fill in the missing information; and a multi-source data fusion module for re-fusing and aggregating multiple data tables from multiple data sources to obtain a new data table, which is then analyzed further. The device implements these functional modules with big data components and provides a highly reliable data foundation for subsequent analysis and queries.

Description

Big data governance device based on Hadoop

Technical field

The present invention relates to the technical field of data processing, and in particular to a Hadoop-based big data governance device.

Background art

With the spread of big data technology and related applications, data has become another key asset alongside manpower, physical goods, finance, technology, intellectual property and relationships. By analyzing existing data, an enterprise can better understand its recent operating conditions, how users use its services, and so on, and can therefore optimize its operations more precisely. At present, however, analysts do not know the actual state of the business data, so they must spend a great deal of time studying business database documentation or consulting business staff. Data preparation also requires dedicated data engineers to perform ETL, which tends to make delivery fall behind and makes errors likely in intermediate steps. Once information systems have developed to a certain stage, data resources become strategic assets, and effective data governance is a necessary condition for forming data assets. Effective data governance is essential for ensuring that data is accurate, appropriately shared and protected. As enterprises pay increasing attention to data governance, some commercial data governance devices have appeared, mainly comprising functional modules such as metadata management, data standard management and data quality management.

The related art includes the following technical solutions. (1) Metadata is defined; the metadata is imported; the metadata is governed and analyzed to obtain an analysis result; and a metadata map is obtained at least according to the analysis result. (2) Standard data resources that are independent of applications are first gathered, integrated, functionalized and turned into a platform, forming a global, distributed data standardization support and quality control center; the metadata, metamodels, metadata elements and so on of each field are centralized and processed as unified resources, so that the data resources of each application layer are standardized, normalized and quality-controlled; data standardization processing mainly addresses the applicability and management suitability of thousands of metadata standards and of the object classes, definition classes, property classes, representation classes and value-domain classes of data standards, and the data of each application field S1 to Sn is standardized and checked for suitability by comparing it, through interface library calls, with the normative data in the "standard data source". (3) At least one data table is obtained, wherein the at least one data table comes from at least one hospital information system (HIS); the feature of the data in each of the at least one data table is determined, the feature indicating the category of the data; the result of the data in each data table is determined according to a stored correspondence between features and data results; wherein the correspondence is determined by machine learning from the features of the data in each data table and the data results obtained before the current time.

However, the big data governance devices of the related art focus essentially on metadata management: they apply unified standards to the definition, use and analysis of metadata in order to govern metadata information in a standardized way. These management schemes are overly specialized and can only be understood by users with the relevant professional knowledge. Moreover, data governance in a big data scenario is not limited to metadata management; it also includes data quality management, multi-source data fusion, data modeling and other steps, which are equally important for subsequent analysis and mining. In addition, current big data governance devices each target one specific usage scenario and are therefore limited in use, management and extension.

Summary of the invention

The present invention aims to solve, at least to some extent, one of the technical problems in the related art.

To this end, an object of the invention is to propose a Hadoop-based big data governance device that effectively improves the applicability and practicality of big data governance and is simple and easy to implement.

To achieve the above object, an embodiment of one aspect of the invention proposes a Hadoop-based big data governance device, comprising: a data governance information management module for maintaining the data governance operation information of each data source and providing a copy function for governance operations; a data source selection module for performing governance operations on data imported into the big data platform and supporting governance operations on the MySQL data source type and the Hive data source type of structured databases; a data preview module for displaying the basic information of each data table from the perspective of the structured database; a metadata management module for presenting the metadata information of the data tables to the user in multiple dimensions; a data quality management module for checking the specific missing information of each field in the data table and setting corresponding fill rules to fill in the missing information; and a multi-source data fusion module for re-fusing and aggregating multiple data tables from multiple data sources to obtain a new data table, which is then analyzed further.

The Hadoop-based big data governance device of the embodiment of the invention uses big data components to implement functional modules such as data preview, metadata management, multi-source data fusion and data quality management. It helps users understand the real meaning of the data from multiple angles and provides a highly reliable data foundation for subsequent analysis and queries. At the same time, complex operations are hidden in the back end and a clickable interface is exposed, so that users without professional big data skills can also complete data governance operations. This demonstrates the practicality of the device, effectively improves the applicability and practicality of big data governance, and is simple and easy to implement.

In addition, the Hadoop-based big data governance device according to the above embodiment of the invention may also have the following additional technical features:

Further, in one embodiment of the invention, the data preview module is further configured to display the basic information in table form and in bar chart form, wherein the bar chart reflects the number of records in each data table and the table form shows the detailed basic information of the data tables.

Further, in one embodiment of the invention, the data preview module also provides change history information and output information based on the current data source.

Further, in one embodiment of the invention, the multi-source data fusion module is further configured to aggregate and fuse different data tables of the same data source according to any basic attribute; and/or to fuse different data tables of different data sources according to any basic attribute.

Further, in one embodiment of the invention, the multi-source data fusion module performs the fusion through SQL statements, based on the data obtained after processing by the data quality management module.

Additional aspects and advantages of the invention will be set forth in part in the following description, and in part will become apparent from the description or be learned through practice of the invention.

Brief description of the drawings

The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:

Fig. 1 is a structural schematic diagram of the Hadoop-based big data governance device according to one embodiment of the invention;

Fig. 2 is a structural schematic diagram of the Hadoop-based big data governance device according to a specific embodiment of the invention.

Detailed description of the embodiments

Embodiments of the invention are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the invention and should not be construed as limiting it.

The Hadoop-based big data governance device proposed according to embodiments of the invention is described below with reference to the accompanying drawings.

Fig. 1 is a structural schematic diagram of the Hadoop-based big data governance device of one embodiment of the invention.

As shown in Fig. 1, the Hadoop-based big data governance device 10 includes: a data governance information management module 100, a data source selection module 200, a data preview module 300, a metadata management module 400, a data quality management module 500 and a multi-source data fusion module 600.

The data governance information management module 100 maintains the data governance operation information of each data source and provides a copy function for governance operations. The data source selection module 200 performs governance operations on data imported into the big data platform and supports governance operations on the MySQL data source type and the Hive data source type of structured databases. The data preview module 300 displays the basic information of each data table from the perspective of a structured database. The metadata management module 400 presents the metadata information of the data tables to the user in multiple dimensions. The data quality management module 500 checks the specific missing information of each field in a data table and sets corresponding fill rules to fill in the missing information. The multi-source data fusion module 600 re-fuses and aggregates multiple data tables from multiple data sources to obtain a new data table, which is then analyzed further. The device 10 of the embodiment of the invention implements these functional modules with big data components, helps users understand the real meaning of the data from multiple angles, and provides a highly reliable data foundation for subsequent analysis and queries.

It can be understood that, as shown in Fig. 2, the device 10 of the embodiment of the invention includes the data governance information management module 100, the data source selection module 200, the data preview module 300, the metadata management module 400, the data quality management module 500 and the multi-source data fusion module 600. Each module performs a different data governance operation, helping users better understand the business data while preparing high-quality data for subsequent analysis. The device 10 addresses many of the data problems that enterprises encounter in data mining and data analysis scenarios. Data governance information management is the unified management module of the device 10 and manages the basic information of data governance operations. When a user adds a new data governance operation, the modules are executed in the order data source selection, data preview, metadata management, data quality management and multi-source data fusion, completing a full data governance workflow for a given data source. The specific function of each module is described in detail below.

In one embodiment of the invention, the data governance information management module 100 handles data governance information management, which means maintaining the data governance operation information of each data source so that the user can iterate on governance operations already performed. The module 100 also provides a copy function for governance operations: a governance operation that has been agreed upon, or whose processing flow is mature, can be copied, which reduces repetitive work for the user.

Specifically, the data governance information management module 100 is the unified management module of the whole device 10. The data governance operation for each data source forms one unique record in the data source basic information table (data_source_info) and is displayed on the home page of the device platform. The user can either continue governance on a data source that already has a governance operation, or choose "new data governance" to operate on a data source that has not yet undergone any data governance. The module 100 also gathers statistics on each data source and summarizes, for the platform as a whole, the operation information of the data sources, specifically:

(1) Total count: the number of data sources for which a data governance operation is in progress or has been completed on the platform. This figure is obtained by counting the records in the data source basic information table (data_source_info);

(2) Table storage usage: the names of the ten data tables imported into the big data platform that occupy the most storage. This information is obtained by sorting on the Storage_num (physical storage space occupied) field in the data table basic information table (table_basic_info);

(3) Popular tables: the ten data tables that have been operated on most often on the platform. This information is obtained by sorting on the Table_id (data table ID) field in the data change history table (data_modify_info).
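The three statistics above can be sketched as plain SQL queries. The following is a minimal illustration only, not the patented implementation: it uses Python's built-in sqlite3 module in place of MySQL so that it runs without a server, and every column other than Storage_num and Table_id (for example `name` and `Table_name`) is a hypothetical name chosen for the example. It also assumes that "sorting on Table_id" means counting change records per table.

```python
import sqlite3

# In-memory stand-in for the platform's management database (assumption:
# the real device stores these tables in MySQL).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_source_info (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE table_basic_info (Table_id INTEGER PRIMARY KEY,
                               Table_name TEXT, Storage_num INTEGER);
CREATE TABLE data_modify_info (id INTEGER PRIMARY KEY, Table_id INTEGER);
INSERT INTO data_source_info (name) VALUES ('crm'), ('billing');
INSERT INTO table_basic_info VALUES (1, 'users', 500), (2, 'orders', 900),
                                    (3, 'logs', 100);
INSERT INTO data_modify_info (Table_id) VALUES (2), (2), (1), (2), (3), (1);
""")

# (1) Total count: one record per governed data source.
total_sources = conn.execute(
    "SELECT COUNT(*) FROM data_source_info").fetchone()[0]

# (2) Table storage usage: ten largest tables by Storage_num.
largest_tables = conn.execute(
    "SELECT Table_name, Storage_num FROM table_basic_info "
    "ORDER BY Storage_num DESC LIMIT 10").fetchall()

# (3) Popular tables: ten tables with the most change-history records.
popular_tables = conn.execute(
    "SELECT Table_id, COUNT(*) AS ops FROM data_modify_info "
    "GROUP BY Table_id ORDER BY ops DESC, Table_id LIMIT 10").fetchall()

print(total_sources, largest_tables, popular_tables)
```

The same three statements should carry over to MySQL with little or no change, since they use only COUNT, GROUP BY, ORDER BY and LIMIT.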

Further, the data source selection module 200 performs governance operations on data that has already been imported into the big data platform, and supports governance operations on two data source types of structured databases, MySQL and Hive. Specifically, the embodiment of the invention supports data governance operations for the two data source types MySQL and Hive. In the data source selection module 200, the user therefore first selects the data source type that needs data governance, and the device 10 reads all the data table names under the current data source from the data table basic information table (table_basic_info) so that the user can select the name of the data source that needs data governance.

Further, in one embodiment of the invention, the data preview module 300 is further configured to display the basic information in table form and in bar chart form, wherein the bar chart reflects the number of records in each data table and the table form shows the detailed basic information of the data tables.

In one embodiment of the invention, the data preview module 300 also provides change history information and output information based on the current data source.

It can be understood that data preview shows the user the basic information of each data table from the perspective of the database. The data preview module 300 provides two display forms, table form and bar chart form. The bar chart reflects how many records each data table has, and the table form shows the detailed basic information of the data tables. The device 10 also provides the user with change history information and output information based on the current data source, so that the user can better understand how the data source is used.

Specifically, the data preview module 300 helps the user understand the business data as a whole, including the total number of records and the output information of each data table under the currently selected data source, the change history information of the data source, and so on. It specifically includes:

(1) The total number of records of a data table, obtained from the Row_Num (number of fields owned) field in the data table basic information table (table_basic_info) and displayed on the front end as a bar chart, so that the number of records in each data table can be viewed more intuitively.

(2) The output information of a data table, aggregated from the insertion time of each record in the original data table and likewise presented as a bar chart, which makes it easy to see the operations on a given data table over a period of time.

(3) The change history information of the data, obtained from the data change history table (data_modify_info).

(4) The first ten records of a data table, obtained by querying the first ten records of the original data table directly through MySQL or HiveQL.
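Items (2) and (4) above amount to a grouped count over insertion times and a limited select. A minimal sketch follows, again using Python's built-in sqlite3 as a stand-in for MySQL or Hive; the table `orders`, the column `insert_time` and the helper names are hypothetical, since the source does not name the insertion-time column.

```python
import sqlite3

def _check_identifier(name):
    # Table and column names cannot be bound as SQL parameters,
    # so reject anything that is not a plain identifier.
    if not name.isidentifier():
        raise ValueError(f"unsafe identifier: {name!r}")
    return name

def records_per_day(conn, table, time_column):
    """Output information: number of records inserted per calendar day."""
    table = _check_identifier(table)
    col = _check_identifier(time_column)
    sql = (f"SELECT date({col}) AS day, COUNT(*) AS records "
           f"FROM {table} GROUP BY date({col}) ORDER BY 1")
    return conn.execute(sql).fetchall()

def first_rows(conn, table, limit=10):
    """Preview: the first `limit` records of the original table."""
    table = _check_identifier(table)
    return conn.execute(f"SELECT * FROM {table} LIMIT ?", (limit,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, insert_time TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [
    (1, "2018-08-01 09:00:00"),
    (2, "2018-08-01 17:30:00"),
    (3, "2018-08-02 11:15:00"),
])

daily_output = records_per_day(conn, "orders", "insert_time")
preview_rows = first_rows(conn, "orders")
print(daily_output)
print(preview_rows)
```

The pairs in `daily_output` are the kind of series a front end could plot as the bar chart described in item (2).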

Further, the metadata management module 400 presents to the user, in multiple dimensions, the detailed metadata information of a specific data table (field information, partition information, index information and so on). This lets the user understand the meaning of the business data more clearly and makes it convenient for the user to obtain, intelligently and as needed, the metadata information in the business data environment. The metadata management module 400 also provides several display forms that help the user quickly understand the actual state of the data.

Specifically, an important part of data governance is how to present to the user, in visual form, the data information of hundreds or thousands of data tables in a database or across different databases. The main function of the metadata management module 400 is to help the user quickly understand the specifics of the business data and the meaning of the metadata.

For a Hive database, the embodiment of the invention sets the corresponding configuration file parameters so that the metadata of the database is stored in a designated MySQL database (Hive). This includes the metadata tables related to Hive databases (DBS, DATABASE_PARAMS), the metadata tables related to Hive tables and views (TBLS, TABLE_PARAMS, TBL_PRIVS), the metadata tables related to Hive file storage information (SDS, SD_PARAMS, SERDES, SERDE_PARAMS), and so on. For a MySQL database, the metadata information is stored in the information_schema database, which contains the basic data information (TABLES, COLUMNS, VIEWS), the partition information (PARTITIONS), and so on. For both database types, the information is obtained from Java by connecting to the database through JDBC. In addition, the module 400 provides several display forms (table form, data cloud graph, lineage management):
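As an illustration of how field-level metadata might be read from the two stores described above, the sketch below only builds the query text and its parameters; it does not open a connection. The source states that the device uses Java with JDBC, so this Python version is a substitute chosen for brevity. The function name is hypothetical, the `%s` placeholders follow the convention of common Python MySQL drivers, and the COLUMNS_V2 metastore table is not mentioned in the source: it is assumed here from commonly deployed Hive metastore schemas and may differ between Hive versions.

```python
def column_metadata_query(source_type, database, table):
    """Return (sql, params) for listing a table's columns and types.

    source_type: "mysql" reads information_schema; "hive" reads the
    Hive metastore tables kept in MySQL (assumed schema, see lead-in).
    """
    if source_type == "mysql":
        sql = (
            "SELECT COLUMN_NAME, COLUMN_TYPE, COLUMN_COMMENT "
            "FROM information_schema.COLUMNS "
            "WHERE TABLE_SCHEMA = %s AND TABLE_NAME = %s "
            "ORDER BY ORDINAL_POSITION"
        )
    elif source_type == "hive":
        # DBS -> TBLS -> SDS are named in the description; COLUMNS_V2 is
        # an assumption about where the metastore keeps column definitions.
        sql = (
            "SELECT c.COLUMN_NAME, c.TYPE_NAME, c.COMMENT "
            "FROM DBS d "
            "JOIN TBLS t ON t.DB_ID = d.DB_ID "
            "JOIN SDS s ON s.SD_ID = t.SD_ID "
            "JOIN COLUMNS_V2 c ON c.CD_ID = s.CD_ID "
            "WHERE d.NAME = %s AND t.TBL_NAME = %s "
            "ORDER BY c.INTEGER_IDX"
        )
    else:
        raise ValueError(f"unsupported data source type: {source_type!r}")
    return sql, (database, table)

mysql_sql, mysql_params = column_metadata_query("mysql", "sales", "orders")
hive_sql, hive_params = column_metadata_query("hive", "dw", "user_orders")
print(mysql_sql)
print(hive_sql)
```

Passing the database and table names as parameters rather than formatting them into the string keeps the sketch safe against injection when it is handed to a driver's `execute` call.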

(1) details for the tables of data metadata information that form is shown, user can according to need modification data Information, the modified information such as the SQL type of storage or column description can be in metadata fields information table (table_field_ Info) respective field is updated, the respective field in original data source can be also updated.Meanwhile this operation is related Record information can be also inserted into data change history lists (data_modify_info).

(2) it for data cloud atlas, is realized by knowledge mapping technology.In knowledge mapping, each node indicates real generation " entity " present in boundary, " relationship " of each edge between entity and entity.The presentation mode of map is used for reference, it will in this system Ready-portioned theme and tables of data as node, incidence relation between tables of data as side be stored in chart database (such as Neo4j in), user can more intuitively understand the relationship between tables of data by browsing the figure.

(3) for blood relationship management, the mainly parsing of source table and object table, for example, source table passes through table Naming conventions It is parsed, object table mainly passes through the sentences solution such as " insert into table " and " insert overwrite table " Analysis, finally obtains the relationship between table and table.
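A rough sketch of the lineage parsing in item (3) is shown below. The target-table pattern follows the two statement forms named in the description. For source tables the description relies on table naming conventions that it does not specify, so this sketch swaps in a simpler assumption: any name following FROM or JOIN is treated as a source table. Both regular expressions and the function name are illustrative only and do not handle subqueries, comments or quoted identifiers.

```python
import re

# Target tables: "insert into table X" / "insert overwrite table X".
TARGET_RE = re.compile(
    r"insert\s+(?:into|overwrite)\s+table\s+([\w.]+)", re.IGNORECASE)

# Source tables (assumption): names that follow FROM or JOIN.
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)

def parse_lineage(statement):
    """Return (source_table, target_table) edges found in one statement."""
    targets = TARGET_RE.findall(statement)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    sources = list(dict.fromkeys(SOURCE_RE.findall(statement)))
    return [(src, tgt) for tgt in targets for src in sources if src != tgt]

hiveql = """
INSERT OVERWRITE TABLE dw.user_orders
SELECT u.id, o.amount
FROM ods.users u
JOIN ods.orders o ON u.id = o.user_id
"""
edges = parse_lineage(hiveql)
print(edges)
```

Each resulting pair could be stored as an edge between two table nodes, in the same way the data cloud graph stores associations between data tables.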

It should be noted that the database design is shown in Table 1, which is the database structure table.

Table 1

Further, because of how data entry is configured in business systems, business data may contain a large number of missing or incorrectly filled values, which makes the missing rate of the data excessively high. Through the data quality management module 500, the user checks the specific missing status of each field in a data table and sets corresponding fill rules to complete the filling.

Specifically, the main function of the data quality management module 500 is to query the number of missing values in each data table and calculate the missing rate. This can be done by calling pandas library functions in Python; the results are stored in the data table missing status information table (data_missing_info) and shown to the user. For a field with a high missing rate, the user can choose and configure a fill rule. The system provides filling by median, filling by mode, and custom fill rules. The first two fill modes first compute the median or the mode of the values originally present in the field and fill with that value; custom filling uses content defined by the user, such as "-1". According to the fill rule selected by the user, the specific field is filled through pandas library functions, the missing rate is then recalculated, and the missing information of the corresponding field in the data table missing status information table is updated.
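The missing-rate calculation and the three fill rules can be sketched as follows. The description uses pandas for this step; to keep the example dependency-free, this sketch substitutes Python's standard `statistics` module and represents a missing value as `None`. The function names and the sample column are hypothetical.

```python
from statistics import median, mode

def missing_rate(values):
    """Fraction of entries that are missing (None)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)

def fill_missing(values, rule, custom=None):
    """Fill None entries by 'median', 'mode', or a user-defined 'custom' value."""
    present = [v for v in values if v is not None]
    if rule == "median":
        fill = median(present)   # median of the values originally present
    elif rule == "mode":
        fill = mode(present)     # most frequent value originally present
    elif rule == "custom":
        fill = custom            # e.g. -1, as in the description
    else:
        raise ValueError(f"unknown fill rule: {rule!r}")
    return [fill if v is None else v for v in values]

ages = [20, None, 30, 30, None, 40]
rate_before = missing_rate(ages)
filled_by_mode = fill_missing(ages, "mode")
filled_custom = fill_missing(ages, "custom", custom=-1)
rate_after = missing_rate(fill_missing(ages, "median"))
print(rate_before, filled_by_mode, filled_custom, rate_after)
```

With pandas, the corresponding operations would typically be expressed with `isna().mean()` for the missing rate and `fillna()` for the fill step. Note that the median and mode rules need at least one present value in the field; a column that is entirely missing would have to be handled with a custom value.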

Further, in one embodiment of the invention, the multi-source data fusion module 600 is further configured to aggregate and fuse different data tables of the same data source according to any basic attribute; and/or to fuse different data tables of different data sources according to any basic attribute.

It can be understood that, in real big data analysis and mining scenarios, analysis may not be limited to the existing data tables of a single data source. Multiple data tables from multiple data sources may need to be re-fused and aggregated, and the resulting new data table is then analyzed. The multi-source data fusion module 600 in the device 10 of the embodiment of the invention is therefore intended to solve the problem of matching data across different data sources.

There are two main modes of multi-source data fusion: single-database fusion and multi-database fusion. Single-database fusion aggregates and fuses different data tables of the same data source according to a basic attribute (for example, number information, region information or job category information). Multi-database fusion fuses different data tables of different data sources according to a basic attribute. The module 600 performs the fusion through SQL statements, based on the data obtained after processing by the data quality management module 500.

Specifically, the multi-source data fusion module 600 is a workflow-style operation module. Its design and implementation flow are as follows:

(1) Select the fusion type (single-database fusion or multi-database fusion). If "single-database fusion" is selected, the user selects the names of the data tables to be fused. If "multi-database fusion" is selected, the system reads from the data source basic information table (data_source_info) the names of the other data sources of the same type as the current data source, the user selects the data sources to fuse, and then, as in "single-database fusion", selects the names of the data tables to be fused.

(2) Configure the fusion rules. Each fusion rule is configured on two of the selected data tables. The system obtains all attribute names of the selected data tables from the metadata field information table (table_field_info), and the user selects the field names that the two data tables need in the fusion rule. A fusion rule takes the form "Table_A.Column_A=Table_B.Column_B".

(3) Configure advanced rules, that is, whether the fused data table is allowed to retain duplicate records or duplicate fields.

(4) Configure storage: set the name and storage location of the fused table. By default it is stored under the current data source. The system also displays the names of the other data sources of the same type as the current data source for the user to choose from.

(5) Complete the fusion. The fusion rules configured in steps (1)-(4) are converted into the corresponding MySQL or HiveQL statements, which operate on the data tables in the corresponding data sources; the fusion produces a new data table that is stored under the specified data source. The basic information about this fusion process is also stored in the data fusion basic information table (data_fusion_info) and shown on the home page of the multi-source data fusion module 600, helping the user understand the data fusion operations already completed on the current data source.
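The conversion in step (5) from a configured rule to an SQL statement might look like the sketch below. It is an assumption-laden illustration, not the device's actual generator: it handles a single rule of the form given in step (2), interprets fusion as an inner join, maps the "retain duplicate records" option to SELECT DISTINCT, and runs on Python's built-in sqlite3 rather than MySQL or Hive. The function name and the sample tables are hypothetical.

```python
import re
import sqlite3

# One fusion rule: "Table_A.Column_A=Table_B.Column_B".
RULE_RE = re.compile(r"^(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)$")

def build_fusion_sql(rule, target, keep_duplicates=True):
    """Translate one fusion rule into a CREATE TABLE ... AS SELECT statement."""
    match = RULE_RE.match(rule.strip())
    if match is None:
        raise ValueError(f"malformed fusion rule: {rule!r}")
    if not target.isidentifier():
        raise ValueError(f"unsafe target table name: {target!r}")
    left, left_col, right, right_col = match.groups()
    distinct = "" if keep_duplicates else "DISTINCT "
    return (f"CREATE TABLE {target} AS "
            f"SELECT {distinct}{left}.*, {right}.* "
            f"FROM {left} JOIN {right} "
            f"ON {left}.{left_col} = {right}.{right_col}")

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (user_id INTEGER, name TEXT);
CREATE TABLE orders (order_id INTEGER, buyer_id INTEGER, amount REAL);
INSERT INTO users VALUES (1, 'Ann'), (2, 'Bo');
INSERT INTO orders VALUES (10, 1, 5.0), (11, 1, 7.5), (12, 3, 1.0);
""")

fusion_sql = build_fusion_sql("users.user_id=orders.buyer_id", "user_orders")
conn.execute(fusion_sql)
fused_rows = conn.execute(
    "SELECT name, order_id, amount FROM user_orders ORDER BY order_id"
).fetchall()
print(fusion_sql)
print(fused_rows)
```

The sample tables deliberately have no column names in common. When two tables share a column name, selecting `*` from both would produce duplicate fields, which is the case the advanced rule in step (3) is meant to decide; a fuller generator would list and alias columns explicitly.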

In summary, the device of the embodiment of the invention addresses the problem that existing big data governance systems essentially cover only metadata management and do not operate on the big data resources themselves, for example through missing-value filling or multi-source data fusion, to provide adequate data preparation for subsequent analysis and mining. On top of metadata management, it adds modules such as data preview, data quality management and multi-source data fusion. The device is suitable for a variety of data governance scenarios, for example: helping users quickly understand business metadata information; filling missing values in business data to obtain high-quality data; and joining data tables from multiple data sources to complete the E-R relationships of the data tables.

In addition, because processing massive data in a timely manner in a big data scenario places demands on the processing environment, the device of the embodiment of the invention uses Hadoop ecosystem components to carry out each step of big data governance. The precondition for use is that the data has been imported into a structured database (MySQL) or into Hive in the Hadoop ecosystem. Based on the imported data, the user can learn the basic state of the imported data (including the data tables and the metadata information of the data in them) and, on that basis, complete data quality management and multi-source data fusion, preparing the data adequately for subsequent analysis. From the import of user data to the output of the governed result data, the embodiment of the invention goes through five steps: data source selection, data preview, metadata management, data quality management and multi-source data fusion. In addition, so that the user can iterate on governance operations already performed, the device also includes a data governance information management module that manages the data governance operation information of each data source.

The Hadoop-based big data governance device proposed according to the embodiment of the invention uses big data components to implement functional modules such as data preview, metadata management, multi-source data fusion and data quality management. It helps users understand the real meaning of the data from multiple angles and provides a highly reliable data foundation for subsequent analysis and queries. At the same time, complex operations are hidden in the back end and a clickable interface is exposed, so that users without professional big data skills can also complete data governance operations. This demonstrates the practicality of the device, effectively improves the applicability and practicality of big data governance, and is simple and easy to implement.

In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature defined with "first" or "second" may therefore explicitly or implicitly include at least one such feature. In the description of the invention, "a plurality of" means at least two, for example two or three, unless specifically defined otherwise.

In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "example", "specific example" or "some examples" means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, provided they do not conflict with each other, those skilled in the art may combine the different embodiments or examples described in this specification and the features of the different embodiments or examples.

Although embodiments of the invention have been shown and described above, it is to be understood that the above embodiments are exemplary and are not to be construed as limiting the invention. Those of ordinary skill in the art may change, modify, replace and vary the above embodiments within the scope of the invention.

Claims

1. A Hadoop-based big data governance device, characterized by comprising:

a data governance information management module for maintaining the data governance operation information of each data source and providing a copy function for governance operations;

a data source selection module for performing governance operations on data imported into the big data platform and supporting governance operations on the MySQL data source type and the Hive data source type of structured databases;

a data preview module for displaying the basic information of each data table from the perspective of the structured database;

a metadata management module for presenting the metadata information of the data tables to the user in multiple dimensions;

a data quality management module for checking the specific missing information of each field in the data table and setting corresponding fill rules to fill in the missing information; and

a multi-source data fusion module for re-fusing and aggregating multiple data tables from multiple data sources to obtain a new data table, which is then analyzed further.

2. The Hadoop-based big data governance device according to claim 1, characterized in that the data preview module is further configured to display the basic information in table form and in bar chart form, wherein the bar chart reflects the number of records in each data table and the table form shows the detailed basic information of the data tables.

3. The Hadoop-based big data governance device according to claim 1 or 2, characterized in that the data preview module also provides change history information and output information based on the current data source.

4. The Hadoop-based big data governance device according to claim 1, characterized in that the multi-source data fusion module is further configured to aggregate and fuse different data tables of the same data source according to any basic attribute; and/or to fuse different data tables of different data sources according to any basic attribute.

5. The Hadoop-based big data governance device according to claim 4, characterized in that the multi-source data fusion module performs the fusion through SQL statements, based on the data obtained after processing by the data quality management module.