CN108009290A - Data modeling and storage method for line-network big data of a rail transit command center

Info

Publication number: CN108009290A
Application number: CN201711426597.6A
Authority: CN
Inventors: 陈莉莉; 张赛桥; 胡波; 狄颖琪; 张振山
Original assignee: Nari Technology Co Ltd; NARI Nanjing Control System Co Ltd
Current assignee: Nari Rail Transit Technology Co ltd; Nari Technology Co Ltd
Priority date: 2017-12-25
Filing date: 2017-12-25
Publication date: 2018-05-08
Anticipated expiration: 2037-12-25
Also published as: CN108009290B

Abstract

The invention discloses a data modeling and storage method for line-network big data of a rail transit command center. During structured data modeling, a suitable data model is selected for the structured data in the line-network big data by combining the characteristics of the Hadoop platform and its components with the data application scenarios of the rail transit industry, and data modeling and storage are carried out accordingly. The logical modeling method in particular is elaborated: temporary data files are stored in HBase; historical data is stored in HBase after unified modeling; and for the intermediate data layer and the data mart layer, tables are built by denormalization and stored in Hive. For unstructured data, small files are stored on the Hadoop platform classified by application and time, enabling full-text retrieval and analysis of the unstructured data. Standardized and efficient storage of line-network big data is thereby achieved.

Description

Data modeling and storage method for line-network big data of a rail transit command center

Technical field

The present invention relates to a data modeling and storage method for line-network big data of a rail transit command center, and belongs to the technical field of rail transit monitoring systems.

Background art

With the acceleration of urban rail transit construction, the metro lines of each city are gradually developing into networks. The types of rail transit line-network data keep growing, as does the data volume, and massive amounts of data are aggregated at the rail transit command center. In the rail transit field, how to effectively collect, organize, store, and even process and analyze this structured and unstructured data, carry out in-depth data mining and analysis, and extract the valuable information in it, so as to raise the operating level of rail transit, improve scientific decision-making, increase efficiency and reduce cost, and strengthen information services and safety assurance, has become a focus of the industry.

At present, there are very few existing data warehouse construction cases for rail transit command centers, and all of them use a data warehouse with an MPP DB architecture to store structured data. As data volumes and data types increase, however, the weaknesses of MPP DB are exposed: it is expensive, hard to scale, cannot store unstructured data, and cannot perform stream processing.

Hadoop is a relatively inexpensive and capable framework. A Hadoop platform used for data storage has an architecture and a data processing approach different from those of MPP DB: its structure is more flexible, it stores larger volumes of data, it supports high concurrency and real-time processing, and it is easy to scale. However, data stored in Hadoop is not indexed and its data blocks are much larger than those of MPP DB, so exact-match queries, table queries, and joins across tables are slower than in MPP DB. The normalization-based data modeling methods conventionally used with MPP DB cannot be copied directly to the Hadoop platform. Therefore, in view of the characteristics of the Hadoop platform, suitable components are selected and, for the structured data in rail transit line-network big data, a new data modeling method is designed that combines dimensional analysis with normalized modeling, so as to avoid the shortcomings of the Hadoop platform and make the most of its advantages, achieving safe and efficient storage of and access to line-network big data.

Unstructured data at conventional rail transit command centers, such as video, images, voice, log files, and crawled web pages, is stored on disk arrays, which provide only storage and backup; full-text retrieval of the unstructured data, and hence further analysis, is not possible. Meanwhile, demand in the rail transit industry for analyzing and mining unstructured data is growing, and content retrieval and processing of unstructured data are required. This also calls for introducing the Hadoop platform to replace the disk-array storage model.

Summary of the invention

Purpose: To overcome the deficiencies of the prior art, the present invention provides a data modeling and storage method for line-network big data of a rail transit command center.

Technical solution: To solve the above technical problem, the present invention adopts the following technical solution:

A data modeling and storage method for line-network big data of a rail transit command center, in which data is stored using a Hadoop platform, comprising the following steps:

Step 1: Structured data in the metro line-network big data, including the time series data of each subsystem collected on each line, is aggregated to the big data platform of the line-network command center; the platform uses the Hadoop framework;

Step 2: The structured data is modeled, including conceptual modeling, logical modeling, and physical modeling;

Step 3: For the large amount of unstructured data at the rail transit command center, the unstructured data is compressed and then stored in HBase, and a table is built in HBase to store the associated metadata.

Preferably, the conceptual modeling: Conceptual modeling is carried out according to the applications of the rail transit line-network big data and the actual condition of the data, associating the subject elements of rail transit with one another. The associated subject elements are as follows: application-subsystem, line network-line-station, passenger information-passenger flow-satisfaction, signal-train operation-train maintenance-driver, equipment-point-equipment running status record, linkage-sequence-alarm-event.

Preferably, logical modeling is carried out for each data area of the data warehouse, comprising the following steps:

Step 2.2.1: Temporary data area logical modeling;

Step 2.2.2: Historical data area logical modeling;

Step 2.2.3: Data mart logical modeling.

Preferably, the temporary data area logical modeling: The data of each line sent up by the different line integrators is stored in the temporary data area in HBase in units of less than 1M, and is retained for no more than 1 month.

Preferably, the historical data area logical modeling comprises the following steps:

Step 2.2.2.1: The time series data in the line-network big data falls into two classes: time series data of equipment point changes, and time series data of passenger flow;

Step 2.2.2.2: For the time series data of equipment point changes, the HBase index RowKey is composed of the network-wide unique index of the input data point and the change time of that data point. The network-wide unique index is modeled with a character string as the key; the index is organized as shown in Table 1:

Line | Station | Application | Equipment | Point type | Point | Change time

Table 1;

Step 2.2.2.3: For the time series data of passenger flow, the index is organized as shown in Table 2:

Card number | Entry/exit time

Table 2.

Preferably, the data mart logical modeling comprises the following steps:

Step 2.2.3.1: The data mart is designed according to the specific data applications of the rail transit command center, and the data mart tables are stored in Hive. The data model of the Hive tables is designed according to the characteristics of Hadoop itself, using the dimensional analysis method, with the data denormalized. By aggregation, different attributes are gathered into one entity for storage; that is, wide tables with many attribute columns are used, and attributes related in content and function are kept in the same table;

Step 2.2.3.2: For ISCS data, the table-building approach used within a line, centered on the points that characterize device attributes, is changed to a device-centered table-building approach for the line network; that is, for each device, all output characterization points required by the applications and the attributes of the device are concentrated in one table. The device-centered table in the line network contains n output characterization points, as shown in Table 3:

Table 3;

Step 2.2.3.3: For the annual passenger flow analysis table: the table is a data cube with three dimensions, namely month, station, and passenger flow attribute information; the three dimensions are combined into Table 4:

Table 4;

Step 2.2.3.4: For train operation data, the actual and planned train working diagrams are expanded, and train status information, load factor, running time, and dwell time information are combined in Table 5:

Table 5.

Preferably, the physical modeling comprises the following steps:

Step 2.3.1: Input data imported into the Hadoop platform is organized as small files of less than 64M, in Avro format, and stored in the temporary data area in HBase;

Step 2.3.2: For historical data area storage in HBase, multiple Regions are set up, and data is stored into the corresponding Region according to the RowKey design. The RowKey contains the change time, which is random over the long term; this ensures random placement of the data across the Regions, i.e., the data can be distributed evenly to each data node;

Step 2.3.3: In the Hive partition design, partitioning by month is used, i.e., 12 partitions per year. The tables are reviewed: int replaces the string type, timestamp replaces the date type, and the Decimal type replaces the Float or Double type.

Preferably, step 3 comprises the following step:

Step 3.1: Because the number of files is large, the storage partition directories are organized by the source application of the files, divided into five categories: historical archives, emergency command, system logs, reports, and web-crawled content, as shown in Table 6. The files are then classified by specific application. Finally, the latter four categories are archived with every half year as one period. Files are compressed with Snappy when stored, to support further retrieval and content analysis of the unstructured data;

Table 6.

Beneficial effects: In the data modeling and storage method for line-network big data of a rail transit command center provided by the present invention, during structured data modeling, a suitable data model is selected for the structured data in the line-network big data by combining the characteristics of the Hadoop platform and its components with the data application scenarios of the rail transit industry, and data modeling and storage are carried out: temporary data files are stored in HBase; historical data is stored in HBase after unified modeling; and for the intermediate data layer and the data mart layer, tables are built by denormalization and stored in Hive. For unstructured data, small files are stored on the Hadoop platform classified by application and time, enabling full-text retrieval and analysis of the unstructured data.

The method achieves standardized and efficient storage of line-network big data and supports ad hoc retrieval, further data mining, and data analysis, which in turn guide rail transit operations.

Brief description of the drawings

Fig. 1 is a schematic diagram of the data organization and storage structure of the line-network big data.

Detailed description

The present invention is further described below with reference to the accompanying drawing.

The present invention processes the structured data and the unstructured data in the line-network big data separately. As shown in Fig. 1, the structured data is divided into different layers according to the technical characteristics of the Hadoop platform and the business process of data processing; different components are selected stage by stage to organize the storage and structure of the data, and the data modeling is designed accordingly. Unstructured data is stored on the Hadoop platform as small files.

Step 1: Structured data in the metro line-network big data, including the time series data of each subsystem collected on each line, is aggregated to the big data platform of the line-network command center; the platform uses the Hadoop framework.

Step 2: The structured data is modeled, including conceptual modeling, logical modeling, and physical modeling.

Step 3: For the large amount of unstructured data at the rail transit command center: in the rail transit industry these files are characteristically small, each less than 100M. The files are compressed and then stored in HBase, and a table is built in HBase to store the associated metadata.

Step 2.1: The conceptual modeling: Conceptual modeling is carried out according to the applications of the rail transit line-network big data and the actual condition of the data, associating the subject elements of rail transit with one another. The associated subject elements are as follows: application-subsystem, line network-line-station, passenger information-passenger flow-satisfaction, signal-train operation-train maintenance-driver, equipment-point-equipment running status record, linkage-sequence-alarm-event.
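
The association chains in Step 2.1 can be read as a small graph of subject elements. The following is a minimal sketch under that reading; the English labels are a rendering of the translated terms, and `SUBJECT_CHAINS` and `build_associations` are illustrative names that do not come from the patent.

```python
# Association chains as listed in Step 2.1 (English labels are a rendering).
SUBJECT_CHAINS = [
    ["application", "subsystem"],
    ["line network", "line", "station"],
    ["passenger information", "passenger flow", "satisfaction"],
    ["signal", "train operation", "train maintenance", "driver"],
    ["equipment", "point", "equipment running status record"],
    ["linkage", "sequence", "alarm", "event"],
]


def build_associations(chains):
    """Return the set of adjacent (left, right) pairs implied by each chain."""
    links = set()
    for chain in chains:
        # zip pairs every element with its successor in the same chain.
        links.update(zip(chain, chain[1:]))
    return links


ASSOCIATIONS = build_associations(SUBJECT_CHAINS)
```

Treating each chain as a path keeps the conceptual model independent of any storage engine; the later logical and physical steps decide how these links become keys and columns.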

Step 2.2: Logical modeling is carried out for each data area of the data warehouse, comprising the following steps:

Step 2.2.1: Temporary data area logical modeling:

The data of each line sent up by the different line integrators is stored in the temporary data area in HBase in units of less than 1M, and is retained for no more than 1 month.
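
The size and retention rules of the temporary data area can be sketched as follows. This is only an illustration: it assumes that "1M" means 1 MiB and that "1 month" means 30 days, neither of which the text states, and `batch_records` and `is_expired` are hypothetical helper names. The HBase read and write calls are deliberately left out.

```python
from datetime import datetime, timedelta

MAX_BATCH_BYTES = 1024 * 1024        # assumption: "1M" read as 1 MiB
RETENTION = timedelta(days=30)       # assumption: "1 month" read as 30 days


def batch_records(records, limit=MAX_BATCH_BYTES):
    """Group byte records into batches whose total size stays below `limit`."""
    batches, current, size = [], [], 0
    for rec in records:
        if len(rec) >= limit:
            raise ValueError("single record exceeds the batch limit")
        # Close the current batch before it would reach the limit.
        if current and size + len(rec) >= limit:
            batches.append(current)
            current, size = [], 0
        current.append(rec)
        size += len(rec)
    if current:
        batches.append(current)
    return batches


def is_expired(stored_at, now, retention=RETENTION):
    """True once a temporary file has outlived the retention window."""
    return now - stored_at > retention
```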

Step 2.2.2: Historical data area logical modeling, comprising the following steps:

Step 2.2.2.1: The time series data in the line-network big data falls into two classes: time series data of equipment point changes, and time series data of passenger flow.

Step 2.2.2.2: For the time series data of equipment point changes, the HBase index RowKey is composed of the network-wide unique index of the input data point and the change time of that data point. The network-wide unique index is modeled with a character string as the key; the index is organized as shown in Table 1:

Line | Station | Application | Equipment | Point type | Point | Change time

Table 1

The character string formed by combining line, station, application, equipment, point type, and point is the index value of each data point, and the change time is the time at which the data point changed. This hierarchical organization is easy to understand, unifies the index across the whole network, and makes data extension and the connection of new lines convenient. Over the long term the change time is evenly distributed, which makes it easy to spread the data evenly across different data nodes and keeps the load balanced between them. The index analyses of the rail transit command center frequently need to read in bulk the changes of a point over a period of time; this storage format allows multiple changes of one point to be read at once, giving convenient and efficient data access.

Step 2.2.2.3: For the time series data of passenger flow, the index is organized as shown in Table 2:

Card number | Entry/exit time

Table 2
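
The key layouts of Table 1 and Table 2 can be illustrated with plain string construction. The sketch below makes several assumptions that are not in the patent: a `|` delimiter, the change time encoded as zero-padded epoch milliseconds, and the function names `point_rowkey`, `point_scan_range`, and `passenger_rowkey`. It builds keys only and does not call any HBase client.

```python
from datetime import datetime, timezone

SEP = "|"  # assumption: the patent does not name a delimiter


def point_rowkey(line, station, app, equipment, point_type, point, changed_at):
    """Network-wide point index followed by the change time (Table 1 order)."""
    # Zero-padded epoch milliseconds keep lexicographic and time order aligned.
    millis = int(changed_at.timestamp() * 1000)
    fields = [line, station, app, equipment, point_type, point, f"{millis:013d}"]
    return SEP.join(fields)


def point_scan_range(line, station, app, equipment, point_type, point, start, end):
    """Start and stop keys covering one point's changes in [start, end)."""
    args = (line, station, app, equipment, point_type, point)
    return point_rowkey(*args, start), point_rowkey(*args, end)


def passenger_rowkey(card_no, event_at):
    """Card number followed by the entry/exit time (Table 2 order)."""
    return f"{card_no}{SEP}{int(event_at.timestamp() * 1000):013d}"
```

Because every change of one point shares the same prefix, a single range scan between the two keys returned by `point_scan_range` would cover that point's history for the period, which matches the batch-read requirement described for Table 1.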

Step 2.2.3: Data mart logical modeling, comprising the following steps:

Step 2.2.3.1: The data mart is designed according to the specific data applications of the rail transit command center, and the data mart tables are stored in Hive. The data model of the Hive tables is designed according to the characteristics of Hadoop itself, using the dimensional analysis method, with the data denormalized. By aggregation, different attributes are gathered into one entity for storage; that is, wide tables with many attribute columns are used, and attributes related in content and function are kept in the same table. In the line network, the tables of the original line data models, which had fewer than 30 columns, are now merged according to specific requirements into large tables of hundreds of columns with repeated storage, trading increased storage space for effectively shorter read times. For the logical modeling of the data mart, each dimension of the data cube is unfolded and compressed to reduce dimensionality, merging multiple dimensions into one. To facilitate year-over-year, period-over-period, and drill-down analysis, the data model may contain a large number of redundant columns and need not follow normalization principles.

Step 2.2.3.2: For ISCS data, the table-building approach used within a line, centered on the points that characterize device attributes, is changed to a device-centered table-building approach for the line network; that is, for each device, all output characterization points required by the applications and the attributes of the device are concentrated in one table. The device-centered table in the line network contains n output characterization points, as shown in Table 3:

Table 3
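
The change from point-centered rows to one wide row per device can be sketched as a pivot. The column layout of Table 3 is not reproduced in the text, so the record shape below is an assumption, and `pivot_to_device_rows` is a hypothetical name.

```python
from collections import defaultdict


def pivot_to_device_rows(point_rows, device_attrs):
    """Turn point-centered rows into one wide row per device.

    point_rows: iterable of (device_id, point_name, value)
    device_attrs: dict of device_id -> dict of static device attributes
    """
    wide = defaultdict(dict)
    for device_id, point_name, value in point_rows:
        wide[device_id][point_name] = value
    # Merge static attributes and output points into a single record.
    return {
        dev: {"device_id": dev, **device_attrs.get(dev, {}), **points}
        for dev, points in wide.items()
    }
```

With this shape, a query about one device reads a single row rather than joining many point rows, which is the trade of storage for read time that the denormalized design describes.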

Step 2.2.3.3: For the annual passenger flow analysis table: the table is a data cube with three dimensions, namely month, station, and passenger flow attribute information; the three dimensions are combined into Table 4.

Table 4
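
One way to picture combining the three cube dimensions into a single table is to keep month and station as row identifiers and spread the passenger flow attributes across columns. The actual layout of Table 4 is not shown in the text, so this is an assumed layout, and `flatten_cube` is an illustrative name.

```python
def flatten_cube(cube):
    """Flatten {month: {station: {attribute: value}}} into wide rows.

    Each output row carries month and station plus one column per attribute.
    """
    rows = []
    for month in sorted(cube):
        for station in sorted(cube[month]):
            rows.append({"month": month, "station": station, **cube[month][station]})
    return rows
```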

Step 2.2.3.4: For train operation data, the actual and planned train working diagrams are expanded, and train status information, load factor, running time, and dwell time information are combined in Table 5.

Table 5

Step 2.3: The physical modeling comprises the following steps:

Step 2.3.1: Input data imported into the Hadoop platform is organized as small files of less than 64M, in Avro format, and stored in the temporary data area in HBase.

Step 2.3.2: For historical data area storage in HBase, multiple Regions are set up, and data is stored into the corresponding Region according to the RowKey design. The RowKey contains the change time, which is random over the long term; this ensures random placement of the data across the Regions, i.e., the data can be distributed evenly to each data node.
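
How row keys map to Regions can be modeled with sorted split keys, where each Region holds a half-open key range. The sketch below only simulates that mapping in memory to check how evenly a set of keys would spread; the split keys are hypothetical, and `region_for` and `region_histogram` are illustrative names rather than HBase API calls.

```python
from bisect import bisect_right
from collections import Counter


def region_for(rowkey, split_keys):
    """Index of the Region whose key range contains `rowkey`.

    split_keys must be sorted; Region i holds keys in
    [split_keys[i-1], split_keys[i]), with open ends for the first and last.
    """
    return bisect_right(split_keys, rowkey)


def region_histogram(rowkeys, split_keys):
    """Count how many row keys land in each Region."""
    return Counter(region_for(k, split_keys) for k in rowkeys)
```

A histogram with roughly equal counts per Region would indicate the balanced placement that Step 2.3.2 aims for; a skewed one would suggest revisiting the split keys.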

Step 2.3.3: The Hive partition design is optimized for Hive efficiency: partitioning by month is used, i.e., 12 partitions per year. The tables from the physical modeling stage are reviewed: int replaces the string type, timestamp replaces the date type, and the Decimal type replaces the Float or Double type.
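
The partitioning and type substitutions of Step 2.3.3 can be illustrated on a single record. This sketch assumes the partition value is encoded as an integer `YYYYMM`, which gives 12 partitions per year; the column names and the helpers `month_partition` and `to_hive_row` are hypothetical, and no Hive connection is involved.

```python
from datetime import datetime
from decimal import Decimal


def month_partition(ts):
    """Partition value for a record, encoded as YYYYMM (assumed encoding)."""
    return ts.year * 100 + ts.month


def to_hive_row(station_id, ts, load_factor):
    """Apply the type substitutions: int key, timestamp, exact decimal."""
    return {
        "station_id": int(station_id),             # int instead of string
        "event_time": ts,                          # timestamp instead of date
        "load_factor": Decimal(str(load_factor)),  # Decimal instead of float/double
        "month": month_partition(ts),
    }
```

Decimal values add and compare exactly, which avoids the rounding surprises of binary floating point when metrics are summed or compared across partitions.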

Step 3 comprises the following step:

Step 3.1: Because the number of files is large, the storage partition directories are organized by the source application of the files, divided into five categories: historical archives, emergency command, system logs, reports, and web-crawled content, as shown in Table 6. The files are then classified by specific application. Finally, the latter four categories are archived with every half year as one period. Files are compressed with Snappy when stored, to support further retrieval and content analysis of the unstructured data.

Table 6.
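
The directory organization of Step 3.1 can be sketched as a path-building function. The directory names, the `2018H1` period label, and the helpers `half_year_period` and `storage_path` are assumptions made for illustration, since Table 6 is not reproduced in the text. Snappy compression is omitted so the sketch needs only the standard library.

```python
from datetime import date

CATEGORIES = (
    "historical_archives",
    "emergency_command",
    "system_log",
    "report",
    "web_crawl",
)


def half_year_period(d):
    """Label such as '2018H1' for January-June and '2018H2' for July-December."""
    return f"{d.year}H{1 if d.month <= 6 else 2}"


def storage_path(category, application, d):
    """Directory for a file: source category, then application, then period."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    if category == "historical_archives":
        # Only the latter four categories are archived by half-year period.
        return f"/{category}/{application}"
    return f"/{category}/{application}/{half_year_period(d)}"
```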

Claims

1. A data modeling and storage method for line-network big data of a rail transit command center, in which data is stored using a Hadoop platform, characterized in that the method comprises the following steps:

Step 1: Structured data in the metro line-network big data, including the time series data of each subsystem collected on each line, is aggregated to the big data platform of the line-network command center; the platform uses the Hadoop framework;

Step 2: The structured data is modeled, including conceptual modeling, logical modeling, and physical modeling;

Step 3: For the large amount of unstructured data at the rail transit command center, the unstructured data is compressed and then stored in HBase, and a table is built in HBase to store the associated metadata.

2. The data modeling and storage method for line-network big data of a rail transit command center according to claim 1, characterized in that, in the conceptual modeling, conceptual modeling is carried out according to the applications of the rail transit line-network big data and the actual condition of the data, associating the subject elements of rail transit with one another; the associated subject elements are as follows: application-subsystem, line network-line-station, passenger information-passenger flow-satisfaction, signal-train operation-train maintenance-driver, equipment-point-equipment running status record, linkage-sequence-alarm-event.

3. The data modeling and storage method for line-network big data of a rail transit command center according to claim 1, characterized in that logical modeling is carried out for each data area of the data warehouse, comprising the following steps:

Step 2.2.1: Temporary data area logical modeling;

Step 2.2.2: Historical data area logical modeling;

Step 2.2.3: Data mart logical modeling.

4. The data modeling and storage method for line-network big data of a rail transit command center according to claim 3, characterized in that, in the temporary data area logical modeling, the data of each line sent up by the different line integrators is stored in the temporary data area in HBase in units of less than 1M, and is retained for no more than 1 month.

5. The data modeling and storage method for line-network big data of a rail transit command center according to claim 3, characterized in that the historical data area logical modeling comprises the following steps:

Step 2.2.2.1: The time series data in the line-network big data falls into two classes: time series data of equipment point changes, and time series data of passenger flow;

Step 2.2.2.2: For the time series data of equipment point changes, the HBase index RowKey is composed of the network-wide unique index of the input data point and the change time of that data point; the network-wide unique index is modeled with a character string as the key; the index is organized as shown in Table 1:

Line | Station | Application | Equipment | Point type | Point | Change time

Table 1;

Card number | Entry/exit time

Table 2.

6. The data modeling and storage method for line-network big data of a rail transit command center according to claim 3, characterized in that the data mart logical modeling comprises the following steps:

Step 2.2.3.1: The data mart is designed according to the specific data applications of the rail transit command center, and the data mart tables are stored in Hive; the data model of the Hive tables is designed according to the characteristics of Hadoop itself, using the dimensional analysis method, with the data denormalized; by aggregation, different attributes are gathered into one entity for storage; that is, wide tables with many attribute columns are used, and attributes related in content and function are kept in the same table;

Step 2.2.3.2: For ISCS data, the table-building approach used within a line, centered on the points that characterize device attributes, is changed to a device-centered table-building approach for the line network; that is, for each device, all output characterization points required by the applications and the attributes of the device are concentrated in one table; the device-centered table in the line network contains n output characterization points, as shown in Table 3:

Table 3;

Step 2.2.3.3: For the annual passenger flow analysis table: the table is a data cube with three dimensions, namely month, station, and passenger flow attribute information; the three dimensions are combined into Table 4:

Table 4;

Step 2.2.3.4: For train operation data, the actual and planned train working diagrams are expanded, and train status information, load factor, running time, and dwell time information are combined in Table 5:

Table 5.

7. The data modeling and storage method for line-network big data of a rail transit command center according to claim 1, characterized in that the physical modeling comprises the following steps:

Step 2.3.1: Input data imported into the Hadoop platform is organized as small files of less than 64M, in Avro format, and stored in the temporary data area in HBase;

Step 2.3.2: For historical data area storage in HBase, multiple Regions are set up, and data is stored into the corresponding Region according to the RowKey design; the RowKey contains the change time, which is random over the long term, and this ensures random placement of the data across the Regions, i.e., the data can be distributed evenly to each data node;

Step 2.3.3: In the Hive partition design, partitioning by month is used, i.e., 12 partitions per year; the tables are reviewed: int replaces the string type, timestamp replaces the date type, and the Decimal type replaces the Float or Double type.

8. The data modeling and storage method for line-network big data of a rail transit command center according to claim 1, characterized in that step 3 comprises the following step:

Step 3.1: Because the number of files is large, the storage partition directories are organized by the source application of the files, divided into five categories: historical archives, emergency command, system logs, reports, and web-crawled content, as shown in Table 6; the files are then classified by specific application; finally, the latter four categories are archived with every half year as one period; files are compressed with Snappy when stored, to support further retrieval and content analysis of the unstructured data;

Table 6.