A real-time storage method for unstructured rail transit stream data
Technical field
The invention belongs to the field of rail transit, and more particularly relates to a real-time storage method for unstructured rail transit stream data.
Background art
Urban rail transit is currently being built on a large scale in China. Many subway, light rail, and tram lines are already in operation or under construction. Construction periods are short and schedules are tight, the number of stations on new lines keeps growing, and the volume of data is becoming enormous. Rail transit involves many disciplines, including transport, construction, vehicles, electromechanical systems, power supply, communications, signalling, and environmental control, and each discipline collects data numbering in the millions every day by manual and device-based means. These data are massive and multi-source heterogeneous, are generated and propagated very quickly, and contain a great deal of useful information. In-depth mining and analysis of these large-scale heterogeneous data can extract the valuable information in them, which improves the operation of rail transit, supports scientific decision making, raises efficiency, reduces cost, and strengthens information services and safety assurance. The data include large amounts of structured and unstructured stream data, such as images and signals sampled continuously by sensors or computer equipment. How to store and query large-scale heterogeneous rail transit stream data quickly and efficiently is an active research topic in the industry.
Most existing rail transit data storage systems are built on traditional relational databases. Their read and write latency is high, so they cannot meet the storage speed that rail transit stream data requires, and their horizontal scalability is poor, so they cannot keep up with the rapidly growing capacity demand of that data. The few data warehouse deployments that exist all use an MPP DB (massively parallel processing database) architecture for storage. An MPP DB has some performance advantage on complex multi-dimensional queries, but it scales poorly, offers limited concurrency, cannot store unstructured data, and cannot perform stream processing. These drawbacks mean that MPP DB is not a good solution for a rail transit stream data storage system.
Hadoop is a mainstream distributed system infrastructure. Compared with MPP DB, the HDFS it provides has a flexible structure, is easy to scale out, can store unstructured data, holds larger data volumes, and supports high concurrency and real-time processing. However, storage built directly on HDFS has no index and uses large data blocks, so operations such as exact queries and combined queries are inefficient. HDFS therefore cannot be used directly. Instead, suitable Hadoop components must be selected and a new storage scheme designed around the characteristics of heterogeneous rail transit stream data.
Summary of the invention
To overcome the shortcomings of existing storage schemes for unstructured rail transit stream data, the present invention provides a real-time storage method for unstructured rail transit stream data that addresses the following problems:
1. Conventional rail transit storage systems based on relational databases and MPP DB have high read and write latency and cannot meet the storage speed that stream data requires. The invention provides fast, real-time storage for large volumes of rail transit stream data.
2. Conventional rail transit storage systems based on relational databases and MPP DB scale out poorly, whereas rail transit stream data grows quickly and demands large capacity that such systems cannot provide. The invention addresses the capacity bottleneck that conventional storage schemes face as data volume grows rapidly.
3. Conventional rail transit storage systems based on relational databases and MPP DB cannot store unstructured stream data. The invention enables unstructured rail transit stream data to be stored quickly and in real time.
4. Rail transit data storage schemes based on HDFS cannot meet the demand for efficient, fast queries in rail transit applications. The invention addresses the low query efficiency of HDFS-based rail transit data storage schemes.
The object of the invention is achieved by the following technical solution: a real-time storage method for unstructured rail transit stream data, comprising the following steps:
Step 1: Collect the unstructured rail transit stream data into a rail transit big data processing platform. The platform is based on Hadoop and its component, the HBase distributed database.
Step 2: Construct an efficient retrieval scheme for the collected multi-source unstructured stream data.
Step 2.1: Station stream data retrieval scheme, comprising the following sub-steps:
Step 2.1.1: For station stream data, use its network-wide unique index as the HBase RowKey. The RowKey combined with a timestamp locates a specific piece of data.
Step 2.1.2: Pre-partition the station stream data: before any data is written to HBase, create N empty regions in advance.
Step 2.1.3: Optimize the index structure of the station stream data according to the pre-partitioning strategy. Apply a hash operation to the original index to obtain a prefix, Prefix:
Prefix = HashCode(RowKey) % N
Then update the RowKey with the prefix Prefix.
Step 2.2: Equipment stream data retrieval scheme, comprising the following sub-steps:
Step 2.2.1: For equipment stream data, use its network-wide unique index as the HBase RowKey. The RowKey combined with a timestamp locates a specific piece of data.
Step 2.2.2: Pre-partition the equipment stream data: before any data is written to HBase, create M empty regions in advance.
Step 2.2.3: Optimize the index structure of the equipment stream data according to the pre-partitioning strategy. Apply a hash operation to the original index to obtain a prefix, Prefix:
Prefix = HashCode(RowKey) % M
Then update the RowKey with the prefix Prefix.
Step 3: Feed the collected multi-source unstructured stream data into a data buffer.
Step 4: When the amount of data buffered in a queue from step 3 reaches a threshold, call HBase's multithreaded write method or the flushCommits() method to write the data to HBase.
Further, the unstructured rail transit stream data includes image files, audio files, and video files.
Further, in step 2.1.1, the RowKey structure is designed as follows:
Station | Line | Device type | Device ID | Subsystem | Data type
Further, in step 2.1.3, the RowKey is updated to the following form:
Prefix | Station | Line | Device type | Device ID | Subsystem | Data type
Further, in step 2.2.1, the RowKey structure is designed as follows:
Device type | Device ID | Line | Subsystem | Data type
Further, in steps 2.1.1 and 2.2.1, the combined index of RowKey and timestamp lets a user locate a specific piece of data exactly and also query data by time range, which improves query efficiency.
Further, the number M of empty regions created in advance in step 2.2.2 is greater than the number N of empty regions created in advance in step 2.1.2.
Further, in step 2.2.3, the RowKey is updated to the following form:
Prefix | Device type | Device ID | Line | Subsystem | Data type
Further, step 3 specifically comprises: constructing multiple buffer queues for each data type; hashing the timestamp of each piece of data and taking the remainder modulo M to obtain a buffer queue number, QueueID, where M here denotes the number of buffer queues for the data type of that data; and placing the data into the corresponding buffer queue according to QueueID:
QueueID = HashCode(timestamp) % M
Further, in step 3, three buffer queues are constructed for each data type.
The beneficial effects of the present invention are:
1. The invention proposes an HBase-based storage scheme for unstructured rail transit stream data that meets the requirements of rail transit stream data for unstructured data storage, storage speed, and storage capacity.
2. The invention proposes an efficient retrieval scheme for unstructured rail transit stream data (steps 2.1.1 and 2.2.1). Designing the HBase RowKey according to this scheme speeds up data queries.
3. The invention proposes a multi-source buffer for multi-source unstructured stream data, together with a corresponding scheme for optimizing the HBase index (steps 2.1.3 and 2.2.3), which improves stream data write efficiency.
Brief description of the drawings
Fig. 1 is an overall structure diagram;
Fig. 2 is a structure diagram of the multi-source buffer.
Detailed description of the embodiments
The following detailed description refers to the accompanying drawings, which form a part of the invention and show, by way of illustration, specific embodiments in which the invention may be practised. It should be understood that other embodiments may be used, and structural or logical changes may be made, without departing from the scope of the invention. For example, features shown or described for one embodiment may be used in, or combined with, other embodiments to produce a further embodiment. The invention is intended to include such modifications and variations. The embodiments are described in specific language, which should not be construed as limiting the scope of the appended claims. The drawings are not drawn to scale and are for illustration only.
The invention proposes a real-time storage method for unstructured rail transit stream data. The method stores stream data in the HBase distributed database, in the following steps:
Step 1: Collect the unstructured rail transit stream data into the rail transit big data processing platform, which is based on Hadoop and its component, the HBase distributed database. The data include image files (Image, for example pictures posted by passengers), audio files (Audio, for example sound files produced by recording equipment), and video files (Video, for example surveillance video and 3D animation).
Step 2: Construct an efficient retrieval scheme for the collected multi-source unstructured stream data, as detailed below.
Step 2.1: Station stream data retrieval scheme. Station stream data is stream data associated with a station, such as station surveillance video. The detailed steps are as follows:
Step 2.1.1: For station stream data, use its network-wide unique index as the HBase RowKey. The RowKey combined with a timestamp locates a specific piece of data. The RowKey structure is designed as follows:
Station | Line | Device type | Device ID | Subsystem | Data type
The RowKey and the timestamp together form a combined index. With this combined index, a user can locate a specific piece of data exactly and can also query data by time range, which improves query efficiency.
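As an illustration of the combined index, the sketch below joins the six station fields into a RowKey and keeps timestamped values under it, so that both an exact lookup and a time-range lookup are possible. It is a minimal in-memory sketch, not the HBase API: the `|` separator, the class `StationRowKey`, the helper names, and the idea of keying each value by a millisecond timestamp are all assumptions made for illustration.

```python
from dataclasses import dataclass, astuple

SEP = "|"  # assumed field separator; the text does not specify one


@dataclass(frozen=True)
class StationRowKey:
    """The six RowKey fields for station stream data, in the order given in the text."""
    station: str
    line: str
    device_type: str
    device_id: str
    subsystem: str
    data_type: str

    def encode(self) -> str:
        # Join the fields in declaration order into one string key.
        return SEP.join(astuple(self))


# In-memory stand-in for a table: row key -> {timestamp_ms: payload}.
table = {}


def put(key, ts_ms, payload):
    """Store one payload under (row key, timestamp)."""
    table.setdefault(key.encode(), {})[ts_ms] = payload


def get_exact(key, ts_ms):
    """Exact lookup: row key plus timestamp identifies one payload."""
    return table.get(key.encode(), {}).get(ts_ms)


def get_range(key, start_ms, end_ms):
    """Time-range lookup: payloads with start_ms <= timestamp < end_ms, oldest first."""
    versions = table.get(key.encode(), {})
    return [versions[t] for t in sorted(versions) if start_ms <= t < end_ms]
```

Because every record of one device shares the same RowKey, a time-range lookup only has to examine the timestamps stored under that key rather than the whole table.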
Step 2.1.2: Pre-partition the station stream data: before any data is written to HBase, create N empty regions in advance (preferably 16). By default, HBase creates a single region when a table is created, and all HBase clients write to that one region until it grows large enough to be split. This raises the load on a single node and lowers write efficiency. Pre-partitioning is proposed to solve this problem.
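One way to picture the pre-partitioning is to compute the boundaries between the N empty regions ahead of time. The sketch below assumes (this is not stated in the source) that row keys begin with a zero-padded decimal prefix in the range 0 to N-1, and derives the N-1 split keys that separate the regions. `split_keys` and `region_for` are hypothetical helper names. In the HBase Java client, boundaries of this kind can be supplied as the split-key array when a table is created.

```python
import bisect
from typing import List


def split_keys(n_regions: int, width: int = 2) -> List[bytes]:
    """Return the n_regions - 1 boundaries between pre-created regions.

    Assumes row keys start with a zero-padded decimal prefix 0..n_regions-1.
    """
    if n_regions < 1:
        raise ValueError("n_regions must be at least 1")
    return [f"{i:0{width}d}".encode("ascii") for i in range(1, n_regions)]


def region_for(row_key: bytes, splits: List[bytes]) -> int:
    """Index of the region whose [start, end) key range contains row_key.

    Keys are compared as byte strings, which is how HBase orders rows.
    """
    return bisect.bisect_right(splits, row_key)
```

With N = 16 this yields fifteen boundaries, `01` through `15`, so a key beginning with `07` falls in the eighth region from the first write onward instead of waiting for a split.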
Step 2.1.3: Optimize the index structure of the station stream data according to the pre-partitioning strategy. Apply a hash operation to the original index to obtain a prefix, Prefix:
Prefix = HashCode(RowKey) % N
where RowKey is as described in step 2.1.1 and N is the number of regions created in advance in step 2.1.2. The RowKey is then updated to the following form:
Prefix | Station | Line | Device type | Device ID | Subsystem | Data type
The steps above eliminate the HBase hot spot (Hot Spot) problem. The hot spot problem arises when stream data is inserted into a single region in time order, and data goes to other regions only after that region reaches a certain threshold, which greatly reduces write efficiency. Introducing a prefix through the hash-and-modulo operation spreads consecutive stream data more evenly across the pre-partitioned regions and so improves write efficiency.
Step 2.2: Equipment stream data retrieval scheme. Equipment stream data is data associated with equipment and trains, such as train surveillance video and equipment recordings. The detailed steps, which are similar to those of step 2.1, are as follows:
Step 2.2.1: For equipment stream data, use its network-wide unique index as the HBase RowKey. The RowKey combined with a timestamp locates a specific piece of data. The RowKey structure is designed as follows:
Device type | Device ID | Line | Subsystem | Data type
The RowKey and the timestamp together form a combined index. With this combined index, a user can locate a specific piece of data exactly and can also query data by time range, which improves query efficiency.
Step 2.2.2: Pre-partition the equipment stream data: before any data is written to HBase, create M empty regions in advance (preferably 32). Equipment stream data is given more pre-created regions than station stream data because there are more kinds of equipment and the data volume is larger.
Step 2.2.3: Optimize the index structure of the equipment stream data according to the pre-partitioning strategy. Apply a hash operation to the original index to obtain a prefix, Prefix:
Prefix = HashCode(RowKey) % M
where RowKey is the RowKey described in step 2.2.1 and M is the number of regions created in advance in step 2.2.2. The RowKey is then updated to the following form:
Prefix | Device type | Device ID | Line | Subsystem | Data type
Step 3: Feed the collected multi-source unstructured stream data into data buffers, so that different types of stream data are processed in real time and batch write speed improves. The detailed steps are as follows:
Construct multiple buffer queues for each data type (preferably three per data type). Hash the timestamp of each piece of data and take the remainder modulo M to obtain the buffer queue number QueueID, where M here is the number of buffer queues for the data type of that data, not the region count M of step 2.2.2. Place the data into the corresponding buffer queue according to QueueID:
QueueID = HashCode(timestamp) % M
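The queue selection rule can be sketched in a few lines. The sketch assumes three data types (image, audio, video) with three queues each, as preferred in the text. The CRC-32 hash, the millisecond timestamps, and the names `buffers`, `queue_id`, and `enqueue` are illustrative assumptions rather than part of the described method.

```python
import zlib
from collections import deque

QUEUES_PER_TYPE = 3  # preferred number of buffer queues per data type
DATA_TYPES = ("image", "audio", "video")

# One group of buffer queues per data type.
buffers = {t: [deque() for _ in range(QUEUES_PER_TYPE)] for t in DATA_TYPES}


def queue_id(timestamp_ms: int, m: int = QUEUES_PER_TYPE) -> int:
    """QueueID = HashCode(timestamp) % M, with CRC-32 as an assumed stand-in hash."""
    return zlib.crc32(str(timestamp_ms).encode("ascii")) % m


def enqueue(data_type: str, timestamp_ms: int, payload: bytes) -> int:
    """Place a record in the queue selected by its timestamp and return the QueueID."""
    qid = queue_id(timestamp_ms)
    buffers[data_type][qid].append((timestamp_ms, payload))
    return qid
```

Hashing the timestamp rather than using it directly means that records arriving in close succession are not all appended to the same queue, so several queues can fill and be written out in parallel.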
Step 4: When the amount of data buffered in a queue from step 3 reaches a threshold, call HBase's multithreaded write method or the flushCommits() method to write the data to HBase.
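The threshold-triggered write of step 4 might look like the following sketch. It does not call HBase: `write_batch` is a hypothetical callable standing in for whatever batch write the platform uses, and the class name, the threshold value, and the use of a thread pool to model the multithreaded write are assumptions made for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, wait


class ThresholdBuffer:
    """Collects records and hands them to a writer in batches of `threshold`."""

    def __init__(self, threshold, write_batch, pool):
        self._threshold = threshold
        self._write_batch = write_batch  # hypothetical stand-in for the HBase write
        self._pool = pool
        self._items = []
        self._pending = []  # futures for batches handed to the pool
        self._lock = threading.Lock()

    def add(self, record):
        """Buffer one record; submit a batch write once the threshold is reached."""
        with self._lock:
            self._items.append(record)
            if len(self._items) >= self._threshold:
                self._submit_locked()

    def flush(self):
        """Write out any remainder and wait for all submitted batches to finish."""
        with self._lock:
            if self._items:
                self._submit_locked()
            pending = list(self._pending)
        wait(pending)

    def _submit_locked(self):
        # Caller holds the lock: swap out the full list and write it on a worker thread.
        batch, self._items = self._items, []
        self._pending.append(self._pool.submit(self._write_batch, batch))
```

Writing in batches on worker threads keeps the thread that receives stream data from blocking on storage, which is the reason the text pairs buffering with a multithreaded or flushed write.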
The above is only a preferred embodiment of the invention and is not intended to limit it. Those skilled in the art may make various modifications and variations to the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.