CN109947896A

CN109947896A - Real-time storage method for unstructured rail transit stream data

Info

Publication number: CN109947896A
Application number: CN201910181493.6A
Authority: CN
Inventors: 黄滔; 王刚; 高杨; 刘国庆
Original assignee: Zhejiang Bang Sheng Technology Co Ltd
Current assignee: Zhejiang Bang Sheng Technology Co Ltd; CRRC Tangshan Co Ltd
Priority date: 2019-03-11
Filing date: 2019-03-11
Publication date: 2019-06-28

Abstract

The invention discloses a real-time storage method for unstructured rail transit stream data. Unstructured rail transit stream data is collected into a rail transit big data processing platform built on Hadoop and its component, the HBase distributed database. An efficient retrieval scheme is constructed for the collected multi-source unstructured stream data. The collected multi-source unstructured stream data is fed into a data buffer. Once the amount of data buffered in a queue reaches a threshold, an HBase multithreaded write method or the flushCommits() method is called to write the data into HBase. The invention proposes an HBase-based storage scheme for unstructured rail transit stream data that meets the requirement for storing unstructured data as well as the storage speed and storage capacity requirements of rail transit stream data. The invention proposes an efficient retrieval scheme for unstructured rail transit stream data; designing the HBase RowKey according to this scheme speeds up data queries. The invention also proposes a multi-source buffer for multi-source unstructured stream data, together with a corresponding scheme for optimizing the HBase index, which improves stream data write efficiency.

Description

Real-time storage method for unstructured rail transit stream data

Technical field

The invention belongs to the field of rail transit, and in particular relates to a real-time storage method for unstructured rail transit stream data.

Background technique

Urban rail transit construction is currently under way on a large scale in China. Many subway, light rail, and tram lines are already in operation or under construction. Construction periods are short and schedules are tight, the number of stations on new lines keeps growing, and the scale of data is becoming ever larger. The rail transit field involves multiple disciplines, such as transport, civil construction, vehicles, electromechanical systems, power supply, communications, signaling, and environmental control, and every day each discipline generates data measured in the millions through manual entry, equipment, and other collection methods. These data are massive, multi-source, and heterogeneous, are generated and propagated extremely quickly, and contain a great deal of useful information. In-depth data mining and analysis of these large-scale heterogeneous data to extract the valuable information they contain can raise the operating level of rail transit, strengthen scientific decision-making, improve efficiency, reduce costs, and improve information services and safety assurance. These data include large amounts of structured and unstructured stream data, such as image data and signals sampled continuously by various sensors or computer equipment. How to store and query large-scale heterogeneous rail transit stream data quickly and efficiently is a research hotspot in the industry.

Most storage schemes used by existing rail transit data storage systems are built on traditional relational databases. Their read/write latency is high, so they cannot meet the storage speed requirements of rail transit stream data, and their horizontal scalability is poor, so they cannot keep up with the capacity requirements of rapidly growing rail transit stream data. In addition, the few existing data warehouse deployments all use an MPP DB architecture for data storage. MPP DB has certain performance advantages for complex multidimensional queries, but it scales poorly, offers insufficient concurrency, cannot store unstructured data, and cannot perform stream processing. These shortcomings mean that MPP DB is not a good solution for a rail transit stream data storage system.

Hadoop is a mainstream distributed system infrastructure. Compared with MPP DB, the HDFS it provides has a flexible structure, is easy to scale, can store unstructured data, holds larger data volumes, and supports high concurrency and real-time processing. However, storage built directly on HDFS has no index and uses large data blocks, so operations such as exact queries and combined queries are inefficient. HDFS therefore cannot be used directly. Instead, a suitable Hadoop component must be selected and a new storage scheme designed around the characteristics of heterogeneous rail transit stream data.

Summary of the invention

To overcome the shortcomings of existing technical schemes for storing unstructured rail transit stream data, the present invention provides a real-time storage method for unstructured rail transit stream data that solves the following problems:

1. Conventional rail transit storage systems based on relational databases and MPP DB have high read/write latency and cannot meet the storage speed requirements of stream data. The present invention solves the problem of fast, real-time storage of large volumes of rail transit stream data.

2. Conventional rail transit storage systems based on relational databases and MPP DB scale poorly horizontally, whereas rail transit stream data grows quickly and demands large capacity, so conventional storage systems cannot meet the storage capacity requirements. The present invention solves the capacity bottleneck that conventional storage schemes face as data scale grows rapidly.

3. Conventional rail transit storage systems based on relational databases and MPP DB cannot store unstructured stream data. The present invention solves the problem that conventional storage schemes cannot store unstructured rail transit stream data quickly and in real time.

4. Rail transit data storage schemes based on HDFS cannot meet the efficient, fast query requirements of rail transit applications. The present invention solves the low query efficiency of HDFS-based rail transit data storage schemes.

The object of the present invention is achieved through the following technical solution: a real-time storage method for unstructured rail transit stream data, comprising the following steps:

Step 1: Collect unstructured rail transit stream data into a rail transit big data processing platform built on Hadoop and its component, the HBase distributed database.

Step 2: Construct an efficient retrieval scheme for the collected multi-source unstructured stream data.

Step 2.1: Station stream data retrieval scheme, comprising the following sub-steps:

Step 2.1.1: For station stream data, use its network-wide unique index as the HBase RowKey. The combination of RowKey and timestamp locates a specific piece of data.

Step 2.1.2: Pre-partition the station stream data: before data is written to HBase, create N empty regions in advance.

Step 2.1.3: Optimize the station stream data index structure according to the pre-partitioning strategy: apply a hash operation to the original index to obtain a prefix, Prefix:

Prefix = HashCode(RowKey) % N

The RowKey is then updated with the prefix Prefix.

Step 2.2: Device stream data retrieval scheme, comprising the following sub-steps:

Step 2.2.1: For device stream data, use its network-wide unique index as the HBase RowKey. The combination of RowKey and timestamp locates a specific piece of data.

Step 2.2.2: Pre-partition the device stream data: before data is written to HBase, create M empty regions in advance.

Step 2.2.3: Optimize the device stream data index structure according to the pre-partitioning strategy: apply a hash operation to the original index to obtain a prefix, Prefix:

Prefix = HashCode(RowKey) % M

The RowKey is then updated with the prefix Prefix.

Step 3: Feed the collected multi-source unstructured stream data into a data buffer.

Step 4: Once the amount of data buffered in a queue from step 3 reaches a threshold, call an HBase multithreaded write method or the flushCommits() method to write the data into HBase.
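The threshold-triggered write in step 4 can be illustrated with a minimal sketch. It does not use the HBase client: `write_batch` is a hypothetical stand-in for whatever batch write the platform performs, and the class name `ThresholdBuffer`, the threshold of 3, and the worker count are assumptions chosen only for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ThresholdBuffer:
    """Collects records and hands them to a batch writer once a threshold is reached."""

    def __init__(self, threshold, write_batch, workers=4):
        self._threshold = threshold
        self._write_batch = write_batch  # hypothetical stand-in for an HBase batch write
        self._pending = []
        self._futures = []
        self._lock = threading.Lock()
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def _submit_pending(self):
        # Caller must hold the lock. Swap the list so a writer never sees later appends.
        batch, self._pending = self._pending, []
        if batch:
            self._futures.append(self._pool.submit(self._write_batch, batch))

    def add(self, record):
        with self._lock:
            self._pending.append(record)
            if len(self._pending) >= self._threshold:
                self._submit_pending()

    def close(self):
        # Flush whatever is left, then wait for all writer threads to finish.
        with self._lock:
            self._submit_pending()
        for future in self._futures:
            future.result()  # re-raises any error raised by the writer
        self._pool.shutdown()


written = []
written_lock = threading.Lock()


def fake_write(batch):
    # Records what a real sink would have persisted.
    with written_lock:
        written.append(list(batch))


buffer = ThresholdBuffer(threshold=3, write_batch=fake_write)
for record in range(7):
    buffer.add(record)
buffer.close()
```

Seven records with a threshold of 3 produce two full batches plus one final partial batch on `close()`. Batches may complete in any order because they run on separate worker threads.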

Further, the unstructured rail transit stream data includes image files, audio files, and video files.

Further, in step 2.1.1, the RowKey is structured as follows:

Station	Line	Device type	Device ID	Subsystem	Data type

Further, in step 2.1.3, the RowKey is updated to the following form:

Prefix	Station	Line	Device type	Device ID	Subsystem	Data type

Further, in step 2.2.1, the RowKey is structured as follows:

Device type	Device ID	Line	Subsystem	Data type

Further, in steps 2.1.1 and 2.2.1, the combined index of RowKey and timestamp lets the user locate a specific piece of data precisely and also query data by time range, which improves query efficiency.

Further, the number M of empty regions pre-created in step 2.2.2 is greater than the number N of empty regions pre-created in step 2.1.2.

Further, in step 2.2.3, the RowKey is updated to the following form:

Prefix	Device type	Device ID	Line	Subsystem	Data type

Further, step 3 is specifically as follows: multiple buffer queues are constructed for each data type. The timestamp of a piece of data is hashed and the result taken modulo M to obtain the buffer queue number QueueID, where M is the number of buffer queues for that data's type. The data is then placed into the corresponding buffer queue according to QueueID.

QueueID = HashCode(timestamp) % M

Further, in step 3, three buffer queues are constructed for each data type.

The beneficial effects of the present invention are as follows:

1. The present invention proposes an HBase-based storage scheme for unstructured rail transit stream data that meets the requirement for storing unstructured data as well as the storage speed and storage capacity requirements of rail transit stream data.

2. The present invention proposes an efficient retrieval scheme for unstructured rail transit stream data (step 2.1.1, step 2.2.1). Designing the HBase RowKey according to this scheme speeds up data queries.

3. The present invention proposes a multi-source buffer for multi-source unstructured stream data, together with a corresponding scheme for optimizing the HBase index (step 2.1.3, step 2.2.3), which improves stream data write efficiency.

Brief description of the drawings

Fig. 1 is the overall structure diagram.

Fig. 2 is the structure diagram of the multi-source buffer.

Specific embodiment

In the following detailed description, reference is made to the accompanying drawings, which form a part of the invention and which show, by way of illustration, specific embodiments in which the invention may be practiced. It should be understood that other embodiments may be used and structural or logical changes may be made without departing from the scope of the invention. For example, features illustrated or described for one embodiment can be used in or combined with other embodiments to yield a further embodiment. The invention is intended to include such modifications and variations. The embodiments are described using specific language, which should not be construed as limiting the scope of the appended claims. The drawings are not drawn to scale and are for illustration only.

The real-time storage method for unstructured rail transit stream data proposed by the present invention stores stream data in the HBase distributed database. The specific steps are as follows:

Step 1: Unstructured rail transit stream data, including image files (Image, such as pictures posted by passengers), audio files (Audio, such as sound files produced by recording equipment), and video files (Video, such as surveillance video and 3D animation), is collected into the rail transit big data processing platform, which is built on Hadoop and its component, the HBase distributed database.

Step 2: Construct an efficient retrieval scheme for the collected multi-source unstructured stream data. The steps are described in detail below.

Step 2.1: Station stream data retrieval scheme. Station stream data comprises stream data related to stations, such as station surveillance video. The detailed steps are as follows:

Step 2.1.1: For station stream data, use its network-wide unique index as the HBase RowKey. The combination of RowKey and timestamp locates a specific piece of data. The RowKey is structured as follows:

Station	Line	Device type	Device ID	Subsystem	Data type

Combining the RowKey with the timestamp forms a composite index. With this combined index, a user can locate a specific piece of data precisely and can also query data by time range, which improves query efficiency.
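To make the two access patterns concrete (exact lookup by RowKey plus timestamp, and a scan over a time range), here is a small in-memory model. It is only a sketch of the indexing idea, not HBase code: the class `TimeIndexedStore`, the `|` field separator, and the sample field values are all hypothetical.

```python
import bisect
from collections import defaultdict


class TimeIndexedStore:
    """In-memory model of a (RowKey, timestamp) composite index."""

    def __init__(self):
        self._times = defaultdict(list)   # rowkey -> sorted timestamps
        self._values = defaultdict(list)  # rowkey -> values, parallel to _times

    def put(self, rowkey, timestamp, value):
        # Keep timestamps sorted so lookups and range scans can use binary search.
        i = bisect.bisect_right(self._times[rowkey], timestamp)
        self._times[rowkey].insert(i, timestamp)
        self._values[rowkey].insert(i, value)

    def get(self, rowkey, timestamp):
        """Exact lookup: RowKey plus timestamp identifies one piece of data."""
        times = self._times.get(rowkey, [])
        i = bisect.bisect_left(times, timestamp)
        if i < len(times) and times[i] == timestamp:
            return self._values[rowkey][i]
        return None

    def scan(self, rowkey, start, end):
        """Time-range query over [start, end) for one RowKey."""
        times = self._times.get(rowkey, [])
        lo = bisect.bisect_left(times, start)
        hi = bisect.bisect_left(times, end)
        return list(zip(times[lo:hi], self._values[rowkey][lo:hi]))


# Hypothetical station RowKey: station, line, device type, device ID, subsystem, data type.
key = "S012|L2|CAM|D0457|CCTV|VIDEO"
store = TimeIndexedStore()
store.put(key, 100, "clip-a")
store.put(key, 300, "clip-c")
store.put(key, 200, "clip-b")

exact = store.get(key, 200)
window = store.scan(key, 100, 300)
```

Because entries for one RowKey are ordered by timestamp, both the exact lookup and the range scan reduce to binary searches rather than full scans.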

Step 2.1.2: Pre-partition the station stream data: before data is written to HBase, create N empty regions in advance (preferably 16). (By default, when an HBase table is created, HBase automatically creates a single region. When data is written, all HBase clients write to this one region, which is split only once it has grown large enough. This raises the load on a single node and lowers write efficiency. Pre-partitioning is proposed to solve this problem.)
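One way to picture pre-partitioning is as a list of boundary keys that divide the key space into N ranges before any data arrives. The sketch below assumes that each RowKey starts with a zero-padded decimal prefix; the source does not specify the prefix encoding, so that layout, the `|` separator, and the helper names `split_keys` and `region_for` are assumptions.

```python
import bisect
from typing import List


def split_keys(regions: int) -> List[str]:
    """Boundary keys for `regions` pre-created regions (regions - 1 boundaries)."""
    width = len(str(regions - 1))  # zero-pad so string order matches numeric order
    return [f"{i:0{width}d}" for i in range(1, regions)]


def region_for(rowkey: str, boundaries: List[str]) -> int:
    """Index of the range [start, end) that contains `rowkey`."""
    return bisect.bisect_right(boundaries, rowkey)


boundaries = split_keys(16)  # 16 is the preferred value given for station stream data
first = region_for("00|S012|L2|CAM|D0457|CCTV|VIDEO", boundaries)
middle = region_for("07|S012|L2|CAM|D0457|CCTV|VIDEO", boundaries)
last = region_for("15|S012|L2|CAM|D0457|CCTV|VIDEO", boundaries)
```

With 15 boundaries from `01` to `15`, every two-digit prefix from `00` to `15` maps to its own range, so writes can be spread over all 16 regions from the first record onward.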

Step 2.1.3: Optimize the station stream data index structure according to the pre-partitioning strategy: apply a hash operation to the original index to obtain a prefix, Prefix.

Prefix = HashCode(RowKey) % N

Here RowKey is as described in step 2.1.1, and N is the number of pre-split regions from step 2.1.2. The RowKey is then updated to the following form:

Prefix	Station	Line	Device type	Device ID	Subsystem	Data type

The steps above eliminate the HBase hotspot (Hot Spot) problem. The hotspot problem arises because stream data is inserted into a single region in chronological order, and writes move on to another region only after that region reaches a certain threshold, which greatly reduces write efficiency. Introducing a prefix through the hash-and-modulo operation distributes consecutive stream data more evenly across the pre-split regions and thereby improves write efficiency.
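The prefix computation can be sketched in a few lines. The source does not say which hash function HashCode denotes, so CRC32 is used here purely as a deterministic stand-in. The zero-padded prefix, the `|` separator, and the sample field values are likewise assumptions.

```python
import zlib

N_REGIONS = 16  # preferred number of pre-created regions for station stream data


def stable_hash(text: str) -> int:
    """Deterministic stand-in for HashCode(); the choice of CRC32 is an assumption."""
    return zlib.crc32(text.encode("utf-8"))


def salted_rowkey(rowkey: str, regions: int = N_REGIONS) -> str:
    """Prepend Prefix = HashCode(RowKey) % N to the original RowKey."""
    prefix = stable_hash(rowkey) % regions
    # Zero-pad to two digits so prefixes sort in numeric order as strings.
    return f"{prefix:02d}|{rowkey}"


# Hypothetical station RowKey: station, line, device type, device ID, subsystem, data type.
station_key = "|".join(["S012", "L2", "CAM", "D0457", "CCTV", "VIDEO"])
salted = salted_rowkey(station_key)
```

Because the prefix is derived from the RowKey itself, the same original key always receives the same prefix, so a reader who knows the original key can recompute the prefix and address the stored row directly.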

Step 2.2: Device stream data retrieval scheme. Device stream data comprises data related to all kinds of equipment and trains, such as train surveillance video and equipment recordings. The detailed steps are as follows (similar to step 2.1):

Step 2.2.1: For device stream data, use its network-wide unique index as the HBase RowKey. The combination of RowKey and timestamp locates a specific piece of data. The RowKey is structured as follows:

Device type	Device ID	Line	Subsystem	Data type

Combining the RowKey with the timestamp forms a composite index.

Step 2.2.2: Pre-partition the device stream data: before data is written to HBase, create M empty regions in advance (preferably 32; device stream data gets more pre-split regions than station stream data because there are more kinds of devices and the data volume is larger).

Step 2.2.3: Optimize the device stream data index structure according to the pre-partitioning strategy: apply a hash operation to the original index to obtain a prefix, Prefix.

Prefix = HashCode(RowKey) % M

Here RowKey is the RowKey described in step 2.2.1, and M is the number of pre-split regions from step 2.2.2. The RowKey is then updated to the following form:

Prefix	Device type	Device ID	Line	Subsystem	Data type
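A quick way to see what the prefix does for device data is to count how many synthetic device RowKeys land on each of the M prefixes. This is an illustrative sketch only: CRC32 stands in for the unspecified HashCode function, and the key format and device IDs are made up.

```python
import zlib
from collections import Counter

M_REGIONS = 32  # preferred number of pre-created regions for device stream data


def prefix_for(rowkey: str, regions: int = M_REGIONS) -> int:
    """Prefix = HashCode(RowKey) % M, with CRC32 as an assumed hash function."""
    return zlib.crc32(rowkey.encode("utf-8")) % regions


# Synthetic device RowKeys: device type, device ID, line, subsystem, data type.
device_keys = [f"CAM|D{i:05d}|L2|CCTV|VIDEO" for i in range(10_000)]

# Number of keys assigned to each prefix, i.e. to each pre-created region.
load = Counter(prefix_for(key) for key in device_keys)
busiest = max(load.values())
quietest = min(load.values())
```

Without the prefix, all of these keys share the leading text `CAM|D` and would sort next to each other. With the prefix, comparing `busiest` and `quietest` shows how evenly the chosen hash spreads them over the regions.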

Step 3: Feed the collected multi-source unstructured stream data into a data buffer, which enables real-time processing of the different types of stream data and speeds up batch writes. The detailed steps are as follows:

Multiple buffer queues are constructed for each data type (preferably three buffer queues per data type). The timestamp of a piece of data is hashed and the result taken modulo M to obtain the buffer queue number QueueID, where M is the number of buffer queues for that data's type. The data is then placed into the corresponding buffer queue according to QueueID.

QueueID = HashCode(timestamp) % M
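The queue assignment can be sketched as follows. As before, CRC32 is an assumed stand-in for HashCode, and the class name `MultiSourceBuffer`, the data type labels, and the millisecond timestamp are hypothetical.

```python
import zlib
from collections import deque

QUEUES_PER_TYPE = 3  # preferred number of buffer queues per data type


def queue_id(timestamp_ms: int, queue_count: int = QUEUES_PER_TYPE) -> int:
    """QueueID = HashCode(timestamp) % M, with CRC32 as an assumed hash function."""
    return zlib.crc32(str(timestamp_ms).encode("ascii")) % queue_count


class MultiSourceBuffer:
    """One group of buffer queues per data type."""

    def __init__(self, data_types, queue_count=QUEUES_PER_TYPE):
        self.queue_count = queue_count
        self.queues = {
            data_type: [deque() for _ in range(queue_count)]
            for data_type in data_types
        }

    def put(self, data_type, timestamp_ms, payload):
        # Pick the queue from the timestamp, then append the record to it.
        qid = queue_id(timestamp_ms, self.queue_count)
        self.queues[data_type][qid].append((timestamp_ms, payload))
        return qid


buffers = MultiSourceBuffer(["image", "audio", "video"])
chosen = buffers.put("video", 1552262400000, b"frame-bytes")
```

Each data type keeps its own group of queues, so a burst of video frames does not delay image or audio records, and hashing the timestamp spreads records of one type across that type's queues.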

The foregoing is only a preferred embodiment of the present invention and is not intended to limit the invention. For those skilled in the art, the invention may be modified and varied in various ways. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A real-time storage method for unstructured rail transit stream data, characterized in that the method comprises the following steps:

Step 2.1.1: For station stream data, use its network-wide unique index as the HBase RowKey. The combination of RowKey and timestamp locates a specific piece of data.

Step 2.1.2: Pre-partition the station stream data: before data is written to HBase, create N empty regions in advance.

Step 2.1.3: Optimize the station stream data index structure according to the pre-partitioning strategy: apply a hash operation to the original index to obtain a prefix Prefix, where Prefix = HashCode(RowKey) % N, and update the RowKey with the prefix Prefix.

Step 2.2.1: For device stream data, use its network-wide unique index as the HBase RowKey. The combination of RowKey and timestamp locates a specific piece of data.

Step 2.2.2: Pre-partition the device stream data: before data is written to HBase, create M empty regions in advance.

Step 2.2.3: Optimize the device stream data index structure according to the pre-partitioning strategy: apply a hash operation to the original index to obtain a prefix Prefix, where Prefix = HashCode(RowKey) % M, and update the RowKey with the prefix Prefix.

2. The real-time storage method for unstructured rail transit stream data according to claim 1, characterized in that, in step 1, the unstructured rail transit stream data includes image files, audio files, and video files.

3. The real-time storage method for unstructured rail transit stream data according to claim 1, characterized in that, in step 2.1.1, the RowKey is structured as follows:

Station	Line	Device type	Device ID	Subsystem	Data type

4. The real-time storage method for unstructured rail transit stream data according to claim 1, characterized in that, in step 2.1.3, the RowKey is updated to the following form:

Prefix	Station	Line	Device type	Device ID	Subsystem	Data type

5. The real-time storage method for unstructured rail transit stream data according to claim 1, characterized in that, in step 2.2.1, the RowKey is structured as follows:

Device type	Device ID	Line	Subsystem	Data type

6. The real-time storage method for unstructured rail transit stream data according to claim 1, characterized in that, in steps 2.1.1 and 2.2.1, the combined index of RowKey and timestamp lets the user locate a specific piece of data precisely and also query data by time range, which improves query efficiency.

7. The real-time storage method for unstructured rail transit stream data according to claim 1, characterized in that the number M of empty regions pre-created in step 2.2.2 is greater than the number N of empty regions pre-created in step 2.1.2.

8. The real-time storage method for unstructured rail transit stream data according to claim 1, characterized in that, in step 2.2.3, the RowKey is updated to the following form:

Prefix	Device type	Device ID	Line	Subsystem	Data type

9. The real-time storage method for unstructured rail transit stream data according to claim 1, characterized in that step 3 is specifically as follows: multiple buffer queues are constructed for each data type; the timestamp of a piece of data is hashed and the result taken modulo M to obtain the buffer queue number QueueID, where M is the number of buffer queues for that data's type; and the data is placed into the corresponding buffer queue according to QueueID.

QueueID = HashCode(timestamp) % M

10. The real-time storage method for unstructured rail transit stream data according to claim 9, characterized in that, in step 3, three buffer queues are constructed for each data type.